openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

ovs-vswitchd crashes when restarted many times. #239

Open renweichun opened 2 years ago

renweichun commented 2 years ago

Dear all,

We are using OVS 2.14.1 and DPDK 20.08 on CentOS Linux release 8.3.2011 with kernel 5.10.44 and glibc 2.28. When I restart ovs-vswitchd many times, ovs-vswitchd crashes. After some debugging I found two distinct OVS crashes in the DPDK context, both occurring while ovs-vswitchd is being restarted.

First stack trace:

#0  ofproto_dpif_credit_table_stats (ofproto=0x3629d20, table_id=0 '\000', n_matches=195, n_misses=0)
    at ofproto/ofproto-dpif.c:4350
#1  0x00000000010cbcac in xlate_push_stats_entry (entry=0x7fa270030588, stats=0x7fff1ed2cac0, offloaded=<optimized out>)
    at ofproto/ofproto-dpif-xlate-cache.c:99
#2  0x00000000010cbe7b in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fff1ed2cac0,
    offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181
#3  0x00000000010b8e27 in push_dp_ops (udpif=udpif@entry=0x36ace90, ops=ops@entry=0x7fff1ed2cfd0, n_ops=n_ops@entry=1)
    at ofproto/ofproto-dpif-upcall.c:2409
#4  0x00000000010b9c0e in push_dp_ops (n_ops=n_ops@entry=1, ops=0x7fff1ed2cfd0, ops@entry=0x7fff1ed2b670,
    udpif=udpif@entry=0x36ace90) at ofproto/ofproto-dpif-upcall.c:2441
#5  push_ukey_ops (udpif=udpif@entry=0x36ace90, umap=umap@entry=0x36b2288, ops=ops@entry=0x7fff1ed2cfd0,
    n_ops=n_ops@entry=1) at ofproto/ofproto-dpif-upcall.c:2441
#6  0x00000000010b9d8b in dp_purge_cb (aux=0x36ace90, pmd_id=25) at ofproto/ofproto-dpif-upcall.c:2870
#7  0x00000000010eb476 in dp_netdev_del_pmd (dp=dp@entry=0x362b110, pmd=pmd@entry=0x7fab57cd8010) at lib/dpif-netdev.c:6555
#8  0x00000000010edec7 in reconfigure_pmd_threads (dp=0x362b110) at lib/dpif-netdev.c:5175
#9  reconfigure_datapath (dp=dp@entry=0x362b110) at lib/dpif-netdev.c:5266
#10 0x00000000010eedbd in do_del_port (dp=0x362b110, port=0x37798f0) at lib/dpif-netdev.c:2287
#11 0x00000000010ef287 in dpif_netdev_port_del (dpif=<optimized out>, port_no=27) at lib/dpif-netdev.c:2182
#12 0x00000000010f935f in dpif_port_del (dpif=0x325b360, port_no=27, local_delete=local_delete@entry=false)
    at lib/dpif.c:631
#13 0x00000000010a71b2 in port_destruct (port=0x3781520, del=<optimized out>) at ofproto/ofproto-dpif.c:2147
#14 0x00000000010935bb in ofport_destroy (port=0x3781520, del=<optimized out>) at ofproto/ofproto.c:2615
#15 0x000000000109b7c0 in ofproto_destroy (p=0x371f120, del=<optimized out>) at ofproto/ofproto.c:1722
#16 0x0000000001085a0e in bridge_destroy (br=0x327bf80, del=del@entry=false) at vswitchd/bridge.c:3605
#17 0x000000000108a369 in bridge_exit (delete_datapath=<optimized out>) at vswitchd/bridge.c:552
#18 0x0000000000573e29 in main (argc=<optimized out>, argv=<optimized out>) at vswitchd/ovs-vswitchd.c:143

(gdb) info reg
rax            0x0                 0
rbx            0x7fa270030588      140335640675720
rcx            0x0                 0
rdx            0xc3                195
rsi            0x0                 0
rdi            0x3629d20           56794400
rbp            0x7fff1ed2cac0      0x7fff1ed2cac0
rsp            0x7fff1ed2ca48      0x7fff1ed2ca48
r8             0x0                 0
r9             0x3803e90           58736272
r10            0x0                 0
r11            0x6                 6
r12            0x7fff1ed2cac0      140733710518976
r13            0x70030650          1879246416
r14            0x7fff1ed2cab8      140733710518968
r15            0x7fff1ed2cac0      140733710518976
rip            0x10a8818           0x10a8818 <ofproto_dpif_credit_table_stats+56>
eflags         0x10206             [ PF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x0                 0
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0

(gdb) disassemble 0x10a8818
Dump of assembler code for function ofproto_dpif_credit_table_stats:
   0x00000000010a87e0 <+0>:     movzbl %sil,%esi
   0x00000000010a87e4 <+4>:     mov %rsi,%rax
   0x00000000010a87e7 <+7>:     shl $0x4,%rax
   0x00000000010a87eb <+11>:    sub %rsi,%rax
   0x00000000010a87ee <+14>:    shl $0x4,%rax
   0x00000000010a87f2 <+18>:    add 0x128(%rdi),%rax
   0x00000000010a87f9 <+25>:    test %rdx,%rdx
   0x00000000010a87fc <+28>:    jne 0x10a8818 <ofproto_dpif_credit_table_stats+56>
   0x00000000010a87fe <+30>:    test %rcx,%rcx
   0x00000000010a8801 <+33>:    jne 0x10a8808 <ofproto_dpif_credit_table_stats+40>
   0x00000000010a8803 <+35>:    retq
   0x00000000010a8804 <+36>:    nopl 0x0(%rax)
   0x00000000010a8808 <+40>:    lock add %rcx,0xe8(%rax)
   0x00000000010a8810 <+48>:    retq
   0x00000000010a8811 <+49>:    nopl 0x0(%rax)
=> 0x00000000010a8818 <+56>:    lock add %rdx,0xe0(%rax)
   0x00000000010a8820 <+64>:    jmp 0x10a87fe <ofproto_dpif_credit_table_stats+30>
End of assembler dump.

(gdb) p *ofproto $3 = {all_ofproto_dpifs_by_name_node = {hash = 0, next = 0x3741ba0}, all_ofproto_dpifs_by_uuid_node = {hash = 57992720, next = 0x1051}, up = {hmap_node = {hash = 1163936137340, next = 0x3f0000003f}, ofproto_class = 0xce2b216a, type = 0x0, name = 0x0, fallback_dpid = 0, datapath_id = 0, forward_bpdu = false, mfr_desc = 0x59fdad4700000006 <error: Cannot access memory at address 0x59fdad4700000006>, hw_desc = 0x430a8cdc868fc310 <error: Cannot access memory at address 0x430a8cdc868fc310>, sw_desc = 0x0, serial_desc = 0x7fa26009fd20 "", dp_desc = 0x7fa2600d23d0 "", frag_handling = 1611358112, ports = { buckets = 0x0, one = 0x0, mask = 369490328463343620, n = 2758902708}, port_by_name = {map = {buckets = 0x0, one = 0x7fa2600a4980, mask = 140335372699424, n = 0}}, ofp_requests = {map = {buckets = 0x0, one = 0x0, mask = 1020897070376026114, n = 0}}, alloc_port_no = 0, max_ports = 0, ofport_usage = {buckets = 0x7fa2600fce40, one = 0x0, mask = 0, n = 0}, change_seq = 0, eviction_group_timer = 0, tables = 0x0, n_tables = 0, tables_version = 0, cookies = {buckets = 0x0, one = 0x0, mask = 0, n_unique = 0}, learned_cookies = {buckets = 0xa09f639c00000004, one = 0x2b33f99c, mask = 0, n = 140335372891088}, expirable = {prev = 0x7fa260108cb0, next = 0x0}, meter_features = {max_meters = 0, band_types = 0, capabilities = 0, max_bands = 0 '\000', max_color = 0 '\000'}, meters = { buckets = 0x4b4c0f7a00000002, one = 0x0, mask = 0, n = 140335372886432}, slowpath_meter_id = 0, controller_meter_id = 0, connmgr = 0x0, min_mtu = 0, groups = {impl = {p = 0x0}}, n_groups = {2, 1415127537, 0, 0}, ogf = {types = 0, capabilities = 0, max_groups = {1612239072, 32674, 0, 0}, ofpacts = {0, 0, 0, 0}}, metadata_tab = {p = 0x0}, vl_mff_map = {cmap = {impl = {p = 0x0}}, mutex = {lock = { data = {lock = 0, count = 0, owner = 0, nusers = 0, kind = 0, spins = 0, elision = 0, list = {prev = 0x0, next = 0x0}}, size = '\000' <repeats 39 times>, align = 0}, where = 0xba27fb500000004 <error: 
Cannot access memory at address 0xba27fb500000004>}}}, backer = 0x906958e3, uuid = {parts = {0, 0, 1610935792, 32674}}, tables_version = 140335372744048, dump_seq = 0, miss_rule = 0x0, no_packet_in_rule = 0x0, drop_frags_rule = 0xd79f750d00000006, netflow = 0xf1991026332d2be0, sflow = 0x0, ipfix = 0x7fa260043830, bundles = {buckets = 0x7fa2600a99a0, one = 0x7fa2601fd6b0, mask = 0, n = 0}, ml = 0xb9971b5200000002, ms = 0x0, has_bonded_bundles = false, lacp_enabled = false, mbridge = 0x7fa260168b50, stats_mutex = {lock = {data = {lock = 0, count = 0, owner = 0, nusers = 0, kind = 0, spins = 0, elision = 0, list = {prev = 0x0, next = 0xc2d8714b00000006}}, size = '\000' <repeats 32 times>, "\006\000\000\000Kq\330", <incomplete sequence \302>, align = 0}, where = 0x3bb82d338c6f5ba6 <error: Cannot access memory at address 0x3bb82d338c6f5ba6>}, stats = {rx_packets = 0, tx_packets = 140335373557152, rx_bytes = 140335373226416, tx_bytes = 140335372593392, rx_errors = 0, tx_errors = 0, rx_dropped = 8378066835995623426, tx_dropped = 0, multicast = 0, collisions = 140335373981728, rx_length_errors = 0, rx_over_errors = 0, rx_crc_errors = 0, rx_frame_errors = 0, rx_fifo_errors = 12410622260254081028, rx_missed_errors = 1186231435, tx_aborted_errors = 0, tx_carrier_errors = 140335373939168, tx_fifo_errors = 140335373424512, tx_heartbeat_errors = 0, tx_window_errors = 0, rx_1_to_64_packets = 0, rx_65_to_127_packets = 0, rx_128_to_255_packets = 0, rx_256_to_511_packets = 0, rx_512_to_1023_packets = 0, rx_1024_to_1522_packets = 0, rx_1523_to_max_packets = 0, tx_1_to_64_packets = 0, tx_65_to_127_packets = 0, tx_128_to_255_packets = 6361799859137150982, tx_256_to_511_packets = 8919369999343897625, tx_512_to_1023_packets = 0, tx_1024_to_1522_packets = 140335373711424, tx_1523_to_max_packets = 140335373447328, tx_multicast_packets = 140335373316560, rx_broadcast_packets = 0, tx_broadcast_packets = 0, rx_undersized_errors = 1507772303897788418, rx_oversize_errors = 0, 
rx_fragmented_errors = 0, rx_jabber_errors = 0}, stp = 0x0, stp_last_tick = 0, rstp = 0x0, rstp_last_tick = 0, ports = {map = { buckets = 0xf3c911ba00000004, one = 0xe0e96d08, mask = 0, n = 140335373890928}}, ghost_ports = {map = {buckets = 0x7fa26014ed80, one = 0x0, mask = 0, n = 0}}, port_poll_set = {map = {buckets = 0x0, one = 0x0, mask = 0, n = 0}}, port_poll_errno = 0, change_seq = 0, ams = {mutex = {lock = {data = {lock = 0, count = 0, owner = 0, nusers = 0, kind = 4, spins = -26912, elision = -18608, list = {prev = 0xbb62aa2c, next = 0x0}}, size = '\000' <repeats 16 times>, "\004\000\000\000\340\226P\267,\252b\273", '\000' <repeats 11 times>, __align = 0}, where = 0x7fa260074160 ""}, list = {prev = 0x7fa260125520, next = 0x0}, n = 0}, ams_seq = 0x0, ams_seqno = 3765277242901397506, is_controller_connected = false}
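The faulting instruction in the first trace can be cross-checked by redoing its address arithmetic by hand. The following is a minimal sketch in Python, not OVS code: the function name is made up for illustration, and the offsets (0x128 for the pointer loaded from the ofproto, a 240-byte stride per table, 0xe0 for the counter being updated) are simply read off the disassembly above.

```python
def faulting_address(tables_ptr, table_id):
    """Mirror the address computation seen in the disassembly of
    ofproto_dpif_credit_table_stats (hypothetical helper, for illustration)."""
    index = table_id & 0xFF                  # movzbl %sil,%esi
    stride = ((index << 4) - index) << 4     # shl/sub/shl: index * 240
    base = tables_ptr + stride               # add 0x128(%rdi),%rax
    return base + 0xE0                       # lock add %rdx,0xe0(%rax)


# Values from the dump: table_id = 0 and rax = 0x0 at the fault, so the
# pointer loaded from 0x128(%rdi) must have been NULL. This matches the
# struct dump, which prints tables = 0x0.
addr = faulting_address(tables_ptr=0x0, table_id=0)
print(hex(addr))  # prints 0xe0
```

Under these assumptions the atomic add targets address 0xe0, which is not normally mapped in a process, so the write faults. Together with the garbage in the rest of the struct dump, this suggests the ofproto memory had already been released or overwritten when the statistics were pushed.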

Second stack trace:

(gdb) bt
#0  0xfffffac0e9000000 in ?? ()
#1  0x00000000010918df in rule_destroy_cb (rule=0x3942430) at ofproto/ofproto.c:2943
#2  0x0000000001175e16 in ovsrcu_call_postponed () at lib/ovs-rcu.c:348
#3  0x0000000001175f04 in ovsrcu_postpone_thread (arg=<optimized out>) at lib/ovs-rcu.c:364
#4  0x000000000117808d in ovsthread_wrapper (aux=<optimized out>) at lib/ovs-thread.c:383
#5  0x00007fd6f0c5a14a in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd6efeb3f23 in clone () from /lib64/libc.so.6

(gdb) frame 1
#1  0x00000000010918df in rule_destroy_cb (rule=0x3942430) at ofproto/ofproto.c:2943

2943        rule->ofproto->ofproto_class->rule_destruct(rule);

(gdb) info reg
rax            0x408d00            4230400
rbx            0x3942430           60040240
rcx            0xc11               3089
rdx            0x7fd6b0000080      140560052519040
rsi            0x0                 0
rdi            0x3942430           60040240
rbp            0x7fd6b0011be0      0x7fd6b0011be0
rsp            0x7fd6ecb36fb0      0x7fd6ecb36fb0
r8             0x7fd6d8000924      140560723609892
r9             0x7                 7
r10            0x2d83b96           47725462
r11            0x206               518
r12            0x7fd6ecb36fc0      140561070911424
r13            0x7fd6ed337ebf      140561079303871
r14            0x7fd6ed337f50      140561079304016
r15            0x7fd6ecb37100      140561070911744
rip            0x10918df           0x10918df <rule_destroy_cb+47>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x0                 0
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0

(gdb) disassemble 0x10918df
Dump of assembler code for function rule_destroy_cb:
   0x00000000010918b0 <+0>:     push %rbx
   0x00000000010918b1 <+1>:     mov %rdi,%rbx
   0x00000000010918b4 <+4>:     testb $0x1,0xa0(%rdi)
   0x00000000010918bb <+11>:    je 0x10918cf <rule_destroy_cb+31>
   0x00000000010918bd <+13>:    cmpb $0x6,0xaa(%rdi)
   0x00000000010918c4 <+20>:    je 0x10918cf <rule_destroy_cb+31>
   0x00000000010918c6 <+22>:    cmpl $0xffff,0x18(%rdi)
   0x00000000010918cd <+29>:    jle 0x1091918 <rule_destroy_cb+104>
   0x00000000010918cf <+31>:    mov (%rbx),%rax
   0x00000000010918d2 <+34>:    mov %rbx,%rdi
   0x00000000010918d5 <+37>:    mov 0x10(%rax),%rax
   0x00000000010918d9 <+41>:    callq *0x158(%rax)
=> 0x00000000010918df <+47>:    mov (%rbx),%rax
   0x00000000010918e2 <+50>:    mov 0x118(%rbx),%rsi
   0x00000000010918e9 <+57>:    lea 0x210(%rax),%rdi
   0x00000000010918f0 <+64>:    callq 0x1117220
   0x00000000010918f5 <+69>:    mov (%rbx),%rax
   0x00000000010918f8 <+72>:    mov 0x120(%rbx),%rsi
   0x00000000010918ff <+79>:    lea 0x210(%rax),%rdi
   0x0000000001091906 <+86>:    callq 0x1117220
   0x000000000109190b <+91>:    mov %rbx,%rdi
   0x000000000109190e <+94>:    pop %rbx
   0x000000000109190f <+95>:    jmpq 0x1090b50
   0x0000000001091914 <+100>:   nopl 0x0(%rax)
   0x0000000001091918 <+104>:   callq 0x1091760
   0x000000000109191d <+109>:   jmp 0x10918cf <rule_destroy_cb+31>
End of assembler dump.

(gdb) p *rule
$1 = {ofproto = 0x37623e0, cr = {node = {prev = 0xcccccccccccccccc, next = {p = 0x3874bc0}}, priority = 0, cls_match = {p = 0x0}, match = {{{flow = 0x38e9de0, mask = 0x38e9df0}, flows = {0x38e9de0, 0x38e9df0}}, tun_md = 0x0}}, table_id = 0 '\000', state = RULE_REMOVED, mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}, where = 0x14ab7e8 ""}, ref_count = {count = 0}, flow_cookie = 5638004435696427197, cookie_node = {hash = 3567688855, d = 0x387b8b8, s = 0x0}, flags = (unknown: 0), hard_timeout = 0, idle_timeout = 0, importance = 0, removed_reason = 2 '\002', eviction_group = 0x0, evg_node = {idx = 0, priority = 0}, actions = 0x3889fa0, meter_list_node = {prev = 0x3942500, next = 0x3942500}, monitor_flags = (unknown: 0), add_seqno = 49, modify_seqno = 49, expirable = {prev = 0x3942528, next = 0x3942528}, created = 95441005, modified = 95441005, match_tlv_bitmap = 0, ofpacts_tlv_bitmap = 0}

(gdb) p *rule->ofproto->ofproto_class->rule_destruct
Cannot access memory at address 0xfffffac0e9000000
(gdb)

Steps to reproduce:

Run systemctl restart openvswitch many times.

Observed behavior: ovs-vswitchd crashes with the stack traces shown above.
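The reproduction step can be automated with a small loop. This is a rough sketch in Python, assuming the unit is named openvswitch as in the command above; the function name, the delay, and the injectable run and sleep parameters are illustrative choices rather than part of the original report.

```python
import subprocess
import time


def restart_loop(count, delay=2.0, run=subprocess.run, sleep=time.sleep):
    """Restart the openvswitch unit `count` times.

    Returns the indices of iterations where systemctl exited non-zero.
    `run` and `sleep` are injectable so the loop can be dry-run.
    """
    failures = []
    for i in range(count):
        result = run(["systemctl", "restart", "openvswitch"])
        if result.returncode != 0:
            failures.append(i)
        sleep(delay)  # give the daemon time to settle before the next restart
    return failures
```

Calling restart_loop(100) as root would issue a hundred restarts. A non-zero exit status from systemctl is only a rough signal: a crash during shutdown may still let the restart succeed, so core dumps and the journal remain the place to confirm a hit.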


Notify maintainers ovs-dev@openvswitch.org

junka commented 2 years ago

It seems there is a use-after-free here.
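One way to picture the suspected use-after-free, based on the second trace (rule_destroy_cb running from ovsrcu_call_postponed and calling through rule->ofproto): a deferred callback keeps a pointer to an owner that is torn down before the callback runs. The following is a toy model in Python, not OVS code. The Owner and Rule classes and the postponed list are hypothetical stand-ins, and whether this ordering is the actual root cause is not established in this thread.

```python
class Owner:
    """Stand-in for the ofproto; `destruct` plays the role of a class hook."""

    def __init__(self):
        self.alive = True
        self.destruct = lambda rule: "destructed"

    def free(self):
        # After a real free the memory may be reused; model that by
        # invalidating the hook.
        self.alive = False
        self.destruct = None


class Rule:
    def __init__(self, owner):
        self.owner = owner  # raw back-pointer, no reference held on the owner


def rule_destroy_cb(rule):
    # Analogous to calling through rule->ofproto->ofproto_class.
    if not rule.owner.alive:
        raise RuntimeError("use-after-free: owner freed before callback ran")
    return rule.owner.destruct(rule)


postponed = []                      # callbacks deferred to a later point
owner = Owner()
rule = Rule(owner)
postponed.append(lambda: rule_destroy_cb(rule))

owner.free()                        # teardown runs before the deferred work

try:
    for cb in postponed:
        cb()
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

In C there is no liveness flag to check, so the equivalent call jumps to whatever bytes now occupy the freed structure, which would be consistent with the bogus target 0xfffffac0e9000000 in frame 0.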

igsilya commented 2 years ago

You might be hitting one of the following two issues:

https://patchwork.ozlabs.org/project/openvswitch/patch/20220219032607.15757-1-hepeng.0320@bytedance.com/
https://patchwork.ozlabs.org/project/openvswitch/patch/1638530715-44436-1-git-send-email-wangyunjian@huawei.com/

Both patches are still in progress and have not been accepted yet. On the other hand, 2.14.1 is very old at this point. The latest release on the 2.14 branch is 2.14.4, which contains about 120 fixes on top of 2.14.1.
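Checking whether a host is behind the latest point release of its branch can be scripted. This is a small sketch in Python: the helper names are made up, and the sample banner string is an assumption about what the installed binary reports, so adjust the input to whatever your system actually prints.

```python
import re


def parse_version(text):
    """Extract the first x.y.z version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no x.y.z version found in %r" % text)
    return tuple(int(part) for part in match.groups())


def behind_on_branch(installed, latest):
    """True if both versions share major.minor and `installed` is older."""
    have, want = parse_version(installed), parse_version(latest)
    return have[:2] == want[:2] and have < want


# Assumed banner format; 2.14.1 and 2.14.4 are the versions named above.
print(behind_on_branch("ovs-vswitchd (Open vSwitch) 2.14.1", "2.14.4"))
```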