ovn-org / ovn

Open Virtual Network
Apache License 2.0
497 stars 244 forks source link

ovn-controller is stuck for a long time, causing the port to be unable to come up. #252

Open gugulee opened 2 months ago

gugulee commented 2 months ago

There are a total of 200 to 300 compute nodes online, with approximately 20,000 to 30,000 ports (mainly distributed across two logical switches). Each compute node has about 200,000 OVS flow entries. After the ovn-controller restarts, it gets stuck for a long time (about 8 minutes). during the flow table processing period, the network on this node is unavailable.

ovn-controller logs are as follows:

2024-07-04T02:36:15.012Z|15050873|inc_proc_eng|INFO|node: logical_flow_output, handler for input SB_logical_flow took 2612ms
2024-07-04T02:44:29.806Z|15050874|inc_proc_eng|INFO|node: physical_flow_output, handler for input SB_port_binding took 494793ms
2024-07-04T02:44:30.174Z|15050875|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection dropped (Broken pipe)

the pstackoutput is:

Thread 4 (Thread 0x7f9e321e0700 (LWP 1451710)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb2aefc0 in stopwatch_thread ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 3 (Thread 0x7f9e329e1700 (LWP 1451706)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb297b3a in ovsrcu_postpone_thread ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 2 (Thread 0x7f9e331e2700 (LWP 1451705)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb1db761 in pinctrl_handler ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 1 (Thread 0x7f9e36979a40 (LWP 1451678)):
#0  0x00005614bb2568a7 in mf_format ()
#1  0x00005614bb2656f0 in ofpacts_format ()
#2  0x00005614bb1c9308 in ovn_flow_to_string ()
#3  0x00005614bb1c9888 in ovn_flow_log ()
#4  0x00005614bb1cc6de in ofctrl_add_flow ()
#5  0x00005614bb1ece85 in consider_port_binding ()
#6  0x00005614bb1ee928 in physical_eval_port_binding ()
#7  0x00005614bb1eebb5 in physical_handle_flows_for_lport ()
#8  0x00005614bb1e50cb in pflow_output_sb_port_binding_handler ()
#9  0x00005614bb20da23 in engine_run ()
#10 0x00005614bb1b17e7 in main ()
Thread 4 (Thread 0x7f9e321e0700 (LWP 1451710)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb2aefc0 in stopwatch_thread ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 3 (Thread 0x7f9e329e1700 (LWP 1451706)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb297b3a in ovsrcu_postpone_thread ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 2 (Thread 0x7f9e331e2700 (LWP 1451705)):
#0  0x00007f9e349f5ac1 in poll () from target:/lib64/libc.so.6
#1  0x00005614bb2bacb5 in time_poll ()
#2  0x00005614bb2b009c in poll_block ()
#3  0x00005614bb1db761 in pinctrl_handler ()
#4  0x00005614bb29a293 in ovsthread_wrapper ()
#5  0x00007f9e355611ca in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f9e348fc8d3 in clone () from target:/lib64/libc.so.6
Thread 1 (Thread 0x7f9e36979a40 (LWP 1451678)):
#0  0x00007f9e348fd2fb in socket () from target:/lib64/libc.so.6
#1  0x00007f9e349faa1b in openlog_internal () from target:/lib64/libc.so.6
#2  0x00007f9e349faf0f in __vsyslog_chk () from target:/lib64/libc.so.6
#3  0x00007f9e349fb1c3 in __syslog_chk () from target:/lib64/libc.so.6
#4  0x00005614bb2c7779 in vlog_valist ()
#5  0x00005614bb2c6e43 in vlog ()
#6  0x00005614bb1c98c8 in ovn_flow_log ()
#7  0x00005614bb1cb4e5 in remove_flows_from_sb_to_flow ()
#8  0x00005614bb1cc9ed in ofctrl_remove_flows ()
#9  0x00005614bb1eeba7 in physical_handle_flows_for_lport ()
#10 0x00005614bb1e50cb in pflow_output_sb_port_binding_handler ()
#11 0x00005614bb20da23 in engine_run ()
#12 0x00005614bb1b17e7 in main ()

It seem that the time is mainly consumed in the pflow_output_sb_port_binding_handler function.

I found that after the ovn-controller restarts, all OVS flow tables are cleared, causing network issues for existing ports. I tried using the external_ids:ovn-ofctrl-wait-before-clear option. Although it can prevent the flow tables from being cleared, during this period, the ports assigned to this node cannot come up, and the mac_binding table cannot be updated properly.

In this scenario, how can it be optimized? Are there any other options I might have missed?

almusil commented 1 month ago

Hi,

this sounds like something that might be addressed by https://patchwork.ozlabs.org/project/ovn/patch/20240826131509.202811-1-amusil@redhat.com/. If you have the chance would you mind trying this commit if it helps?