danieldin95 opened 1 month ago
Did it recover on its own, or did you have to restart OVS? If it recovered automatically, the issue is likely due to an overloaded system. Check out the following blog for troubleshooting steps:
Thank you @chaudron. But I can't understand why LACP hasn't been moved into the monitor thread or its own dedicated thread. Sending the LACP packets from the main thread is dangerous; we don't know what might cause the main thread to get blocked.
Hi @chaudron, the long blocking does recover on its own.
If your system is overloaded, the blocking will recover on its own. If for some reason the blocking continues, it's probably a bug. It could also be that there is a lot of RCU backlog work, but spending 2+ seconds on this sounds like a lot.
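To illustrate what that RCU backlog work is: OVS postpones frees with ovsrcu_postpone(), and the urcu thread later flushes the accumulated callbacks in one pass. Below is a rough standalone model of that pattern, not the actual OVS code; postpone(), flush_thread() and struct cb are made-up names for illustration.

/* Standalone model of deferred (RCU-style) cleanup: a producer postpones
 * frees, a background thread flushes the backlog in one pass.
 * Build with: cc -pthread rcu_backlog.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct cb {
    void (*function)(void *aux);    /* deferred callback */
    void *aux;
    struct cb *next;
};

static struct cb *backlog;          /* accumulated postponed callbacks */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void free_cb(void *aux) { free(aux); }

/* Producer side: instead of freeing immediately, postpone the free. */
static void postpone(void (*function)(void *aux), void *aux)
{
    struct cb *cb = malloc(sizeof *cb);
    cb->function = function;
    cb->aux = aux;
    pthread_mutex_lock(&mutex);
    cb->next = backlog;
    backlog = cb;
    pthread_mutex_unlock(&mutex);
}

/* Cleanup thread: drains whatever has accumulated and times the flush. */
static void *flush_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&mutex);
    struct cb *list = backlog;
    backlog = NULL;
    pthread_mutex_unlock(&mutex);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long n = 0;
    while (list) {
        struct cb *next = list->next;
        list->function(list->aux);  /* e.g. free a retired flow or rule */
        free(list);
        list = next;
        n++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ms = (t1.tv_sec - t0.tv_sec) * 1000
              + (t1.tv_nsec - t0.tv_nsec) / 1000000;
    printf("flushed %ld callbacks in %ld ms\n", n, ms);
    return NULL;
}

int main(void)
{
    /* Retire many small objects at once, as a mass flow deletion would. */
    for (int i = 0; i < 200000; i++) {
        postpone(free_cb, malloc(64));
    }
    pthread_t t;
    pthread_create(&t, NULL, flush_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}

Even a few microseconds per callback adds up to hundreds of milliseconds when a few hundred thousand objects are retired in one burst, so a backlog by itself is not necessarily a bug, but a multi-second flush deserves a closer look.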
Yes @chaudron. But this issue arises in our production environment, where we use OVN for the OpenStack network. To carry more traffic from the tenant networks to the external network, we enabled OVS-DPDK on the gateway nodes (which are also the network nodes). Although DPDK improves the throughput of elephant flows and the forwarding latency of individual packets, once a large number (~200,000) of small flows (pps < 5) are created in a burst, it still puts considerable load on the revalidator threads, causing ovs-rcu to spend a lot of time running callbacks.
How many datapath flows do you have in this scenario, and how busy are the revalidators (ovs-appctl upcall/show)? Are you sure enough CPU resources are available for OVS?
The datapath has ~200,000 flows, and vswitchd has ~500,000 OpenFlow rules:
[root@node-10 ~]# ovs-ofctl dump-flows br-int | wc
559508 5030055 128885820
[root@node-10 ~]# ovs-appctl upcall/show | grep flows
flows : (current 260044) (avg 260037) (max 704399) (limit 382736)
That does not look like too much. What is your dump duration? It was cut off in your output.
The dump duration is about ~2000 ms, but that is not the root cause of the LACP link going down: the revalidator exits its non-quiescent state after each dump batch (50 flows). Would it be more reasonable to implement the sending of LACP messages in its own dedicated thread?
Is the ~2000 ms from the upcall/show output, or are you basing it on the log messages? 2000 ms for the dump duration sounds way too long to me; you might want to figure out why that is. Do you have some kind of rte offloading that takes time (assuming this is the DPDK datapath)? Also, I think you should investigate why the urcu2 thread gets stalled, as callbacks should be quick.
Yes, we have also seen the "Unreasonably long ... poll interval" warning in the log. We wanted to know how long the callbacks take, and after applying the following instrumentation patch, it shows ~1500 ms:
/* Instrumentation inside ovsrcu_call_postponed() in lib/ovs-rcu.c. */
int call = 0;
long long int elapsed;
long long int start = time_msec();

LIST_FOR_EACH_POP (cbset, list_node, &cbsets) {
    struct ovsrcu_cb *cb;

    for (cb = cbset->cbs; cb < &cbset->cbs[cbset->n_cbs]; cb++) {
        cb->function(cb->aux);
        call++;
    }
    free(cbset->cbs);
    free(cbset);
}

elapsed = time_msec() - start;
if (elapsed >= 100) {
    VLOG_WARN("ran %d postponed RCU callbacks in %lld ms.", call, elapsed);
}
So we suspect that many OpenFlow rules are associated with each ukey, and releasing the atomic reference counts of those rules consumed a lot of time cumulatively.
It would be nice to know exactly which function is causing this, so you can further analyze what is taking 1.5 seconds, which seems way too long. You could attach a perf probe to your new function. Locate it with

perf probe -x /usr/sbin/ovs-vswitchd -L ovsrcu_call_postponed@ovs-rcu.c

and then record it with -g --call-graph dwarf so you can see the call graph.
Thanks @chaudron. Do you have any suggestions for implementing LACP in a dedicated thread, like the monitor thread? Or what does the community suggest for preventing the main thread's blocking from affecting the sending of LACP packets?
Your problem is that the RCU thread is being blocked, not the main thread. Is the main thread blocked waiting on this synchronization? Anyway, I think it's better to find the root cause of the delay and fix this, instead of working around the real issue.
I understand your point of view, and I will continue to investigate the RCU blocking. But with ~550,000 OpenFlow rules it also takes a lot of time to dump all flow rules via ovs-ofctl, and the OpenFlow request to dump the rules is also handled in the main thread:
2024-10-24T10:04:36.859Z|16210|timeval|WARN|Unreasonably long 2085ms poll interval (2079ms user, 0ms system)
2024-10-24T10:04:36.860Z|16211|timeval|WARN|faults: 4444 minor, 0 major
2024-10-24T10:04:36.860Z|16212|timeval|WARN|context switches: 72 voluntary, 9 involuntary
2024-10-24T10:04:36.860Z|16213|coverage|INFO|Skipping details of duplicate event coverage for hash=843564ad
2024-10-24T10:04:36.860Z|16214|poll_loop|INFO|Dropped 150 log messages in last 167 seconds (most recently, 164 seconds ago) due to excessive rate
2024-10-24T10:04:36.860Z|16215|poll_loop|INFO|wakeup due to [POLLIN] on fd 950 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:274 (51% CPU usage)
2024-10-24T10:04:36.863Z|16216|poll_loop|INFO|wakeup due to [POLLOUT] on fd 985 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:153 (51% CPU usage)
2024-10-24T10:04:36.867Z|16217|poll_loop|INFO|wakeup due to [POLLIN] on fd 950 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:274 (51% CPU usage)
2024-10-24T10:04:36.871Z|16218|poll_loop|INFO|wakeup due to [POLLOUT] on fd 987 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:153 (51% CPU usage)
2024-10-24T10:04:36.874Z|16219|poll_loop|INFO|wakeup due to [POLLIN] on fd 987 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:157 (51% CPU usage)
This means that every time we dump the OpenFlow rules, it can affect LACP packet transmission. I don't think that is a good implementation.
I have a system here with 200k datapath flows, and dumping takes around 150ms, with peaks to <500ms when revalidating flows. Dumping (open)flows might take time, but the system/main thread is not blocked during this dump.
Thank you @chaudron for your reply. Could you show me your CPU info? The following is mine:
Architecture: aarch64
CPU(s): 96
NUMA node(s): 4
Model: 0
Model name: Kunpeng-920
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
I have an Intel system:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
BIOS Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
Stepping: 6
...
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
Thank you @chaudron for your reply, and I'm very sorry for taking so long to get back to you. I have looked at how LACP is implemented in the kernel and in DPDK. The kernel uses work queues to update the LACP state machine and send LACP messages. DPDK likewise queues the LACP message on a tx ring of the port from a timer task, and the DPDK application later calls the rte_eth_tx_burst API to drain the LACP buffer queue. So would it be possible for the OVS community to implement the sending of LACP messages in a timer task? That design would be more robust for LACP in production environments.
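Roughly, the pattern I have in mind looks like the sketch below. It is a standalone illustration only, not real DPDK or OVS code (lacp_ring, lacp_timer_task and tx_burst are made-up names, and a real implementation would build proper LACPDUs and use a lockless ring): a periodic task enqueues the LACPDU on a per-port ring, and the datapath TX loop drains that ring, so LACP transmission no longer depends on the main thread being responsive.

/* Sketch: a periodic LACP task queues PDUs on a ring; the TX path drains it.
 * Build with: cc -pthread lacp_ring.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RING_SIZE 16                /* must be a power of two */
#define LACPDU_LEN 128              /* illustrative buffer size per LACPDU */

struct lacp_ring {
    unsigned char pdu[RING_SIZE][LACPDU_LEN];
    unsigned head, tail;            /* single producer / single consumer */
    pthread_mutex_t lock;
};

static struct lacp_ring ring = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Timer task: runs every second (fast rate) no matter what the main or
 * RCU threads are doing, and only queues the frame. */
static void *lacp_timer_task(void *arg)
{
    (void) arg;
    for (;;) {
        unsigned char pdu[LACPDU_LEN] = { 0x01, 0x80, 0xc2 /* dst MAC... */ };
        pthread_mutex_lock(&ring.lock);
        if (ring.head - ring.tail < RING_SIZE) {
            memcpy(ring.pdu[ring.head % RING_SIZE], pdu, LACPDU_LEN);
            ring.head++;
        }
        pthread_mutex_unlock(&ring.lock);
        sleep(1);
    }
    return NULL;
}

/* TX path: called from the datapath loop; sends pending LACPDUs along
 * with regular traffic. */
static void tx_burst(void)
{
    pthread_mutex_lock(&ring.lock);
    while (ring.tail != ring.head) {
        /* A real datapath would hand ring.pdu[tail % RING_SIZE] to the NIC. */
        printf("tx LACPDU #%u\n", ring.tail);
        ring.tail++;
    }
    pthread_mutex_unlock(&ring.lock);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, lacp_timer_task, NULL);
    for (int i = 0; i < 5; i++) {   /* stand-in for the PMD/TX loop */
        tx_burst();
        sleep(1);
    }
    return 0;
}

With this structure, a multi-second stall in the main or urcu thread only delays LACP configuration updates; the periodic transmit keeps its own cadence, which is the property we need.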
It doesn’t look like anyone is currently working on this upstream, likely because no one has encountered any issues with it yet. Once your root cause is resolved, you might find that it’s no longer necessary. You could start a discussion on the OVS discuss mailing list to see if others have run into this issue as well.
OpenvSwitch version

We added enp4s0f0np0 and enp2s0f0np0 as an LACP bond in fast mode. When the urcu thread hits an unreasonably long poll interval, the main thread is blocked for more than 2000 ms and a member of the LACP bond goes link-state down.