openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

LACP member down after waiting for urcu2 to quiesce. #345

Open danieldin95 opened 1 month ago

danieldin95 commented 1 month ago

Open vSwitch version

[root@node-2 ~]# ovs-vsctl show | grep ovs_v
    ovs_version: "2.16.2"
[root@node-2 ~]#

We added enp4s0f0np0 and enp2s0f0np0 as an LACP bond in fast mode. When the urcu thread hits an unreasonably long poll interval, the main thread gets blocked for more than 2000 ms and a member of the LACP bond goes to link state down.

2024-10-08 19:16:16.577 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:16.577Z|29081266|ovs_rcu|WARN|blocked 1000 ms waiting for urcu2 to quiesce
2024-10-08 19:16:17.577 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.577Z|29081267|ovs_rcu|WARN|blocked 2000 ms waiting for urcu2 to quiesce
2024-10-08 19:16:17.858 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.858Z|03663|timeval(urcu2)|WARN|Unreasonably long 2306ms poll interval (2285ms user, 0ms system)
2024-10-08 19:16:17.858 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.858Z|03664|timeval(urcu2)|WARN|faults: 1 minor, 0 major
2024-10-08 19:16:17.858 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.858Z|03665|timeval(urcu2)|WARN|context switches: 3225 voluntary, 39 involuntary
2024-10-08 19:16:17.858 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.858Z|03666|poll_loop(urcu2)|INFO|Dropped 57443 log messages in last 9 seconds (most recently, 5 seconds ago) due to excessive rate
2024-10-08 19:16:17.858 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.858Z|03667|poll_loop(urcu2)|INFO|wakeup due to 0-ms timeout at lib/ovs-rcu.c:236 (67% CPU usage)
2024-10-08 19:16:17.867 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.867Z|29081268|bond|INFO|member enp4s0f0np0: link state down
2024-10-08 19:16:17.867 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.867Z|29081269|bond|INFO|member enp4s0f0np0: disabled
2024-10-08 19:16:17.964 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.964Z|29081270|bond|INFO|member enp4s0f0np0: link state up
2024-10-08 19:16:17.964 +0800 ¦ node-46 ¦ openvswitch ¦  ¦ 2024-10-08T11:16:17.964Z|29081271|bond|INFO|member enp4s0f0np0: enabled
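
For reference, the bond was created along these lines (the bridge name, bond name, and bond_mode below are placeholders, not our exact setup); other_config:lacp-time=fast selects the fast LACP timers:

# Create an LACP bond in fast mode (names/bond_mode are illustrative):
ovs-vsctl add-bond br-ex bond0 enp4s0f0np0 enp2s0f0np0 \
    bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
# Check negotiation and member state:
ovs-appctl lacp/show bond0
ovs-appctl bond/show bond0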
chaudron commented 4 weeks ago

Did it recover on its own, or did you have to restart OVS? If it recovered automatically, the issue is likely due to an overloaded system. Check out the following blog for troubleshooting steps:

https://developers.redhat.com/articles/2024/01/31/debugging-open-vswitch-logs-long-poll-interval-blocked-waiting

danieldin95 commented 3 weeks ago

Thank you @chaudron. But I can't understand why LACP hasn't been moved to the monitor thread or a dedicated thread. Sending LACP packets from the main thread is dangerous, since we don't know what might cause the main thread to get blocked.

danieldin95 commented 2 weeks ago

Hi @chaudron, the long blocking does recover on its own.

chaudron commented 2 weeks ago

If your system is overloaded, the blocking will recover on its own. If for some reason the blocking continues, it's probably a bug. It could also be that there is a lot of RCU backlog work, but spending 2+ seconds on that sounds like a lot.

danieldin95 commented 2 weeks ago

Yes @chaudron. This issue arises in our production environment, where we use OVN for the OpenStack network. To carry more traffic from the tenant networks to the external network, we enabled OVS-DPDK on the gateway nodes (which are also the network nodes). Although DPDK improves the throughput of elephant flows and the forwarding latency of individual packets, when a large number (~200k) of small flows (pps < 5) are generated all at once, it still puts considerable load on the revalidator threads, causing ovs-rcu to spend a lot of time processing callbacks.

chaudron commented 2 weeks ago

How many datapath flows do you have in this scenario, and how busy are the revalidators (ovs-appctl upcall/show)? Are you sure enough CPU resources are available for OVS?
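
For example, something like the following shows per-thread CPU usage and the revalidator statistics (pidstat comes from the sysstat package):

# Per-thread CPU usage; revalidator, handler, and urcu threads show up by name:
pidstat -t -p "$(pidof ovs-vswitchd)" 1
# Or interactively:
top -H -p "$(pidof ovs-vswitchd)"
# Revalidator and datapath flow statistics:
ovs-appctl upcall/show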

danieldin95 commented 2 weeks ago

The datapath has ~200k flows, and the vswitchd has ~500k OpenFlow rules.

[root@node-10 ~]# ovs-ofctl dump-flows br-int | wc
 559508 5030055 128885820
[root@node-10 ~]# ovs-appctl upcall/show | grep flows
  flows         : (current 260044) (avg 260037) (max 704399) (limit 382736)
chaudron commented 2 weeks ago

That does not look like too much. What is your dump duration? It was cut off in your output.
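
It is part of the same upcall/show output, so widening the grep should show it:

ovs-appctl upcall/show | grep -E 'flows|dump duration'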

danieldin95 commented 2 weeks ago

The dump duration is about ~2000ms, but this is not the root cause of the LACP link going down. The revalidator leaves its non-quiescent state after each dump batch (50 flows). Would it be more reasonable to implement the sending of LACP messages in a dedicated thread?

chaudron commented 2 weeks ago

Is the ~2000ms from the upcall/show output, or are you basing this on the log messages? 2000ms for the dump duration sounds way too long to me. You might want to figure out why this is. Do you have some kind of rte offloading that takes time (assuming this is the DPDK datapath)? Also, I think you should investigate why the urcu2 thread gets stalled, as callbacks should be quick.
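
(Whether hardware/rte_flow offload is enabled can be checked in the global other_config map; the key is other_config:hw-offload.)

# Look for hw-offload="true" in the output:
ovs-vsctl get Open_vSwitch . other_config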

danieldin95 commented 2 weeks ago

Yes, we also see the "Unreasonably long ... poll interval" warnings in the log. We wanted to know how long the callbacks take, so we applied the following patch; it reports ~1500ms.

    /* Local instrumentation in ovsrcu_call_postponed(): time how long the
     * postponed RCU callbacks take to run and warn if it exceeds 100 ms. */
    long long int start = time_msec();
    long long int elapsed;
    int call = 0;

    LIST_FOR_EACH_POP (cbset, list_node, &cbsets) {
        struct ovsrcu_cb *cb;

        for (cb = cbset->cbs; cb < &cbset->cbs[cbset->n_cbs]; cb++) {
            cb->function(cb->aux);
            call++;
        }
        free(cbset->cbs);
        free(cbset);
    }

    elapsed = time_msec() - start;
    if (elapsed >= 100) {
        VLOG_WARN("called %d RCU callbacks; freeing the guarded list took "
                  "%lld ms.", call, elapsed);
    }

So we suspect that many OpenFlow rules are referenced from a ukey, and dropping the atomic reference counts of those rules adds up to a lot of time.

chaudron commented 2 weeks ago

It would be nice to know exactly what function is causing this so you can further analyze what is taking 1.5 seconds, which seems way too long. You could attach a perf probe to your new function. Locate it with perf probe -x /usr/sbin/ovs-vswitchd -L ovsrcu_call_postponed@ovs-rcu.c and then record it with -g --call-graph dwarf so you can see the call graph.
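
A rough sequence (requires ovs-vswitchd debuginfo; the probe event name perf prints may differ from the illustrative probe_ovs:... used below, so substitute whatever it reports):

# List the probeable lines in the function:
perf probe -x /usr/sbin/ovs-vswitchd -L ovsrcu_call_postponed@ovs-rcu.c
# Add a probe at the function entry (perf prints the event name it created):
perf probe -x /usr/sbin/ovs-vswitchd --add ovsrcu_call_postponed
# Record that event with DWARF call graphs for 30 seconds, then inspect:
perf record -e probe_ovs:ovsrcu_call_postponed -g --call-graph dwarf \
    -p "$(pidof ovs-vswitchd)" -- sleep 30
perf report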

danieldin95 commented 2 weeks ago

Thanks @chaudron. Do you have any suggestions for implementing LACP in a dedicated thread, like the monitor thread? Or what does the community suggest to avoid the main thread getting blocked and affecting the sending of LACP packets?

chaudron commented 2 weeks ago

Your problem is that the RCU thread is being blocked, not the main thread. Is the main thread blocked waiting on this synchronization? Anyway, I think it's better to find the root cause of the delay and fix this, instead of working around the real issue.

danieldin95 commented 2 weeks ago

I understand your point of view, and I will continue to investigate the RCU blocking. But with ~550k OpenFlow rules it also takes a lot of time to dump all of the flow rules via ovs-ofctl, and the OpenFlow dump request is handled in the main thread as well.

2024-10-24T10:04:36.859Z|16210|timeval|WARN|Unreasonably long 2085ms poll interval (2079ms user, 0ms system)
2024-10-24T10:04:36.860Z|16211|timeval|WARN|faults: 4444 minor, 0 major
2024-10-24T10:04:36.860Z|16212|timeval|WARN|context switches: 72 voluntary, 9 involuntary
2024-10-24T10:04:36.860Z|16213|coverage|INFO|Skipping details of duplicate event coverage for hash=843564ad
2024-10-24T10:04:36.860Z|16214|poll_loop|INFO|Dropped 150 log messages in last 167 seconds (most recently, 164 seconds ago) due to excessive rate
2024-10-24T10:04:36.860Z|16215|poll_loop|INFO|wakeup due to [POLLIN] on fd 950 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:274 (51% CPU usage)
2024-10-24T10:04:36.863Z|16216|poll_loop|INFO|wakeup due to [POLLOUT] on fd 985 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:153 (51% CPU usage)
2024-10-24T10:04:36.867Z|16217|poll_loop|INFO|wakeup due to [POLLIN] on fd 950 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:274 (51% CPU usage)
2024-10-24T10:04:36.871Z|16218|poll_loop|INFO|wakeup due to [POLLOUT] on fd 987 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:153 (51% CPU usage)
2024-10-24T10:04:36.874Z|16219|poll_loop|INFO|wakeup due to [POLLIN] on fd 987 (/var/run/openvswitch/br-Bond1.mgmt<->) at lib/stream-fd.c:157 (51% CPU usage)

This means that every time we dump the OpenFlow rules, it affects LACP packet transmission. I don't think that is a good design.
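
The correlation is easy to reproduce (default log path; adjust if your logs go elsewhere):

# Shell 1: trigger a full OpenFlow dump on the busy bridge:
time ovs-ofctl dump-flows br-int > /dev/null
# Shell 2: watch for bond flaps and long-poll warnings while the dump runs:
tail -f /var/log/openvswitch/ovs-vswitchd.log | \
    grep -E 'bond.*link state|Unreasonably long|blocked .* ms waiting'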

chaudron commented 2 weeks ago

I have a system here with 200k datapath flows, and dumping takes around 150ms, with peaks below 500ms when revalidating flows. Dumping (Open)Flow rules might take time, but the system/main thread is not blocked during the dump.

danieldin95 commented 2 weeks ago

Thank you @chaudron for your reply. Can you show me your CPU info? The following is mine.

Architecture:        aarch64
CPU(s):              96
NUMA node(s):        4
Model:               0
Model name:          Kunpeng-920
CPU max MHz:         2600.0000
CPU min MHz:         200.0000
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
chaudron commented 2 weeks ago

I have an Intel system:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
    BIOS Model name:     Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
    Stepping:            6
...
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
  NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
danieldin95 commented 6 days ago

Thank you @chaudron for your reply. I'm very sorry for taking so long to get back to you. I have looked at how LACP is implemented in the kernel as well as in DPDK. The kernel uses work queues to update the LACP state machine and send LACP messages. DPDK's bonding driver queues the LACP message into a per-port tx ring from a timer task, and the DPDK application then drains that buffer when it calls the rte_eth_tx_burst API. So would it be possible for the OVS community to implement the sending of LACP messages in a timer task? This design would be more robust for LACP in production environments.

chaudron commented 6 days ago

It doesn’t look like anyone is currently working on this upstream, likely because no one has encountered any issues with it yet. Once your root cause is resolved, you might find that it’s no longer necessary. You could start a discussion on the OVS discuss mailing list to see if others have run into this issue as well.