ovn-org / ovn

Hot upgrading OVN-controller | down time after restarting ovn_controller #247

Closed legitYosal closed 4 months ago

legitYosal commented 4 months ago

We are using OVN with Neutron on our OpenStack cluster. After restarting the ovn_controller container we encounter a network shortage on the private and public networks of the VMs. Setting ovs-vsctl set open . other_config:flow-restore-wait=true does not help with this, even though with it ovs-vswitchd restarts do not affect network connectivity. Can someone give a technical explanation of why this happens, and suggest possible ways to upgrade and restart the ovn_controller container without downtime?

dceara commented 4 months ago

CC-ing some random people that might know more about neutron (@booxter @cubeek @danalsan); please feel free to add more if appropriate

What you might want to set instead is external_ids:ovn-ofctrl-wait-before-clear=<max time, in ms, that ovn-controller waits before clearing and reinstalling OVS flows on startup>: https://github.com/ovn-org/ovn/blob/47915c4c517c634dec919cfd60295db0d0bedfa7/controller/ovn-controller.8.xml#L289-L321
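For example, on the chassis (a minimal illustration; the value is in milliseconds and 8000 is just a placeholder to tune for your environment):

ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=8000

ovn-controller reads this from the local Open_vSwitch table's external_ids, so it can be changed at runtime without restarting ovs-vswitchd.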

As for the explanation, this is just a guess, but I'm assuming that in your setup ovn-controller takes quite a long time to process the SB database contents, so there's a window between the initial OVS flow clear and the installation of the new flows, and that window causes the downtime you're experiencing.

We can't tell without more info.

Hope this helps, Dumitru

booxter commented 4 months ago

To add to what @dceara said, it could help if you clarify what exactly is meant by "network shortage". Some services are provided by ovn-controller controller() action handlers and it's expected that these flows are going to be disrupted during ovn-controller PID restart. These services include IPv6 ND, LB health checks... You can check for more examples in pinctrl.c in ovn repo.

But if you experience a complete breakdown of connectivity, and not just of specific services, then it's probably what Dumitru suggested.

FYI for Red Hat OpenStack, we set ovn-ofctrl-wait-before-clear to 8000 (8s) but allow the knob to be tweaked for larger environments if needed.
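To double-check what a node currently has configured (assuming the default OVS database socket; paths may differ in containerized deployments), something like this should work:

ovs-vsctl --if-exists get open . external_ids:ovn-ofctrl-wait-before-clear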

legitYosal commented 4 months ago

Thank you @dceara. We are using OVN version 22.03 built with OVS 2.17. I am testing on a staging environment deployed on bare metal, and I have overloaded the computes with VMs, so the SB DB is heavy and pulling the flows and recomputing takes a long time (about 3 seconds). Interestingly, if I stop ovn_controller nothing impacts the traffic flow; when ovn-controller starts, I think it tries to delete all the flows and re-install them. Because verbose mode produces excessive logs (1 million lines in 2-3 minutes), I am sending the normal logs split into the segments in which I lose connectivity:

### ========> Restart initiated

2024-05-21T13:54:17.724Z|00001|vlog|INFO|opened log file /var/log/kolla/openvswitch/ovn-controller.log
2024-05-21T13:54:17.725Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2024-05-21T13:54:17.725Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2024-05-21T13:54:17.729Z|00004|main|INFO|OVN internal version is : [22.03.0-20.21.0-58.3]
2024-05-21T13:54:17.729Z|00005|main|INFO|OVS IDL reconnected, force recompute.
2024-05-21T13:54:17.729Z|00006|reconnect|INFO|tcp:172.25.0.1:6642: connecting...
2024-05-21T13:54:17.729Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
2024-05-21T13:54:17.729Z|00008|reconnect|INFO|tcp:172.25.0.1:6642: connected
2024-05-21T13:54:20.163Z|00009|features|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2024-05-21T13:54:20.164Z|00010|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2024-05-21T13:54:20.166Z|00011|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2024-05-21T13:54:20.166Z|00012|features|INFO|OVS Feature: ct_zero_snat, state: supported
2024-05-21T13:54:20.166Z|00013|main|INFO|OVS feature set changed, force recompute.
2024-05-21T13:54:20.166Z|00014|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2024-05-21T13:54:20.166Z|00015|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2024-05-21T13:54:20.173Z|00016|main|INFO|OVS feature set changed, force recompute.
2024-05-21T13:54:20.173Z|00017|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2024-05-21T13:54:20.239Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2024-05-21T13:54:20.239Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2024-05-21T13:54:20.282Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected

### ========> Connectivity cut completely

2024-05-21T13:54:27.728Z|00018|memory|INFO|142464 kB peak resident set size after 10.0 seconds
2024-05-21T13:54:27.728Z|00019|memory|INFO|idl-cells:75663 lflow-cache-entries-cache-expr:2 lflow-cache-entries-cache-matches:90 lflow-cache-size-KB:7 local_datapath_usage-KB:1 ofctrl_desired_flow_usage-KB:9170 ofctrl_installed_flow_usage-KB:6180 ofctrl_sb_flow_ref_usage-KB:4852
2024-05-21T13:54:30.433Z|00020|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.440Z|00021|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.446Z|00022|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.453Z|00023|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.460Z|00024|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.466Z|00025|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.473Z|00026|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.479Z|00027|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.486Z|00028|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:30.492Z|00029|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (100% CPU usage)
2024-05-21T13:54:40.546Z|00030|inc_proc_eng|INFO|node: logical_flow_output, handler for input SB_logical_flow took 4151ms
2024-05-21T13:54:40.904Z|00031|timeval|WARN|Unreasonably long 9607ms poll interval (9154ms user, 452ms system)
2024-05-21T13:54:40.905Z|00032|timeval|WARN|faults: 403581 minor, 0 major
2024-05-21T13:54:40.906Z|00033|timeval|WARN|disk: 0 reads, 16 writes
2024-05-21T13:54:40.907Z|00034|timeval|WARN|context switches: 0 voluntary, 34 involuntary
2024-05-21T13:54:40.916Z|00035|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=cf844ca2:
2024-05-21T13:54:40.917Z|00036|coverage|INFO|lflow_run                  0.0/sec     0.033/sec        0.0006/sec   total: 2
2024-05-21T13:54:40.918Z|00037|coverage|INFO|consider_logical_flow      0.0/sec     1.917/sec        0.0319/sec   total: 268680
2024-05-21T13:54:40.919Z|00038|coverage|INFO|lflow_cache_add_expr       0.0/sec     0.033/sec        0.0006/sec   total: 11927
2024-05-21T13:54:40.920Z|00039|coverage|INFO|lflow_cache_add_matches    0.0/sec     1.500/sec        0.0250/sec   total: 11972
2024-05-21T13:54:40.921Z|00040|coverage|INFO|lflow_cache_add            0.0/sec     1.533/sec        0.0256/sec   total: 23899
2024-05-21T13:54:40.922Z|00041|coverage|INFO|lflow_cache_hit            0.0/sec     6.167/sec        0.1028/sec   total: 374
2024-05-21T13:54:40.923Z|00042|coverage|INFO|lflow_cache_miss           0.0/sec     3.233/sec        0.0539/sec   total: 103372
2024-05-21T13:54:40.924Z|00043|coverage|INFO|lflow_conj_alloc           0.0/sec     0.133/sec        0.0022/sec   total: 8
2024-05-21T13:54:40.925Z|00044|coverage|INFO|lflow_conj_free            0.0/sec     0.067/sec        0.0011/sec   total: 4
2024-05-21T13:54:40.926Z|00045|coverage|INFO|physical_run               0.0/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:40.927Z|00046|coverage|INFO|miniflow_malloc            0.0/sec  2392.133/sec       39.8689/sec   total: 191752
2024-05-21T13:54:40.928Z|00047|coverage|INFO|hmap_pathological          2.8/sec     3.983/sec        0.0664/sec   total: 713
2024-05-21T13:54:40.929Z|00048|coverage|INFO|hmap_expand              45756.2/sec  4204.700/sec       70.0783/sec   total: 370540
2024-05-21T13:54:40.930Z|00049|coverage|INFO|txn_unchanged            170.6/sec    14.917/sec        0.2486/sec   total: 1118
2024-05-21T13:54:40.931Z|00050|coverage|INFO|txn_incomplete             0.2/sec     0.067/sec        0.0011/sec   total: 5
2024-05-21T13:54:40.932Z|00051|coverage|INFO|txn_success                0.2/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:40.933Z|00052|coverage|INFO|poll_create_node         1630.4/sec   141.167/sec        2.3528/sec   total: 9156
2024-05-21T13:54:40.933Z|00053|coverage|INFO|poll_zero_timeout          0.0/sec     0.083/sec        0.0014/sec   total: 6
2024-05-21T13:54:40.933Z|00054|coverage|INFO|rconn_queued               0.0/sec   797.900/sec       13.2983/sec   total: 71996
2024-05-21T13:54:40.933Z|00055|coverage|INFO|rconn_sent                 0.0/sec   797.900/sec       13.2983/sec   total: 71996
2024-05-21T13:54:40.933Z|00056|coverage|INFO|seq_change               644.4/sec    55.817/sec        0.9303/sec   total: 3470
2024-05-21T13:54:40.933Z|00057|coverage|INFO|pstream_open               0.0/sec     0.017/sec        0.0003/sec   total: 1
2024-05-21T13:54:40.933Z|00058|coverage|INFO|stream_open                0.0/sec     0.083/sec        0.0014/sec   total: 5
2024-05-21T13:54:40.933Z|00059|coverage|INFO|util_xalloc              2113824.2/sec 215117.967/sec     3585.2994/sec   total: 29013996
2024-05-21T13:54:40.933Z|00060|coverage|INFO|vconn_open                 0.0/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:40.933Z|00061|coverage|INFO|vconn_received             0.6/sec     0.183/sec        0.0031/sec   total: 13
2024-05-21T13:54:40.933Z|00062|coverage|INFO|vconn_sent                 0.0/sec   797.950/sec       13.2992/sec   total: 71999
2024-05-21T13:54:40.933Z|00063|coverage|INFO|netlink_received           0.0/sec     0.383/sec        0.0064/sec   total: 27
2024-05-21T13:54:40.933Z|00064|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.100/sec        0.0017/sec   total: 7
2024-05-21T13:54:40.933Z|00065|coverage|INFO|netlink_sent               0.0/sec     0.383/sec        0.0064/sec   total: 27
2024-05-21T13:54:40.933Z|00066|coverage|INFO|cmap_expand                0.0/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:40.933Z|00067|coverage|INFO|109 events never hit
2024-05-21T13:54:40.933Z|00068|poll_loop|INFO|Dropped 104 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
2024-05-21T13:54:40.933Z|00069|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/run/openvswitch/db.sock) at lib/stream-fd.c:157 (99% CPU usage)
2024-05-21T13:54:40.933Z|00070|memory|INFO|peak resident set size grew 609% in last 13.2 seconds, from 142464 kB to 1009804 kB
2024-05-21T13:54:40.934Z|00071|memory|INFO|idl-cells:3217711 idl-outstanding-txns:1 lflow-cache-entries-cache-expr:11927 lflow-cache-entries-cache-matches:11972 lflow-cache-size-KB:5338 local_datapath_usage-KB:1 ofctrl_desired_flow_usage-KB:17282 ofctrl_installed_flow_usage-KB:12785 ofctrl_sb_flow_ref_usage-KB:8049 oflow_update_usage-KB:1

### ========> Private network connectivity came back

2024-05-21T13:54:45.508Z|00072|inc_proc_eng|INFO|node: logical_flow_output, recompute ((null)) took 4383ms
2024-05-21T13:54:45.627Z|00073|timeval|WARN|Unreasonably long 4693ms poll interval (4587ms user, 105ms system)
2024-05-21T13:54:45.627Z|00074|timeval|WARN|faults: 74381 minor, 0 major
2024-05-21T13:54:45.627Z|00075|timeval|WARN|disk: 0 reads, 8 writes
2024-05-21T13:54:45.627Z|00076|timeval|WARN|context switches: 0 voluntary, 23 involuntary
2024-05-21T13:54:45.627Z|00077|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=07ea1ea1:
2024-05-21T13:54:45.627Z|00078|coverage|INFO|lflow_run                  0.0/sec     0.033/sec        0.0006/sec   total: 3
2024-05-21T13:54:45.627Z|00079|coverage|INFO|consider_logical_flow    53713.0/sec  4478.000/sec       74.6333/sec   total: 537358
2024-05-21T13:54:45.627Z|00080|coverage|INFO|lflow_cache_add_expr     2385.0/sec   198.783/sec        3.3131/sec   total: 11927
2024-05-21T13:54:45.627Z|00081|coverage|INFO|lflow_cache_add_matches  2376.4/sec   199.533/sec        3.3256/sec   total: 11972
2024-05-21T13:54:45.627Z|00082|coverage|INFO|lflow_cache_add          4761.4/sec   398.317/sec        6.6386/sec   total: 23899
2024-05-21T13:54:45.627Z|00083|coverage|INFO|lflow_cache_hit            0.8/sec     6.233/sec        0.1039/sec   total: 24412
2024-05-21T13:54:45.627Z|00084|coverage|INFO|lflow_cache_miss         20635.6/sec  1722.867/sec       28.7144/sec   total: 182794
2024-05-21T13:54:45.627Z|00085|coverage|INFO|lflow_conj_alloc           0.0/sec     0.133/sec        0.0022/sec   total: 12
2024-05-21T13:54:45.627Z|00086|coverage|INFO|lflow_conj_free            0.0/sec     0.067/sec        0.0011/sec   total: 4
2024-05-21T13:54:45.627Z|00087|coverage|INFO|physical_run               0.0/sec     0.050/sec        0.0008/sec   total: 4
2024-05-21T13:54:45.627Z|00088|coverage|INFO|miniflow_malloc          9644.8/sec  3195.867/sec       53.2644/sec   total: 263741
2024-05-21T13:54:45.627Z|00089|coverage|INFO|hmap_pathological         94.8/sec    11.883/sec        0.1981/sec   total: 1171
2024-05-21T13:54:45.627Z|00090|coverage|INFO|hmap_expand              23651.6/sec  6175.667/sec      102.9278/sec   total: 414318
2024-05-21T13:54:45.628Z|00091|coverage|INFO|txn_unchanged             44.6/sec    18.633/sec        0.3106/sec   total: 1120
2024-05-21T13:54:45.628Z|00092|coverage|INFO|txn_incomplete             0.2/sec     0.083/sec        0.0014/sec   total: 5
2024-05-21T13:54:45.628Z|00093|coverage|INFO|txn_success                0.0/sec     0.050/sec        0.0008/sec   total: 4
2024-05-21T13:54:45.628Z|00094|coverage|INFO|poll_create_node         139.6/sec   152.800/sec        2.5467/sec   total: 9192
2024-05-21T13:54:45.628Z|00095|coverage|INFO|poll_zero_timeout          0.2/sec     0.100/sec        0.0017/sec   total: 6
2024-05-21T13:54:45.628Z|00096|coverage|INFO|rconn_queued             4824.6/sec  1199.950/sec       19.9992/sec   total: 72020
2024-05-21T13:54:45.628Z|00097|coverage|INFO|rconn_sent               4824.6/sec  1199.950/sec       19.9992/sec   total: 72020
2024-05-21T13:54:45.628Z|00098|coverage|INFO|seq_change                24.8/sec    57.883/sec        0.9647/sec   total: 3485
2024-05-21T13:54:45.628Z|00099|coverage|INFO|pstream_open               0.0/sec     0.017/sec        0.0003/sec   total: 1
2024-05-21T13:54:45.628Z|00100|coverage|INFO|stream_open                0.0/sec     0.083/sec        0.0014/sec   total: 5
2024-05-21T13:54:45.628Z|00101|coverage|INFO|util_xalloc              3221389.6/sec 483567.100/sec     8059.4517/sec   total: 33771677
2024-05-21T13:54:45.628Z|00102|coverage|INFO|vconn_open                 0.0/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:45.628Z|00103|coverage|INFO|vconn_received             0.8/sec     0.250/sec        0.0042/sec   total: 18
2024-05-21T13:54:45.628Z|00104|coverage|INFO|vconn_sent               4824.6/sec  1200.000/sec       20.0000/sec   total: 72023
2024-05-21T13:54:45.628Z|00105|coverage|INFO|netlink_received           0.8/sec     0.450/sec        0.0075/sec   total: 31
2024-05-21T13:54:45.628Z|00106|coverage|INFO|netlink_recv_jumbo         0.2/sec     0.117/sec        0.0019/sec   total: 8
2024-05-21T13:54:45.628Z|00107|coverage|INFO|netlink_sent               0.8/sec     0.450/sec        0.0075/sec   total: 31
2024-05-21T13:54:45.628Z|00108|coverage|INFO|cmap_expand                0.0/sec     0.050/sec        0.0008/sec   total: 3
2024-05-21T13:54:45.628Z|00109|coverage|INFO|109 events never hit
2024-05-21T13:54:45.628Z|00110|poll_loop|INFO|Dropped 1 log messages in last 5 seconds (most recently, 5 seconds ago) due to excessive rate
2024-05-21T13:54:45.628Z|00111|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 (172.25.2.3:33726<->172.25.0.1:6642) at lib/stream-fd.c:157 (102% CPU usage)

### ========> Connected every where

Testing with ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=20000 did not result in any change, as it seems ovn-controller goes straight to purging the flows when it starts!

@booxter Also, my connectivity test is:

  1. ping -i 0.1 to vm1, which has a public address, on stg1-compute2003 (the compute under test), from my device
  2. ping -i 0.1 to vm2, which is on a private network on stg1-compute2003, from another compute

When ovn-controller purges the flows, the ping to the private interface freezes, and the ping to the public one times out:

64 bytes from x.x.x.x: icmp_seq=47061 ttl=54 time=9.041 ms
64 bytes from x.x.x.x: icmp_seq=47062 ttl=54 time=8.427 ms
Request timeout for icmp_seq 47063
Request timeout for icmp_seq 47064
Request timeout for icmp_seq 47065
.
.
.
Request timeout for icmp_seq 47260
Request timeout for icmp_seq 47261
Request timeout for icmp_seq 47262
64 bytes from x.x.x.x: icmp_seq=47263 ttl=54 time=11.204 ms
64 bytes from x.x.x.x: icmp_seq=47264 ttl=54 time=8.241 ms

legitYosal commented 4 months ago

From reading the ovn-ofctrl-wait-before-clear documentation multiple times, I understand that setting this option will delay purging the flows until the recompute is done, but because there are so many flows, even purging and reinstalling them takes too long:

(ovn-controller)[root@stg1-compute2003 /]# ovs-appctl -t /var/run/openvswitch/ovs-vswitchd.16.ctl bridge/dump-flows br-int | wc -l
71977

OK, this is staging, but in production we have hosts with almost 80K flows, and around 200K flows in total in the southbound DB.

So is the problem here not solvable? Could we tell ovn-controller not to purge the flows?

numansiddique commented 4 months ago

In order to solve your issue, ovn-controller should first get a dump of the installed flows from ovs-vswitchd on startup and then sync the flows, i.e., delete or add only the required flows. This is possible, but complicated.
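For reference, the OpenFlow flows currently installed in ovs-vswitchd (the ones such a sync would have to reconcile against the desired flows) can be dumped directly, e.g. (br-int normally speaks OpenFlow 1.3/1.5, hence the -O flag):

ovs-ofctl -O OpenFlow15 dump-flows br-int | wc -l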

booxter commented 4 months ago

git tag --contains 896adfd2d8b3369110e9618bd190d190105372a9 suggests that support for ovn-ofctrl-wait-before-clear arrived in v22.06.0, and you are running 22.03.

booxter commented 4 months ago

Here's the commit for your reference: https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9

dceara commented 4 months ago

git tag --contains 896adfd2d8b3369110e9618bd190d190105372a9 suggests that support for ovn-ofctrl-wait-before-clear arrived in v22.06.0, and you are running 22.03.

Actually, the support for that knob has been backported to 22.03 too (for scalability reasons): https://github.com/ovn-org/ovn/commit/4a34b878d02464266c2b7ff2779de121b130e065

It's in there since v22.03.2.

@legitYosal your ovn-controller log says you're running 22.03.0-20.21.0-58.3, could you please upgrade to the latest v22.03.7 and retest? The knob doesn't do anything in the version you're currently running.
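As a quick sanity check after upgrading (generic commands; socket paths may differ in containerized deployments), the version actually running can be confirmed with something like:

ovn-controller --version
ovn-appctl -t ovn-controller version

The second form asks the already-running daemon over its control socket, so it reflects the binary that is actually in use.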

hzhou8 commented 4 months ago

From reading the ovn-ofctrl-wait-before-clear documentation multiple times, I understand that setting this option will delay purging the flows until the recompute is done, but because there are so many flows, even purging and reinstalling them takes too long.

OK, this is staging, but in production we have hosts with almost 80K flows, and around 200K flows in total in the southbound DB.

So is the problem here not solvable? Could we tell ovn-controller not to purge the flows?

@legitYosal ovn-ofctrl-wait-before-clear should help in your case.

For "purging and reinstalling them takes too long", it is also solved by replacing the flows in OVS bundle - as a single transaction. It is the patch d53c599ed0, which is after the ovn-ofctrl-wait-before-clear patch 896adfd2d8b. However, the patch d53c599ed0 is not in branch-22.03, but only after 22.06. You may try 22.06, or backport to 22.03 by yourself (for backporting you will need e50111213, too)

dceara commented 4 months ago

From reading the ovn-ofctrl-wait-before-clear documentation multiple times, I understand that setting this option will delay purging the flows until the recompute is done, but because there are so many flows, even purging and reinstalling them takes too long.

OK, this is staging, but in production we have hosts with almost 80K flows, and around 200K flows in total in the southbound DB.

So is the problem here not solvable? Could we tell ovn-controller not to purge the flows?

@legitYosal ovn-ofctrl-wait-before-clear should help in your case.

For "purging and reinstalling them takes too long", that is also addressed by replacing the flows in an OVS bundle, as a single transaction. That is patch d53c599, which came after the ovn-ofctrl-wait-before-clear patch 896adfd. However, patch d53c599 is not in branch-22.03; it is only in 22.06 and later. You may try 22.06, or backport it to 22.03 yourself (for the backport you will also need e501112).

Actually, both of these are available in branch-22.03: https://github.com/ovn-org/ovn/commit/ebfbedd0ceda723d5f78773c965529ee136a5720 https://github.com/ovn-org/ovn/commit/9a0e90be73af6f9d16765286d1c1734e91bc7d8d

Using the latest v22.03.7 tag should be fine.

hzhou8 commented 4 months ago

Actually, both of these are available in branch-22.03: https://github.com/ovn-org/ovn/commit/ebfbedd0ceda723d5f78773c965529ee136a5720 https://github.com/ovn-org/ovn/commit/9a0e90be73af6f9d16765286d1c1734e91bc7d8d

Using the latest v22.03.7 tag should be fine.

Thanks @dceara for correcting me. I made a mistake when checking the branches.

legitYosal commented 4 months ago

Thank you for sharing your knowledge. I have tested with the ovn-24.03.1 build with ovs-3.3.0 and it worked flawlessly, as described. In production I will go with v22.03.7, as @dceara mentioned.