osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

OVN tunnel connections between gateway nodes are torn down by BFD #866

Closed SebastianBiedler closed 8 months ago

SebastianBiedler commented 8 months ago

We have an issue where tunnel connections between gateway nodes do not work. The BFD protocol tears down the connection, and OVS also reports problems with the tunnel ports:

tunnel(revalidator122)|WARN|receive tunnel port not found (udp,tun_id=0,tun_src=10.10.27.21,tun_dst=10.10.27.22,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,gtpu_flags=0,gtpu_msgtype=0,tun_flags=csum|key,in_port=4,vlan_tci=0x0000,dl_src=e6:41:1d:d3:97:12,dl_dst=00:23:20:00:00:01,nw_src=169.254.1.1,nw_dst=169.254.1.0,nw_tos=192,nw_ecn=0,nw_ttl=255,nw_frag=no,tp_src=49157,tp_dst=3784)
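For readers following along, the warning above can be picked apart with a few lines of Python. This is only a rough sketch: the helper names `parse_flow` and `is_bfd_control` are made up for illustration, and the flow string is a shortened copy of the logged one. It shows that the packet for which no tunnel port was found is addressed to UDP port 3784, the port RFC 5881 assigns to single-hop BFD control packets.

```python
# Shortened copy of the flow description from the warning above.
FLOW = ("udp,tun_id=0,tun_src=10.10.27.21,tun_dst=10.10.27.22,"
        "in_port=4,nw_src=169.254.1.1,nw_dst=169.254.1.0,"
        "tp_src=49157,tp_dst=3784")

BFD_CONTROL_PORT = 3784  # UDP port for single-hop BFD control packets (RFC 5881)


def parse_flow(flow: str) -> dict:
    """Split a comma-separated flow description into a key/value dict.

    Hypothetical helper: bare tokens without '=' (such as 'udp') are
    stored with the value True.
    """
    fields = {}
    for token in flow.split(","):
        key, sep, value = token.partition("=")
        fields[key.strip()] = value.strip() if sep else True
    return fields


def is_bfd_control(fields: dict) -> bool:
    """Return True if the flow is UDP and targets the BFD control port."""
    return fields.get("udp") is True and fields.get("tp_dst") == str(BFD_CONTROL_PORT)


fields = parse_flow(FLOW)
print(fields["tun_src"], "->", fields["tun_dst"], "BFD:", is_bfd_control(fields))
```

If the reading is right, the packets that OVS cannot match to a tunnel port are the BFD control packets themselves, sent from 10.10.27.21 to 10.10.27.22.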

The BFD log shows:

|00002|bfd(handler23)|INFO|ovn-net-du-1: BFD state change: down->init "Control Detection Time Expired"->"Control Detection Time Expired". Forwarding: false Detect Multiplier: 3 Concatenated Path Down: false TX Interval: Approx 1000ms RX Interval: Approx 1000ms Detect Time: now +2999ms Next TX Time: now +947ms Last TX Time: now -43ms

Local Flags: none Local Session State: down Local Diagnostic: Control Detection Time Expired Local Discriminator: 0x82198ef7 Local Minimum TX Interval: 1000ms Local Minimum RX Interval: 1000ms

Remote Flags: none Remote Session State: down Remote Diagnostic: No Diagnostic Remote Discriminator: 0xc80ac7ce Remote Minimum TX Interval: 1000ms Remote Minimum RX Interval: 1000ms Remote Detect Multiplier: 3
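As a side note on the numbers in this status output, here is a small sketch of how the detection time relates to the negotiated values, using the asynchronous-mode rule from RFC 5880 (section 6.8.4). The function name `detection_time_ms` is an assumption made for this example and is not taken from OVS.

```python
def detection_time_ms(remote_detect_mult: int,
                      local_min_rx_ms: int,
                      remote_min_tx_ms: int) -> int:
    """Asynchronous-mode detection time as described in RFC 5880, 6.8.4.

    It is the Detect Mult received from the remote system multiplied by
    the larger of the local Required Min RX interval and the remote
    Desired Min TX interval.
    """
    return remote_detect_mult * max(local_min_rx_ms, remote_min_tx_ms)


# Values from the status output above:
# Remote Detect Multiplier 3, Local Minimum RX 1000ms, Remote Minimum TX 1000ms.
print(detection_time_ms(3, 1000, 1000))  # 3000
```

The result of 3000 ms is in line with the logged "Detect Time: now +2999ms". If no BFD control packet arrives within that window, the session is declared down with the diagnostic "Control Detection Time Expired".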

During a restart of ovn-controller and Open vSwitch, the tunnel works for a second until BFD kicks in. The tunnel connections to the compute nodes, which are not monitored by BFD, work without any problems. Also, any problem with the underlying network can be excluded.

osfrickler commented 8 months ago

Not sure whether this will help, but can you list the MTU for the affected networks (overlay and underlay)? Also, which versions of OVN, OVS and Neutron are you using?

artificial-intelligence commented 8 months ago

Also any problem with the underlying network can be excluded.

Why can this be excluded? Did you already debug the underlying network and conclude that there is no error there? What were the actual debug steps and what were the results?

Thanks for any update.

SebastianBiedler commented 8 months ago

Hello,

I was able to debug the problem further. My first assumption, that the BFD protocol tears down the tunnel, seems to be wrong. Behind a so-called master router, the customer uses multiple routers that are connected over a project network which was declared as a provider network via RBAC rules. Communication between the master router and the project routers behind it over that Geneve provider network does not work when the two routers are not on the same gateway node.

I looked into the flow rules, and it seems that OVN does not add any information about where the gateway addresses of the project routers can be found. I am not sure whether this behavior is normal, or whether this design with a Geneve provider network or RBAC rules works in general. Unfortunately, I have no test environment at the moment.

osfrickler commented 8 months ago

That sounds like it may be either a bug in Neutron/OVN or an unsupported use case. Either way, I would suggest opening an upstream bug report with information on how to reproduce this.

artificial-intelligence commented 8 months ago

I'm not sure, but this sounds a little bit like this upstream bug (the nested router part):

https://bugs.launchpad.net/neutron/+bug/2051935

HTH