sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[EVPN]When EVPN NVO config arrives later than remote VNI entries, the remote entries don't get added #14949

Open dgsudharsan opened 1 year ago

dgsudharsan commented 1 year ago

Description

Sometimes during config reload, the EVPN NVO table arrives later than the remote VNI table entries. In such scenarios, the remote VNI entries are ignored, which leads to traffic loss.

2023-04-30.14:23:40.528130|VXLAN_TUNNEL_MAP_TABLE:vtep1:map_98_Vlan98|SET|vlan:Vlan98|vni:98
2023-04-30.14:23:40.559594|VXLAN_TUNNEL_MAP_TABLE:vtep1:map_99_Vlan99|SET|vlan:Vlan99|vni:99
2023-04-30.14:23:40.572133|VXLAN_REMOTE_VNI_TABLE:Vlan98:1.1.1.1|SET|vni:98
2023-04-30.14:23:40.572180|VXLAN_REMOTE_VNI_TABLE:Vlan98:1.1.1.2|SET|vni:98
2023-04-30.14:23:40.575208|VXLAN_FDB_TABLE:Vlan98:04:3f:72:f7:2d:52|SET|remote_vtep:1.1.1.2|type:dynamic|vni:98
2023-04-30.14:23:40.575240|VXLAN_FDB_TABLE:Vlan98:0c:42:a1:6d:5b:94|SET|remote_vtep:1.1.1.1|type:dynamic|vni:98
2023-04-30.14:23:40.575249|VXLAN_FDB_TABLE:Vlan98:1c:34:da:2c:be:00|SET|remote_vtep:1.1.1.1|type:dynamic|vni:98
2023-04-30.14:23:40.575257|VXLAN_FDB_TABLE:Vlan98:1c:34:da:2c:ca:00|SET|remote_vtep:1.1.1.2|type:dynamic|vni:98
2023-04-30.14:23:40.589995|VXLAN_EVPN_NVO_TABLE:nvo1|SET|source_vtep:vtep1
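
The ordering problem in the trace above can be checked mechanically. The following is a minimal sketch, assuming each recorded line has the form `timestamp|TABLE:key|OP|field:value|...` as shown; the helper names `splitFields` and `remoteVniBeforeNvo` are hypothetical and are not part of SONiC.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one recorded line on '|' into its fields.
static std::vector<std::string> splitFields(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '|'))
    {
        fields.push_back(field);
    }
    return fields;
}

// Return the keys of VXLAN_REMOTE_VNI_TABLE SET operations that were
// recorded before the first VXLAN_EVPN_NVO_TABLE SET operation.
// If no NVO SET is present at all, every remote VNI key is returned.
std::vector<std::string> remoteVniBeforeNvo(const std::vector<std::string>& lines)
{
    std::vector<std::string> early;
    for (const auto& line : lines)
    {
        const auto fields = splitFields(line);
        if (fields.size() < 3 || fields[2] != "SET")
        {
            continue; // not a SET operation in the assumed format
        }
        // The second field is "TABLE:key"; the key itself may contain ':'.
        const auto colon = fields[1].find(':');
        if (colon == std::string::npos)
        {
            continue;
        }
        const std::string table = fields[1].substr(0, colon);
        const std::string key = fields[1].substr(colon + 1);
        if (table == "VXLAN_EVPN_NVO_TABLE")
        {
            break; // NVO has arrived; later entries are not affected
        }
        if (table == "VXLAN_REMOTE_VNI_TABLE")
        {
            early.push_back(key);
        }
    }
    return early;
}
```

Fed the nine lines above, this would report the two remote VNI keys for Vlan98, since both were recorded before the `VXLAN_EVPN_NVO_TABLE:nvo1` entry.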

Steps to reproduce the issue:

  1. Configure EVPN
  2. Perform config reload

Describe the results you received:

Remote entries are not added, leading to traffic loss.

Describe the results you expected:

No issues

dgsudharsan commented 1 year ago

The fix that was added breaks the earlier workaround https://github.com/sonic-net/sonic-swss/pull/2626, so I am requesting that the fix be reverted. Once we find a proper solution for https://github.com/sonic-net/sonic-buildimage/issues/12361, we need to reintegrate https://github.com/sonic-net/sonic-swss/pull/2756.

adyeung commented 1 year ago

@srj102 pls help take a look and share your analysis

srj102 commented 1 year ago

From the Techsupport added in #12361, it looks like VXLAN_EVPN_NVO was not configured, which led to the OA (orchagent) not processing the VXLAN_REMOTE_VNI table APP DB entries.

Before the workaround for swss#2626, the case of EVPN_NVO coming later would have been handled via the following check:

    if (!tunnel_orch->getTunnelPort(remote_vtep, tunnelPort))
    {
        SWSS_LOG_WARN("Vxlan tunnelPort doesn't exist: %s", remote_vtep.c_str());
        return false;
    }

However with the workaround we are seeing this issue.

@dgsudharsan can you please confirm this by removing the workaround made for swss#2626 ? It was agreed that this was a temporary workaround at that time for that specific branch.

dgsudharsan commented 1 year ago

@srj102 I don't think removing that workaround alone helps. That workaround is not present in the p2mp orch. When the EVPN NVO is not present, we need to retry instead of returning success. My change https://github.com/sonic-net/sonic-swss/pull/2756 did that, but it undid swss#2626.
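
The difference between retrying and returning success can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyVxlanState`, `TaskStatus`, `addRemoteVni`, and `drain` are hypothetical names invented for illustration, and this is not the orchagent code or the change in swss#2756. The assumption is that a handler which reports "need retry" leaves the entry pending so it is processed again on a later pass, while a handler that reports success causes the entry to be dropped.

```cpp
#include <set>
#include <string>

// Hypothetical status returned by a handler (illustration only).
enum class TaskStatus { Success, NeedRetry };

// Hypothetical, simplified state: whether the EVPN NVO has been configured
// and which remote VNI keys have actually been programmed.
struct ToyVxlanState
{
    bool nvoConfigured = false;
    std::set<std::string> installed;
};

// Handler that defers the entry while the NVO is absent instead of
// reporting success and silently ignoring it.
TaskStatus addRemoteVni(ToyVxlanState& state, const std::string& key)
{
    if (!state.nvoConfigured)
    {
        return TaskStatus::NeedRetry;
    }
    state.installed.insert(key);
    return TaskStatus::Success;
}

// One processing pass: erase only the entries that were handled
// successfully and leave the rest pending for the next pass.
void drain(ToyVxlanState& state, std::set<std::string>& pending)
{
    for (auto it = pending.begin(); it != pending.end();)
    {
        if (addRemoteVni(state, *it) == TaskStatus::Success)
        {
            it = pending.erase(it);
        }
        else
        {
            ++it;
        }
    }
}
```

In this model, a pass that runs before the NVO is configured leaves both Vlan98 entries pending, and a pass after it is configured installs them. If the handler returned success in the first case, the entries would be erased without being installed, which matches the ignored-entry symptom described in this issue.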

We have to find a proper solution for https://github.com/sonic-net/sonic-buildimage/issues/12361 and then reintegrate https://github.com/sonic-net/sonic-swss/pull/2756.

srj102 commented 1 year ago

Yes, for the p2mp case the changes made as part of 2756 will be required. p2p works without 2756 as well.

Since 2626 is a workaround with incomplete root-cause analysis, I believe it has to be removed from master. The changes made in 2756 are as expected; they need to be in master and should not be reverted.

dgsudharsan commented 1 year ago

@prsunny What is your feedback here? Should we remove the workaround https://github.com/sonic-net/sonic-swss/pull/2626 and reintroduce https://github.com/sonic-net/sonic-swss/pull/2756 in master? Is anyone debugging the root cause of https://github.com/sonic-net/sonic-buildimage/issues/12361 ?

prsunny commented 1 year ago

If we revert 2626, we will still have the warmboot issue, right?

dgsudharsan commented 1 year ago

@srj102 Can you please provide ETA for fixing this?