Open lukasstockner opened 2 years ago
Could you do the following:
swssloglevel -l INFO -a
config vxlan map delete vtep 1000 10000
tail -f /var/log/syslog &
config vxlan map add vtep 1000 10000
and send the logs showing the VXLAN tunnel being created?
I am facing an issue here that might be related.
@aseaudi I'm currently testing other images on the switches, but I think the following section from the log in the techsupport dump is what you want:
Feb 21 20:09:59.931907 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_process_route_add_mode_host_only:733 SAI Enter _brcm_sai_mptnl_process_route_add_mode_host_only
Feb 21 20:09:59.931907 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_process_route_add_mode_host_only:779 SAI Exit _brcm_sai_mptnl_process_route_add_mode_host_only
Feb 21 20:09:59.935666 localhost WARNING swss#orchagent: :- createTunnelHw: creation src = 1
Feb 21 20:09:59.935775 localhost NOTICE swss#orchagent: :- create_tunnel: create_tunnel:encapmaplist[0]=0x29000000000613
Feb 21 20:09:59.935904 localhost NOTICE swss#orchagent: :- create_tunnel: create_tunnel:encapmaplist[1]=0x29000000000615
Feb 21 20:09:59.937556 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:brcm_sai_tnl_mp_create_tunnel:3049 SAI Enter brcm_sai_tnl_mp_create_tunnel
Feb 21 20:09:59.937556 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:brcm_sai_tnl_mp_create_tunnel:3138 Setting peer_mode to 0
Feb 21 20:09:59.937556 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:brcm_sai_tnl_mp_create_tunnel:3285 Created tunnel id: 2
Feb 21 20:09:59.940214 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_route_dst_tnl_cnt:911 SAI Enter _brcm_sai_mptnl_route_dst_tnl_cnt
Feb 21 20:09:59.940214 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:mptnl_xgs_flexflow_create_sipdip_tnl:1696 SDK dscp_mode(UNIFORM)
Feb 21 20:09:59.940214 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:mptnl_xgs_flexflow_create_sipdip_tnl:1710 SDK ttl_mode(UNIFORM)
Feb 21 20:09:59.940214 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:mptnl_xgs_flexflow_create_sipdip_tnl:1775 tunnel_id (1275068419) flags (0) valid_elements (909) dscp_sel (0x1) dscp (0)
Feb 21 20:09:59.941787 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_process_tnl_route_add_tunnel_event:646 SAI Enter _brcm_sai_mptnl_process_tnl_route_add_tunnel_event
Feb 21 20:09:59.941787 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_find_external_best_route:596 SAI Enter _brcm_sai_mptnl_find_external_best_route
Feb 21 20:09:59.941916 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_find_external_best_route:634 SAI Exit _brcm_sai_mptnl_find_external_best_route
Feb 21 20:09:59.941916 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_route_add_dip_tunnel:142 SAI Enter _brcm_sai_mptnl_route_add_dip_tunnel
Feb 21 20:09:59.941916 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_route_add_dip_tunnel:212 SAI Exit _brcm_sai_mptnl_route_add_dip_tunnel
Feb 21 20:09:59.942038 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_tnl_route_event_add:391 SAI Enter _brcm_sai_mptnl_tnl_route_event_add
Feb 21 20:09:59.943761 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_tnl_route_event_add:538 SAI Exit _brcm_sai_mptnl_tnl_route_event_add
Feb 21 20:09:59.943761 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:_brcm_sai_mptnl_process_tnl_route_add_tunnel_event:668 SAI Exit _brcm_sai_mptnl_process_tnl_route_add_tunnel_event
Feb 21 20:09:59.943761 localhost INFO syncd#syncd: [none] SAI_API_TUNNEL:brcm_sai_tnl_mp_create_tunnel:3313 SAI Exit brcm_sai_tnl_mp_create_tunnel
Feb 21 20:09:59.944839 localhost NOTICE swss#orchagent: :- createDynamicDIPTunnel: Created P2P Tunnel remote IP 1.1.1.1
Feb 21 20:09:59.944839 localhost NOTICE swss#orchagent: :- addTunnelUser: diprefcnt for remote 1.1.1.1 = 1
Feb 21 20:09:59.946139 localhost NOTICE swss#orchagent: :- addBridgePort: Add bridge port Port_EVPN_1.1.1.1 to default 1Q bridge
Feb 21 20:09:59.946297 localhost ERR swss#orchagent: :- meta_sai_on_port_state_change_single: data.port_id oid:0x2a00000000061a has unexpected type: SAI_OBJECT_TYPE_TUNNEL, expected PORT, BRIDGE_PORT or LAG
Feb 21 20:09:59.947895 localhost NOTICE swss#orchagent: :- addVlanMember: Add member Port_EVPN_1.1.1.1 to VLAN Vlan1000 vid:1000 pid0
Feb 21 20:09:59.947895 localhost ERR swss#orchagent: :- setPortPvid: pvid setting for tunnel Port_EVPN_1.1.1.1 is not allowed
Feb 21 20:09:59.948454 localhost INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_table_add:51 fdbEvent:FDB table add: MAC:D8-5E-D3-84-76-7F vfi 0x73e8 port:0x2a043a00000002 vlan:1000 is_static 0 is_remote 128
Feb 21 20:09:59.948454 localhost INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:D8-5E-D3-84-76-7F port_tid:0xb0000003 port_type:Port vid:0x73e8
Feb 21 20:09:59.949356 localhost INFO syncd#syncd: [none] SAI_API_FDB:_brcm_sai_fdb_table_add:51 fdbEvent:FDB table add: MAC:D8-5E-D3-84-76-8F vfi 0x73e8 port:0x2a043a00000002 vlan:1000 is_static 0 is_remote 128
Feb 21 20:09:59.949415 localhost INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_create_fdb_entry:707 FDB Create: MAC:D8-5E-D3-84-76-8F port_tid:0xb0000003 port_type:Port vid:0x73e8
Feb 21 20:09:59.949852 localhost NOTICE swss#orchagent: :- doTask: Get port state change notification id:2a00000000061a status:1
Feb 21 20:09:59.949988 localhost ERR swss#orchagent: :- doTask: Failed to get port object for port id 0x2a00000000061a
For me, the oper_down error could be fixed by applying https://github.com/Azure/sonic-swss/pull/2080, maybe also try that?
The ARP problem still persists, though.
@lukasstockner You have this error:
ERR swss#orchagent: :- meta_sai_on_port_state_change_single: data.port_id oid:0x2a00000000061a has unexpected type: SAI_OBJECT_TYPE_TUNNEL, expected PORT, BRIDGE_PORT or LAG
What happens after that? Does the VXLAN tunnel appear in the "show vxlan remotevtep" output? What is the output of "bridge fdb show br Bridge"? In my case, after the error, I get a log message saying orchagent is exiting, and the swss container restarts a couple of times and finally fails.
@aseaudi After the log snippet that I posted above, swss/orchagent keeps running in my case and the tunnel is working (except for the ARP issue). See the full log in the techsupport dump for details.
show vxlan remotevtep shows oper_down for me, unless I apply the PR that I linked above - in that case, it correctly shows oper_up instead, with no actual change to the tunnel behavior.
bridge fdb show br Bridge shows:
b0:26:28:35:0e:01 dev Ethernet128 vlan 1000 master Bridge
33:33:00:00:00:01 dev Ethernet128 self permanent
33:33:00:00:00:02 dev Ethernet128 self permanent
01:00:5e:00:00:01 dev Ethernet128 self permanent
33:33:ff:97:21:ce dev Ethernet128 self permanent
33:33:ff:00:00:00 dev Ethernet128 self permanent
01:80:c2:00:00:0e dev Ethernet128 self permanent
01:80:c2:00:00:03 dev Ethernet128 self permanent
01:80:c2:00:00:00 dev Ethernet128 self permanent
33:33:00:00:00:01 dev Bridge self permanent
33:33:00:00:00:02 dev Bridge self permanent
01:00:5e:00:00:01 dev Bridge self permanent
33:33:ff:4f:8e:84 dev Bridge self permanent
33:33:ff:00:00:00 dev Bridge self permanent
01:80:c2:00:00:21 dev Bridge self permanent
33:33:ff:97:21:ce dev Bridge self permanent
0c:48:c6:97:21:ce dev Bridge vlan 1000 master Bridge permanent
0c:48:c6:97:21:ce dev Bridge master Bridge permanent
7e:f9:d1:ab:97:1d dev dummy vlan 1 master Bridge permanent
7e:f9:d1:ab:97:1d dev dummy master Bridge permanent
33:33:00:00:00:01 dev dummy self permanent
b0:26:28:35:0e:00 dev vtep-1000 vlan 1000 extern_learn master Bridge
00:00:00:00:00:00 dev vtep-1000 dst 2.2.2.2 self permanent
b0:26:28:35:0e:00 dev vtep-1000 dst 2.2.2.2 self extern_learn
Adam will find someone in BRCM to take a look. Thanks.
I was troubleshooting the same issue on my Edgecore as8535-54x with SONiC 202012, and I noticed that the ARP packet is encapsulated in a VXLAN packet with ttl=0 and is dropped by the next switch en route to the destination VTEP.
So, the ARP request was dropped by the intermediate switch, which answered with an ICMP time exceeded message.
I don't know if this is normal, or if this is something configurable in SONiC.
16:32:52.234085 IP (tos 0x0, id 2994, offset 0, flags [none], proto UDP (17), length 96)
4.4.4.4.61446 > 2.2.2.2.4789: [no cksum] VXLAN, flags [I] (0x08), vni 50
ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.50.18 tell 192.168.50.11, length 46
16:32:52.234183 IP (tos 0xc0, ttl 64, id 16733, offset 0, flags [none], proto ICMP (1), length 124)
10.3.4.3 > 4.4.4.4: ICMP time exceeded in-transit, length 104
IP (tos 0x0, id 2994, offset 0, flags [none], proto UDP (17), length 96)
4.4.4.4.61446 > 2.2.2.2.4789: [no cksum] VXLAN, flags [I] (0x08), vni 50
ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.50.18 tell 192.168.50.11, length 46
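To make the capture above easier to read: an IPv4 router discards a packet whose TTL would reach zero when forwarded and replies with ICMP time exceeded, which is what 10.3.4.3 does here. The following is only a minimal sketch of that rule; `forward_one_hop` and `kDropped` are hypothetical names used for illustration and are not taken from SONiC or any switch SDK.

```cpp
#include <cstdint>

// Hypothetical marker meaning "the hop discarded the packet".
constexpr int kDropped = -1;

// Illustrative model of the TTL check a forwarding hop performs.
// Returns the TTL the packet leaves with, or kDropped if the hop
// discards it (and would normally send ICMP time exceeded back).
constexpr int forward_one_hop(std::uint8_t ttl_in) {
    return ttl_in <= 1 ? kDropped : ttl_in - 1;
}

// An outer header with TTL 0, as in the ARP capture, dies at the first hop.
static_assert(forward_one_hop(0) == kDropped, "TTL 0 is dropped");
// An outer header with TTL 64, as in the IPv6 capture, is forwarded.
static_assert(forward_one_hop(64) == 63, "TTL 64 is forwarded with TTL 63");
```

This matches the symptom in the thread: the VXLAN-encapsulated ARP request never gets past the first routed hop, while the encapsulated IPv6 neighbor solicitation does.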
IPv6 is encapsulated in VXLAN with TTL = 64:
21:40:56.806808 IP (tos 0x0, ttl 64, id 61067, offset 0, flags [none], proto UDP (17), length 122)
4.4.4.4.40641 > 2.2.2.2.4789: [udp sum ok] VXLAN, flags [I] (0x08), vni 50
IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2001::1 > ff02::1:ff00:2: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2001::2
source link-address option (1), length 8 (1): f8:8e:a1:e0:72:11
0x0000: f88e a1e0 7211
21:40:57.830709 IP (tos 0x0, ttl 64, id 61141, offset 0, flags [none], proto UDP (17), length 122)
4.4.4.4.40641 > 2.2.2.2.4789: [udp sum ok] VXLAN, flags [I] (0x08), vni 50
IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2001::1 > ff02::1:ff00:2: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2001::2
source link-address option (1), length 8 (1): f8:8e:a1:e0:72:11
0x0000: f88e a1e0 7211
We too had issues getting ARP to work a while back but gave up debugging it. See https://github.com/kamelnetworks/sonic/issues/9 for our own notes. Basically IPv4 w/ static ARP and IPv6 worked fine, ARP did not. We assumed it was something to do with ARP suppression at the time, but that was just a hunch.
I changed the VXLAN tunnel attribute in orchagent from the default UNIFORM_MODEL to PIPE_MODEL with TTL = 64, and now ARP and ping are working over the P2P VXLAN tunnel.
attr.id = SAI_TUNNEL_ATTR_ENCAP_TTL_MODE;
attr.value.s32 = SAI_TUNNEL_TTL_MODE_PIPE_MODEL;
tunnel_attrs.push_back(attr);
attr.id = SAI_TUNNEL_ATTR_ENCAP_TTL_VAL;
attr.value.u8 = 64;
tunnel_attrs.push_back(attr);
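As a rough illustration of why this change helps: in the pipe model the outer TTL is a fixed configured value, whereas in the uniform model it is derived from the inner packet, and an ARP frame has no IP header to derive it from. The sketch below is a simplified model under stated assumptions, not the SAI or Broadcom implementation. `TtlMode`, `kNoInnerIp` and `outer_ttl` are hypothetical names, and the fallback of 0 for non-IP payloads is an assumption chosen only to match the ARP capture in this thread. The IPv6 capture shows an outer TTL of 64 for an inner hop limit of 255, so the hardware evidently does not follow the textbook uniform model exactly either.

```cpp
// Illustrative stand-ins for the two TTL modes; these are NOT the SAI enums.
enum class TtlMode { Uniform, Pipe };

// Hypothetical marker: the encapsulated frame carries no IP header (e.g. ARP).
constexpr int kNoInnerIp = -1;

// Simplified model of the outer-header TTL chosen at encapsulation time.
//   Pipe:    always the configured encap_ttl.
//   Uniform: copied from the inner IP header; with no inner IP header we
//            ASSUME 0, which is what the ARP capture in this thread shows.
constexpr int outer_ttl(TtlMode mode, int inner_ttl, int encap_ttl) {
    return mode == TtlMode::Pipe
               ? encap_ttl
               : (inner_ttl != kNoInnerIp ? inner_ttl : 0);
}

// ARP under the default uniform mode with encap_ttl left at zero.
static_assert(outer_ttl(TtlMode::Uniform, kNoInnerIp, 0) == 0, "ARP gets TTL 0");
// ARP after switching to pipe mode with TTL 64.
static_assert(outer_ttl(TtlMode::Pipe, kNoInnerIp, 64) == 64, "ARP gets TTL 64");
```

With pipe mode the outer TTL no longer depends on what is inside the tunnel, so ARP and IP payloads are treated the same way.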
@aseaudi Fantastic find, thanks! I can confirm that that change makes it work.
Looks like the code already supports specifying an encap_ttl, but all callers just leave it at zero.
This issue still exists. Is there a way to define encap_ttl somewhere for P2P tunnels, or is altering the source still the way to go? I'm experiencing it on two Dell 5248 switches with the TD3 ASIC.
Sorry, in which file do I change the TTL? Thanks.
In sonic-swss/orchagent/vxlanorch.cpp, in the source before compiling.
Description
When configuring an L2 VXLAN-EVPN overlay on Celestica Seastone2 (Trident3-based) switches in a simple test setup, ARP packets don't reach the server on the other switch.
IPv4 unicast traffic works just fine after manually adding the ARP entries on the servers, IPv6 (including ND) works just fine out of the box, and other broadcast traffic (e.g. a simple ping to the broadcast address) also arrives, so the problem appears to be related to ARP itself, not BUM traffic in general.
Steps to reproduce the issue:
Describe the results you received:
IPv6 pings to link-local and manually configured addresses work just fine, while IPv4 pings fail due to not receiving an ARP reply. After adding a static ARP table entry on both servers, IPv4 pings also work just fine, and traffic between the servers flows at line rate.
Describe the results you expected:
Both IPv4 and IPv6 should work.
Output of show version:
Output of show techsupport:
sonic_dump_localhost_20220221_201100.tar.gz
Additional information you deem important (e.g. issue happens only occasionally):
Since a few other issues mention that L2VPN worked for them, I've tried several older versions going back to July 2021, but all of them had the same problem.
202111 and master have a different issue that prevents the tunnel from coming up at all; I'll create a separate issue for that.
The logs contain an error related to setting the port status - backporting https://github.com/Azure/sonic-swss/pull/2080 fixes this, but the ARP problem remains.
Giving the switches an IP on the VLAN makes the ARP requests in question show up on the Linux interfaces when using tcpdump, but there's no ARP reply to be seen.
From spamming ARP requests from one server and checking port counters, it appears that the destination switch receives the packets and drops them instead of decapping and sending them to the second server.
The image I'm running for the output above is based on 7a35504ff, with a few platform-related fixes that I still have to upstream. None of them should have any impact on the dataplane, it's just Python platform module stuff.