sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Inter-VLAN working EXCEPT for one that will not route to Ethernet0 (NO ACL/ACE) #18883

Open mbze430 opened 6 months ago

mbze430 commented 6 months ago

Description

I did NOT set up any ACL/ACE. Every VLAN can communicate with every other VLAN and with Ethernet0, except for one: in my case, Vlan53. It was working 3 days ago and then just stopped.

Ethernet0 is a routed port on 192.168.1.0/29: 192.168.1.2 is Ethernet0 and 192.168.1.1 is the uplink to the internet.

This might also be related to the other issue, #18871, where the switch locks up and reboots on its own.

Steps to reproduce the issue:

random

Describe the results you received:

All the VLANs are getting their respective DHCP leases, including Vlan53. Devices can ping each other across every VLAN combination, but no device in Vlan53 can ping Ethernet0 (192.168.1.2).

Describe the results you expected:

Every VLAN should be able to communicate with every other VLAN and with Ethernet0, since there are no ACLs/ACEs.

Output of show version:

admin@sonic:~$ show ver

SONiC Software Version: SONiC.202311.537841-1d8111206
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 1d8111206
Build date: Fri May 3 12:22:51 UTC 2024
Built by: AzDevOps@vmss-soni003LFS

Platform: x86_64-cel_seastone-r0
HwSKU: Seastone-DX010
ASIC: broadcom
ASIC Count: 1
Serial Number: DX010B2F108423LK100045
Model Number: R0872-F0010-01
Hardware Revision: N/A
Uptime: 10:17:30 up 26 min, 1 user, load average: 2.13, 1.85, 1.33
Date: Mon 06 May 2024 10:17:30

Docker images:
REPOSITORY                   TAG                      IMAGE ID      SIZE
docker-gbsyncd-broncos       202311.537841-1d8111206  0b33b5395ac8  351MB
docker-gbsyncd-broncos       latest                   0b33b5395ac8  351MB
docker-gbsyncd-credo         202311.537841-1d8111206  58237914ace2  324MB
docker-gbsyncd-credo         latest                   58237914ace2  324MB
docker-syncd-brcm            202311.537841-1d8111206  514b8888cdfe  715MB
docker-syncd-brcm            latest                   514b8888cdfe  715MB
docker-macsec                latest                   41da2d3f8dae  329MB
docker-dhcp-relay            latest                   3671e2b3167c  310MB
docker-orchagent             202311.537841-1d8111206  177f5267aae0  339MB
docker-orchagent             latest                   177f5267aae0  339MB
docker-fpm-frr               202311.537841-1d8111206  5b73f1d3b550  359MB
docker-fpm-frr               latest                   5b73f1d3b550  359MB
docker-eventd                202311.537841-1d8111206  6eeff2f1cc30  301MB
docker-eventd                latest                   6eeff2f1cc30  301MB
docker-nat                   202311.537841-1d8111206  0c7ebfdf485a  330MB
docker-nat                   latest                   0c7ebfdf485a  330MB
docker-sflow                 202311.537841-1d8111206  0fa4a48b1334  328MB
docker-sflow                 latest                   0fa4a48b1334  328MB
docker-teamd                 202311.537841-1d8111206  f00b63addfe7  327MB
docker-teamd                 latest                   f00b63addfe7  327MB
docker-platform-monitor      202311.537841-1d8111206  0b57c7f05462  421MB
docker-platform-monitor      latest                   0b57c7f05462  421MB
docker-snmp                  202311.537841-1d8111206  55ba11023b0a  340MB
docker-snmp                  latest                   55ba11023b0a  340MB
docker-router-advertiser     202311.537841-1d8111206  d5dd84231b8c  301MB
docker-router-advertiser     latest                   d5dd84231b8c  301MB
docker-lldp                  202311.537841-1d8111206  acec19f65787  343MB
docker-lldp                  latest                   acec19f65787  343MB
docker-sonic-gnmi            202311.537841-1d8111206  6a49929fdb2d  389MB
docker-sonic-gnmi            latest                   6a49929fdb2d  389MB
docker-database              202311.537841-1d8111206  f753761666aa  301MB
docker-database              latest                   f753761666aa  301MB
docker-mux                   202311.537841-1d8111206  28175c8e8560  349MB
docker-mux                   latest                   28175c8e8560  349MB
docker-sonic-mgmt-framework  202311.537841-1d8111206  6c7222bfa1a3  416MB
docker-sonic-mgmt-framework  latest                   6c7222bfa1a3  416MB

Output of show techsupport:

sonic_dump_sonic_20240506_101451.tar.gz

anilkpan commented 6 months ago

@mbze430 In the logs, I see that the system rebooted multiple times, every few hours, so the issue might be related to https://github.com/sonic-net/sonic-buildimage/issues/18871. But that should affect all VLANs. Does it recover and start working after the system comes up?

mbze430 commented 6 months ago

@anilkpan This is why it's so strange to me. After a reboot, it is still the same.

I ran 'show mac' and it does show all the devices in the problem VLAN. But 'show arp' shows NO devices at all for VLAN53.
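The symptom above (devices learned in the MAC table but absent from the ARP table) can be checked mechanically. A minimal Python sketch, assuming the two tables have already been parsed into tuples; the function name `vlans_missing_arp` and all MAC and IP addresses below are made up for illustration and are not taken from the switch:

```python
from collections import defaultdict


def vlans_missing_arp(mac_entries, arp_entries):
    """Return {vlan: sorted MACs} for MACs learned at L2 that have
    no ARP entry on the same VLAN.

    mac_entries: iterable of (vlan_id, mac)       # hypothetical parsed 'show mac' rows
    arp_entries: iterable of (vlan_id, mac, ip)   # hypothetical parsed 'show arp' rows
    """
    # Normalise case so "AA:BB" and "aa:bb" compare equal.
    arp_seen = {(vlan, mac.lower()) for vlan, mac, _ip in arp_entries}
    missing = defaultdict(list)
    for vlan, mac in mac_entries:
        if (vlan, mac.lower()) not in arp_seen:
            missing[vlan].append(mac.lower())
    return {vlan: sorted(macs) for vlan, macs in missing.items()}


# Made-up sample data: VLAN 53 hosts are learned at L2 but have no ARP entry.
mac_entries = [
    (52, "AA:BB:CC:00:00:01"),
    (53, "AA:BB:CC:00:00:02"),
    (53, "AA:BB:CC:00:00:03"),
]
arp_entries = [
    (52, "aa:bb:cc:00:00:01", "192.168.52.10"),
]

result = vlans_missing_arp(mac_entries, arp_entries)
print(result)
```

A VLAN that appears in the result with every one of its MACs listed matches what is described here: L2 learning works, but the switch has no L3 neighbor entries for that VLAN.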

VLAN53 is a /24 (subnet mask 255.255.255.0), so nothing out of the ordinary.
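For reference, the addressing can be sanity-checked with Python's standard `ipaddress` module. This is only a sketch: the VLAN53 prefix `192.168.53.0/24` is an assumption (the actual prefix is not given in this thread), while `192.168.1.0/29` and `192.168.1.2` come from the description above.

```python
import ipaddress

# Assumed prefix for VLAN53; the real one is not stated in the thread.
vlan_net = ipaddress.ip_network("192.168.53.0/24")
# Routed Ethernet0 subnet, as given in the issue description.
uplink_net = ipaddress.ip_network("192.168.1.0/29")

netmask = str(vlan_net.netmask)            # dotted-quad form of /24
usable_hosts = vlan_net.num_addresses - 2  # minus network and broadcast addresses
overlap = vlan_net.overlaps(uplink_net)    # overlapping prefixes would break routing
eth0_in_uplink = ipaddress.ip_address("192.168.1.2") in uplink_net

print(netmask, usable_hosts, overlap, eth0_in_uplink)
```

If the real VLAN53 prefix did overlap the Ethernet0 subnet, that alone could explain the failure, so it is a cheap thing to rule out.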

mbze430 commented 6 months ago

I think it might have to do with dhcp_relay. I built another VLAN and moved all the devices from the old VLAN53 to VLAN268: all the same devices, basically a mirror of everything that was in VLAN53. And it was still broken.

BUT! When I put a machine with a static IP in the VLAN, it worked. That is how I knew it was dhcp_relay on the SONiC switch. I had to restart it three times before it started working.

Is it possible that #18871 is also caused by something wrong with dhcp_relay?

I ask because I have had the most trouble getting dhcp_relay working on SONiC, even before this issue.

neethajohn commented 6 months ago

@mbze430 , please provide the configuration used

mbze430 commented 6 months ago

config_db.json as requested.

mbze430 commented 6 months ago

I just lost two of my VLANs suddenly: VLAN 52 and VLAN 170.

As soon as that happened, I tried to log in to the switch and run 'show techsupport', and it just froze here.

image

I tried to open another SSH session to the switch, and it is now behaving exactly as in #18871.