sonic-net / SONiC

Landing page for Software for Open Networking in the Cloud (SONiC) - https://sonic-net.github.io/SONiC/

Mellanox SN2700 SONiC docker services fail to start, HwSKU "None" causes python error #1114

Open TorrentialFire opened 1 year ago

TorrentialFire commented 1 year ago

Salutations!

We are attempting to run SONiC on a Mellanox SN2700 switch. Several of the docker services fail to start. With my limited troubleshooting ability, I believe I have discerned that the HwSKU is not being properly detected. Other posts and discussions I have found indicate that old firmware might be to blame, but without access to an MLNX-OS .bin file, I can't switch over to that OS and perform a firmware update. Please correct me if I am wrong, but my understanding is that MLNX-OS is the only way to update the firmware on these devices.

Is there something else wrong, perhaps? Thanks for any assistance in advance! Please let me know if there is any more information I can provide for clarity.

show techsupport dump located here (expires Dec 3, 2022).

admin@sonic:~$ show version

SONiC Software Version: SONiC.master.168762-a31a4e7f8
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: a31a4e7f8
Build date: Wed Nov  2 17:48:12 UTC 2022
Built by: AzDevOps@sonic-build-workers-002BS0

Platform: x86_64-mlnx_x86-r5.0.1410
HwSKU: None
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1702K06506
Model Number: MSN2700-CS2F
Hardware Revision: A2
Uptime: 16:20:34 up 16:51,  2 users,  load average: 0.17, 0.25, 0.18
Date: Thu 03 Nov 2022 16:20:34

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-mlnx             latest                    7a7677abf201   867MB
docker-syncd-mlnx             master.168762-a31a4e7f8   7a7677abf201   867MB
docker-platform-monitor       latest                    c6a6b6ac28c4   875MB
docker-platform-monitor       master.168762-a31a4e7f8   c6a6b6ac28c4   875MB
docker-orchagent              latest                    ae126e856887   486MB
docker-orchagent              master.168762-a31a4e7f8   ae126e856887   486MB
docker-fpm-frr                latest                    0ded2429da04   498MB
docker-fpm-frr                master.168762-a31a4e7f8   0ded2429da04   498MB
docker-teamd                  latest                    c1ddc6677e3b   468MB
docker-teamd                  master.168762-a31a4e7f8   c1ddc6677e3b   468MB
docker-macsec                 latest                    7c7c3b31165f   470MB
docker-dhcp-relay             latest                    7b8a8e3ae7bd   461MB
docker-eventd                 latest                    e655bf03eeb0   451MB
docker-eventd                 master.168762-a31a4e7f8   e655bf03eeb0   451MB
docker-sonic-p4rt             latest                    f072348333dd   534MB
docker-sonic-p4rt             master.168762-a31a4e7f8   f072348333dd   534MB
docker-snmp                   latest                    9938d819ece8   498MB
docker-snmp                   master.168762-a31a4e7f8   9938d819ece8   498MB
docker-database               latest                    420d50b4ee8a   452MB
docker-database               master.168762-a31a4e7f8   420d50b4ee8a   452MB
docker-sonic-telemetry        latest                    8616add6b988   746MB
docker-sonic-telemetry        master.168762-a31a4e7f8   8616add6b988   746MB
docker-router-advertiser      latest                    88c779b21304   452MB
docker-router-advertiser      master.168762-a31a4e7f8   88c779b21304   452MB
docker-mux                    latest                    324ad018c755   500MB
docker-mux                    master.168762-a31a4e7f8   324ad018c755   500MB
docker-lldp                   latest                    3385e2edd2cc   494MB
docker-lldp                   master.168762-a31a4e7f8   3385e2edd2cc   494MB
docker-nat                    latest                    9444b720dd96   439MB
docker-nat                    master.168762-a31a4e7f8   9444b720dd96   439MB
docker-sflow                  latest                    4e13ae56f727   437MB
docker-sflow                  master.168762-a31a4e7f8   4e13ae56f727   437MB
docker-sonic-mgmt-framework   latest                    c21d7367e9a1   570MB
docker-sonic-mgmt-framework   master.168762-a31a4e7f8   c21d7367e9a1   570MB
admin@sonic:~$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS        PORTS     NAMES
5f2575f29394   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   17 hours ago   Up 17 hours             telemetry
fad615c4f175   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   17 hours ago   Up 17 hours             mgmt-framework
f2fb3164424c   docker-lldp:latest                   "/usr/bin/docker-lld…"   17 hours ago   Up 17 hours             lldp
f952e79283dc   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   17 hours ago   Up 17 hours             pmon
ea8619a54edc   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 months ago   Up 2 months             radv
168008e39980   docker-eventd:latest                 "/usr/local/bin/supe…"   2 months ago   Up 2 months             eventd
33cff425ea3b   docker-database:latest               "/usr/local/bin/dock…"   2 months ago   Up 2 months             database
admin@sonic:~$ show platform syseeprom
TlvInfo Header:
   Id String:    TlvInfo
   Version:      1
   Total Length: 584
TLV Name          Code      Len  Value
----------------  ------  -----  ------
Product Name      0x21       64  MSN2700
Part Number       0x22       20  MSN2700-CS2F
Serial Number     0x23       24  MT1702K06506
Base MAC Address  0x24        6  24:8A:07:85:49:00
Manufacture Date  0x25       19  01/12/2017 14:29:58
Device Version    0x26        1  0
Platform Name     0x28       64  x86_64-mlnx_x86-r0
ONIE Version      0x29       32  5.0.1404
MAC Addresses     0x2A        2  128
Manufacturer      0x2B        8  Mellanox
admin@sonic:~$ sudo tail -n 50 /var/log/syslog
Nov  3 16:22:14.004926 sonic NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=1/FAILURE
Nov  3 16:22:14.005107 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Nov  3 16:22:14.009191 sonic INFO systemd[1]: Started Host config enforcer daemon.
Nov  3 16:22:14.009533 sonic NOTICE systemd[1]: switch state service is not active.
Nov  3 16:22:14.009653 sonic WARNING systemd[1]: Dependency failed for SNMP container.
Nov  3 16:22:14.009756 sonic NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Nov  3 16:22:14.012826 sonic NOTICE systemd[1]: switch state service is not active.
Nov  3 16:22:14.012988 sonic WARNING systemd[1]: Dependency failed for SNMP container.
Nov  3 16:22:14.013091 sonic NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Nov  3 16:22:14.300857 sonic INFO hostcfgd: ConfigDB connect success
Nov  3 16:22:14.313485 sonic INFO hostcfgd[78909]: Traceback (most recent call last):
Nov  3 16:22:14.313613 sonic INFO hostcfgd[78909]:   File "/usr/local/bin/hostcfgd", line 1678, in <module>
Nov  3 16:22:14.314269 sonic INFO hostcfgd[78909]:     main()
Nov  3 16:22:14.314367 sonic INFO hostcfgd[78909]:   File "/usr/local/bin/hostcfgd", line 1673, in main
Nov  3 16:22:14.314964 sonic INFO hostcfgd[78909]:     daemon = HostConfigDaemon()
Nov  3 16:22:14.315189 sonic INFO hostcfgd[78909]:   File "/usr/local/bin/hostcfgd", line 1466, in __init__
Nov  3 16:22:14.315573 sonic INFO hostcfgd[78909]:     self.feature_handler = FeatureHandler(self.config_db, feature_state_table, self.device_config)
Nov  3 16:22:14.315797 sonic INFO hostcfgd[78909]:   File "/usr/local/bin/hostcfgd", line 202, in __init__
Nov  3 16:22:14.316121 sonic INFO hostcfgd[78909]:     self._device_running_config = device_info.get_device_runtime_metadata()
Nov  3 16:22:14.316307 sonic INFO hostcfgd[78909]:   File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 478, in get_device_runtime_metadata
Nov  3 16:22:14.316485 sonic INFO hostcfgd[78909]:     port_metadata = {'ETHERNET_PORTS_PRESENT': True if get_path_to_port_config_file(hwsku=None, asic="0" if is_multi_npu() else None) else False}
Nov  3 16:22:14.316666 sonic INFO hostcfgd[78909]:   File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 299, in get_path_to_port_config_file
Nov  3 16:22:14.317551 sonic INFO hostcfgd[78909]:     (platform_path, hwsku_path) = get_paths_to_platform_and_hwsku_dirs()
Nov  3 16:22:14.317846 sonic INFO hostcfgd[78909]:   File "/usr/local/lib/python3.9/dist-packages/sonic_py_common/device_info.py", line 265, in get_paths_to_platform_and_hwsku_dirs
Nov  3 16:22:14.318044 sonic INFO hostcfgd[78909]:     hwsku_path = os.path.join(platform_path, hwsku)
Nov  3 16:22:14.318237 sonic INFO hostcfgd[78909]:   File "/usr/lib/python3.9/posixpath.py", line 90, in join
Nov  3 16:22:14.318418 sonic INFO hostcfgd[78909]:     genericpath._check_arg_types('join', a, *p)
Nov  3 16:22:14.318642 sonic INFO hostcfgd[78909]:   File "/usr/lib/python3.9/genericpath.py", line 152, in _check_arg_types
Nov  3 16:22:14.318833 sonic INFO hostcfgd[78909]:     raise TypeError(f'{funcname}() argument must be str, bytes, or '
Nov  3 16:22:14.319018 sonic INFO hostcfgd[78909]: TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'
Nov  3 16:22:14.348945 sonic NOTICE systemd[1]: hostcfgd.service: Main process exited, code=exited, status=1/FAILURE
Nov  3 16:22:14.349139 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Nov  3 16:22:14.351433 sonic WARNING systemd[1]: hostcfgd.service: Start request repeated too quickly.
Nov  3 16:22:14.351563 sonic WARNING systemd[1]: hostcfgd.service: Failed with result 'exit-code'.
Nov  3 16:22:14.351660 sonic ERR systemd[1]: Failed to start Host config enforcer daemon.
Nov  3 16:22:14.351760 sonic NOTICE systemd[1]: switch state service is not active.
Nov  3 16:22:14.351870 sonic WARNING systemd[1]: Dependency failed for SNMP container.
Nov  3 16:22:14.351964 sonic NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
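
For reference, the traceback above comes down to os.path.join being called with None as the HwSKU. The following is a minimal sketch that reproduces the same TypeError outside of SONiC. It does not import sonic_py_common; the platform directory path and the join_hwsku helper are hypothetical and only stand in for the values used inside get_paths_to_platform_and_hwsku_dirs.

```python
import os

# Hypothetical platform directory; the real path on the switch is not shown in the log.
platform_path = "/usr/share/sonic/device/example-platform"

# The value that `show version` reports as "HwSKU: None".
hwsku = None


def join_hwsku(platform_path, hwsku):
    """Join the platform directory and HwSKU, returning the error text on failure."""
    try:
        return os.path.join(platform_path, hwsku)
    except TypeError as exc:
        # os.path.join rejects None, which is the failure hostcfgd hits at startup.
        return f"TypeError: {exc}"


result = join_hwsku(platform_path, hwsku)
print(result)
```

This only shows why hostcfgd exits; it does not explain why the HwSKU was never populated in the first place.
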
shivanangi commented 1 year ago

SwSS is not active; you may want to check the syncd container.

Can you share docker ps -a output?

Thanks.

TorrentialFire commented 1 year ago

Here is the output of docker ps --all:

admin@sonic:~$ docker ps --all
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS        PORTS     NAMES
5f2575f29394   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   19 hours ago   Up 19 hours             telemetry
fad615c4f175   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   19 hours ago   Up 19 hours             mgmt-framework
f2fb3164424c   docker-lldp:latest                   "/usr/bin/docker-lld…"   19 hours ago   Up 19 hours             lldp
f952e79283dc   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   19 hours ago   Up 19 hours             pmon
ea8619a54edc   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 months ago   Up 2 months             radv
168008e39980   docker-eventd:latest                 "/usr/local/bin/supe…"   2 months ago   Up 2 months             eventd
33cff425ea3b   docker-database:latest               "/usr/local/bin/dock…"   2 months ago   Up 2 months             database

And here's a grep for occurrences of syncd in the syslog:

admin@sonic:~$ sudo cat /var/log/syslog | grep syncd
Nov  3 14:10:05.769254 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: mux, snmp, dhcp_relay, syncd, swss, teamd, bgp
Nov  3 14:10:06.806861 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"syncd","timestamp":"2022-11-03T14:10:06.806710Z"}}
Nov  3 14:11:05.861950 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: swss, teamd, dhcp_relay, mux, snmp, bgp, syncd
Nov  3 14:11:06.405897 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"syncd","timestamp":"2022-11-03T14:11:06.405217Z"}}
Nov  3 14:12:05.893813 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: mux, syncd, swss, teamd, dhcp_relay, bgp, snmp
Nov  3 14:12:06.444846 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"syncd","timestamp":"2022-11-03T14:12:06.444652Z"}}
Nov  3 14:13:05.925553 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: snmp, swss, teamd, syncd, bgp, mux, dhcp_relay
Nov  3 14:13:06.516344 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"syncd","timestamp":"2022-11-03T14:13:06.515439Z"}}
...
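
The container_checker lines list the same containers in a different order each minute, which makes them hard to compare by eye. A small sketch, assuming the message format stays as in the excerpt above, collects the names into one sorted list. The missing_containers helper is hypothetical and not part of SONiC; the two input lines are copied from the syslog output.

```python
import re

# Two of the monit lines from the syslog excerpt above.
LOG_LINES = [
    "Nov  3 14:10:05.769254 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: mux, snmp, dhcp_relay, syncd, swss, teamd, bgp",
    "Nov  3 14:11:05.861950 sonic ERR monit[453]: 'container_checker' status failed (3) -- Expected containers not running: swss, teamd, dhcp_relay, mux, snmp, bgp, syncd",
]

# Capture everything after the fixed phrase up to the end of the line.
PATTERN = re.compile(r"Expected containers not running: (.+)$")


def missing_containers(lines):
    """Return the sorted, de-duplicated container names reported as not running."""
    missing = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            missing.update(name.strip() for name in match.group(1).split(","))
    return sorted(missing)


print(missing_containers(LOG_LINES))
# ['bgp', 'dhcp_relay', 'mux', 'snmp', 'swss', 'syncd', 'teamd']
```
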
shivanangi commented 1 year ago

Yes, it looks like none of the essential containers are running properly. You may want to get a "tested/stable" image from the Mellanox Switch Support team.

syncd relies on SAI, so technically anything in SAI could be the problem.

tryauuum commented 1 year ago

Can you check the BIOS version with dmidecode? I had problems running SONiC on an SN2700, but updating the BIOS to a 2018 version solved them (at least SONiC no longer complains that the platform is not supported).
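
One way to collect that information is sketched below, assuming dmidecode is installed on the switch and the script is run as root. The bios_info helper is hypothetical; bios-version and bios-release-date are standard dmidecode string keywords. The command runner is passed in as a parameter so the function can be exercised without real hardware.

```python
import subprocess


def bios_info(run=subprocess.run):
    """Query the BIOS version and release date via `dmidecode -s <keyword>`."""
    info = {}
    for key in ("bios-version", "bios-release-date"):
        # check=True raises CalledProcessError if dmidecode exits with an error.
        proc = run(["dmidecode", "-s", key], capture_output=True, text=True, check=True)
        info[key] = proc.stdout.strip()
    return info


# On the switch, as root:
#     print(bios_info())
```

The release date is the field to compare against the 2018 BIOS mentioned above.
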