sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
736 stars 1.42k forks source link

determine-reboot-cause fails due to TypeError seen during during boot up #6009

Closed vaibhavhd closed 3 years ago

vaibhavhd commented 3 years ago

Description

Reboot cause determination fails and errors are seen in syslog. The errors seem to be coming due to Python3 script is using Python2 based library from different vendors at https://github.com/Azure/sonic-buildimage/tree/master/platform/broadcom

The Python3 jump for determine-reboot-cause and process-reboot-cause has been made with the platform specific lib still based out in Python2.

Steps to reproduce the issue:

  1. Install latest SONiC master image on Arista device. I hit the issue in 500
  2. Issue is seen in syslog at the beginning stage where DUT is booting into the new image.

Describe the results you received:

Nov 23 17:10:34 sonic root: SONiC version master.500-e0781f46 starting up...
Nov 23 17:10:34 sonic root: SONiC version master.500-e0781f46 starting up...
Nov 23 17:10:34 sonic rc.local[469]: + cut -d , -f2
Nov 23 17:10:34 sonic rc.local[469]: + cut -d , -f2
Nov 23 17:10:34 sonic determine-reboot-cause[599]: Traceback (most recent call last):
Nov 23 17:10:34 sonic determine-reboot-cause[599]: Traceback (most recent call last):
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     main()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     main()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     platform  = sonic_platform.platform.Platform()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     platform  = sonic_platform.platform.Platform()
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/platform.py", line 16, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/platform.py", line 16, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._chassis = Chassis(self._platform)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._chassis = Chassis(self._platform)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/chassis.py", line 48, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/chassis.py", line 48, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._sfp_list[index - 1] = Sfp(index, sfp)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._sfp_list[index - 1] = Sfp(index, sfp)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/sfp.py", line 45, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/utils/sonic_platform/sfp.py", line 45, in __init__
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._eepromPath = EEPROM_PATH.format(sfp.addr.bus, sfp.addr.address)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     self._eepromPath = EEPROM_PATH.format(sfp.addr.bus, sfp.addr.address)
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/components/scd.py", line 47, in bus
Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/components/scd.py", line 47, in bus
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     return self.scd_.i2cOffset + self.bus_
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     return self.scd_.i2cOffset + self.bus_
Nov 23 17:10:34 sonic determine-reboot-cause[599]: TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
Nov 23 17:10:34 sonic determine-reboot-cause[599]: TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

Describe the results you expected: The TypeError should not be thrown and reboot cause should be detected.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
$ show version 

SONiC Software Version: SONiC.master.498-5fbb6ee6
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: 5fbb6ee6
Build date: Fri Nov 20 14:36:29 UTC 2020
Built by: johnar@jenkins-worker-3

Platform: x86_64-arista_7260cx3_64
HwSKU: Arista-7260CX3-Q64
ASIC: broadcom
**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
vaibhavhd commented 3 years ago

This now does not seem to be an Arista platform issue. The reboo-cause determination fails on other platforms too. The errors seen also seem to be due to Python3 conversion done recently - https://github.com/Azure/sonic-buildimage/pull/5933 https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/determine-reboot-cause#L1

The error trace on Dell clearly points out at the Python3 error due to Python2 code:

Nov 24 22:38:43 sonic determine-reboot-cause: /proc/cmdline indicates reboot type: warm-reboot
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: Traceback (most recent call last):
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: Traceback (most recent call last):
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     main()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     main()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     platform  = sonic_platform.platform.Platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     platform  = sonic_platform.platform.Platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/platform.py", line 25, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/platform.py", line 25, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     self._chassis = Chassis()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     self._chassis = Chassis()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 61, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 61, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     self._eeprom = Eeprom()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     self._eeprom = Eeprom()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:   File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
Nov 24 22:38:43 sonic determine-reboot-cause[1240]:     total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: TypeError: ord() expected string of length 1, but int found
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: TypeError: ord() expected string of length 1, but int found
vaibhavhd commented 3 years ago

Issue is also seen in 502.

vaibhavhd commented 3 years ago

The sonic-platform-common library has recently made updates to support Python3. As part of the PR https://github.com/Azure/sonic-platform-common/pull/127, read_eeprom() now returns a bytearray instead of a string.

The Dell S6100 error reported above is due to the Dell library is not updated to expect bytearray eeprom data. https://github.com/Azure/sonic-buildimage/blob/master/platform/broadcom/sonic-platform-modules-dell/s6100/sonic_platform/eeprom.py#L51

  File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
    total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
    total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
  TypeError: ord() expected string of length 1, but int found

@jleveque should this fix be dealt by Dell?

jleveque commented 3 years ago

Yes. The fix is here: https://github.com/Azure/sonic-buildimage/pull/5960. I was waiting to merge it until after https://github.com/Azure/sonic-buildimage/pull/5926 merges, but the PR has taken longer than expected to be merge-ready. It should now be ready to merge, just waiting on the vsimage test to pass, but since this is causing a regression, I can merge it now and rebase my PR on top of it.

vaibhavhd commented 3 years ago

Thanks Joe! I am now looking at Arista issue posted in this issue's description. That seems to have percolated after this commit in Arista repo: https://github.com/aristanetworks/sonic/commit/05a02d74c91fb1b91fbb4602bf8c77067da3ecf2

Particularly this new line that got added: https://github.com/aristanetworks/sonic/blob/master/arista/utils/sonic_platform/sfp.py#L45

Using some extra debug prints I found out that at the time of determine-reboot-cause (which takes places after rc.local completion), the kernel I2C buses are not ready. Which causes i2cBusFromName() to return None https://github.com/aristanetworks/sonic/blob/master/arista/drivers/scd/driver.py#L29

This happens during driver setup after reboot https://github.com/aristanetworks/sonic/blob/master/arista/drivers/scd/driver.py#L80

Hence self.scd_.i2cOffset is set to None which causes problem seen in the logs:

Nov 23 17:10:34 sonic determine-reboot-cause[599]:   File "/usr/lib/python3/dist-packages/arista/components/scd.py", line 47, in bus
Nov 23 17:10:34 sonic determine-reboot-cause[599]:     return self.scd_.i2cOffset + self.bus_

@jleveque should someone from Arista provide a Fix?

jleveque commented 3 years ago

@Staphylo: Can you please investigate the issue in the comment above?

Staphylo commented 3 years ago

Sure Joe. The error seen above most often happens when the platform is not initialized yet but APIs are being called. We have a systemd Before=process-reboot-cause rule to prevent this from happening but since the systemd service was renamed it doesn't apply anymore. We'll work on a fix. https://github.com/aristanetworks/sonic/blob/3943e69510b471e65797d6054dd5f47cff3012c3/systemd/platform-arista-reboot-cause.service#L5

Staphylo commented 3 years ago

Just confirming that adding the dependency solves the issue. However process-reboot-cause.service now crashes though it's not platform related. I'll perform a submodule update with the fix in the coming days.

root@gd231:/home/admin# journalctl -u process-reboot-cause.service
-- Logs begin at Wed 2020-11-25 21:40:07 UTC, end at Wed 2020-11-25 21:41:44 UTC. --
Nov 25 21:41:38 gd231 systemd[1]: Started Reboot cause determination service.
Nov 25 21:41:38 gd231 systemd[4489]: process-reboot-cause.service: Failed to execute command: No such file or directory
Nov 25 21:41:38 gd231 systemd[4489]: process-reboot-cause.service: Failed at step EXEC spawning /usr/bin/process-reboot-cause: No such file or directory
Nov 25 21:41:38 gd231 systemd[1]: process-reboot-cause.service: Main process exited, code=exited, status=203/EXEC
Nov 25 21:41:38 gd231 systemd[1]: process-reboot-cause.service: Failed with result 'exit-code'.
root@gd231:/home/admin# journalctl -u determine-reboot-cause.service 
-- Logs begin at Wed 2020-11-25 21:40:07 UTC, end at Wed 2020-11-25 21:41:44 UTC. --
Nov 25 21:40:14 gd231 systemd[1]: Started Reboot cause determination service.
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: Starting up...
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: No reboot cause found from /proc/cmdline
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: Platform api returns reboot cause Non-Hardware, reboot, time: 2020-11-25 21:39:37 UTC
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: No reboot cause found from platform api
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: /host/reboot-cause/reboot-cause.txt indicates the reboot cause: User issued 'reboot' command [User: admin, Time: Wed 25 Nov 2020 09:39:29 PM UTC]
Nov 25 21:40:14 gd231 systemd[1]: determine-reboot-cause.service: Succeeded.
Staphylo commented 3 years ago

The path for ExecStart in process-reboot-cause.service should be /usr/local/bin/process-reboot-cause. I just tried it and it seems to be working.

jleveque commented 3 years ago

The path for ExecStart in process-reboot-cause.service should be /usr/local/bin/process-reboot-cause. I just tried it and it seems to be working.

@sujinmkang: Can you please open a PR to fix this?

vaibhavhd commented 3 years ago

Closing this issue now. The error logged in this issue is not seen in both Arista and Dell platforms. However, the reboot cause determination still is flaky and incorrect (for different errors). I am tracking this in separate tickets now: Dell https://github.com/Azure/sonic-buildimage/issues/6082 More general issue - https://github.com/Azure/sonic-buildimage/issues/6101