Closed vaibhavhd closed 3 years ago
This now does not seem to be an Arista platform issue. The reboo-cause determination fails on other platforms too. The errors seen also seem to be due to Python3 conversion done recently - https://github.com/Azure/sonic-buildimage/pull/5933 https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/determine-reboot-cause#L1
The error trace on Dell clearly points out at the Python3 error due to Python2 code:
Nov 24 22:38:43 sonic determine-reboot-cause: /proc/cmdline indicates reboot type: warm-reboot
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: Traceback (most recent call last):
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: Traceback (most recent call last):
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 236, in <module>
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: main()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: main()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 185, in main
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: (hardware_reboot_cause, additional_reboot_info) = find_hardware_reboot_cause()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 120, in find_hardware_reboot_cause
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: hardware_reboot_cause_major, hardware_reboot_cause_minor = get_reboot_cause_from_platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/bin/determine-reboot-cause", line 107, in get_reboot_cause_from_platform
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: platform = sonic_platform.platform.Platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: platform = sonic_platform.platform.Platform()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/platform.py", line 25, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/platform.py", line 25, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: self._chassis = Chassis()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: self._chassis = Chassis()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 61, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 61, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: self._eeprom = Eeprom()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: self._eeprom = Eeprom()
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: TypeError: ord() expected string of length 1, but int found
Nov 24 22:38:43 sonic determine-reboot-cause[1240]: TypeError: ord() expected string of length 1, but int found
Issue is also seen in 502.
The sonic-platform-common
library has recently made updates to support Python3. As part of the PR https://github.com/Azure/sonic-platform-common/pull/127, read_eeprom()
now returns a bytearray
instead of a string
.
The Dell S6100 error reported above is due to the Dell library is not updated to expect bytearray
eeprom data.
https://github.com/Azure/sonic-buildimage/blob/master/platform/broadcom/sonic-platform-modules-dell/s6100/sonic_platform/eeprom.py#L51
File "/usr/local/lib/python3.7/dist-packages/sonic_platform/eeprom.py", line 51, in __init__
total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
total_length = (ord(eeprom[9]) << 8) | ord(eeprom[10])
TypeError: ord() expected string of length 1, but int found
@jleveque should this fix be dealt by Dell?
Yes. The fix is here: https://github.com/Azure/sonic-buildimage/pull/5960. I was waiting to merge it until after https://github.com/Azure/sonic-buildimage/pull/5926 merges, but the PR has taken longer than expected to be merge-ready. It should now be ready to merge, just waiting on the vsimage test to pass, but since this is causing a regression, I can merge it now and rebase my PR on top of it.
Thanks Joe! I am now looking at Arista issue posted in this issue's description. That seems to have percolated after this commit in Arista repo: https://github.com/aristanetworks/sonic/commit/05a02d74c91fb1b91fbb4602bf8c77067da3ecf2
Particularly this new line that got added: https://github.com/aristanetworks/sonic/blob/master/arista/utils/sonic_platform/sfp.py#L45
Using some extra debug prints I found out that at the time of determine-reboot-cause
(which takes places after rc.local
completion), the kernel I2C buses are not ready. Which causes i2cBusFromName()
to return None
https://github.com/aristanetworks/sonic/blob/master/arista/drivers/scd/driver.py#L29
This happens during driver setup after reboot https://github.com/aristanetworks/sonic/blob/master/arista/drivers/scd/driver.py#L80
Hence self.scd_.i2cOffset
is set to None
which causes problem seen in the logs:
Nov 23 17:10:34 sonic determine-reboot-cause[599]: File "/usr/lib/python3/dist-packages/arista/components/scd.py", line 47, in bus
Nov 23 17:10:34 sonic determine-reboot-cause[599]: return self.scd_.i2cOffset + self.bus_
@jleveque should someone from Arista provide a Fix?
@Staphylo: Can you please investigate the issue in the comment above?
Sure Joe. The error seen above most often happens when the platform is not initialized yet but APIs are being called.
We have a systemd Before=process-reboot-cause
rule to prevent this from happening but since the systemd service was renamed it doesn't apply anymore. We'll work on a fix.
https://github.com/aristanetworks/sonic/blob/3943e69510b471e65797d6054dd5f47cff3012c3/systemd/platform-arista-reboot-cause.service#L5
Just confirming that adding the dependency solves the issue.
However process-reboot-cause.service
now crashes though it's not platform related.
I'll perform a submodule update with the fix in the coming days.
root@gd231:/home/admin# journalctl -u process-reboot-cause.service
-- Logs begin at Wed 2020-11-25 21:40:07 UTC, end at Wed 2020-11-25 21:41:44 UTC. --
Nov 25 21:41:38 gd231 systemd[1]: Started Reboot cause determination service.
Nov 25 21:41:38 gd231 systemd[4489]: process-reboot-cause.service: Failed to execute command: No such file or directory
Nov 25 21:41:38 gd231 systemd[4489]: process-reboot-cause.service: Failed at step EXEC spawning /usr/bin/process-reboot-cause: No such file or directory
Nov 25 21:41:38 gd231 systemd[1]: process-reboot-cause.service: Main process exited, code=exited, status=203/EXEC
Nov 25 21:41:38 gd231 systemd[1]: process-reboot-cause.service: Failed with result 'exit-code'.
root@gd231:/home/admin# journalctl -u determine-reboot-cause.service
-- Logs begin at Wed 2020-11-25 21:40:07 UTC, end at Wed 2020-11-25 21:41:44 UTC. --
Nov 25 21:40:14 gd231 systemd[1]: Started Reboot cause determination service.
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: Starting up...
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: No reboot cause found from /proc/cmdline
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: Platform api returns reboot cause Non-Hardware, reboot, time: 2020-11-25 21:39:37 UTC
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: No reboot cause found from platform api
Nov 25 21:40:14 gd231 determine-reboot-cause[1087]: /host/reboot-cause/reboot-cause.txt indicates the reboot cause: User issued 'reboot' command [User: admin, Time: Wed 25 Nov 2020 09:39:29 PM UTC]
Nov 25 21:40:14 gd231 systemd[1]: determine-reboot-cause.service: Succeeded.
The path for ExecStart
in process-reboot-cause.service
should be /usr/local/bin/process-reboot-cause
.
I just tried it and it seems to be working.
The path for
ExecStart
inprocess-reboot-cause.service
should be/usr/local/bin/process-reboot-cause
. I just tried it and it seems to be working.
@sujinmkang: Can you please open a PR to fix this?
Closing this issue now. The error logged in this issue is not seen in both Arista and Dell platforms. However, the reboot cause determination still is flaky and incorrect (for different errors). I am tracking this in separate tickets now: Dell https://github.com/Azure/sonic-buildimage/issues/6082 More general issue - https://github.com/Azure/sonic-buildimage/issues/6101
Description
Reboot cause determination fails and errors are seen in syslog. The errors seem to be coming due to Python3 script is using Python2 based library from different vendors at https://github.com/Azure/sonic-buildimage/tree/master/platform/broadcom
The Python3 jump for
determine-reboot-cause
andprocess-reboot-cause
has been made with the platform specific lib still based out in Python2.Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected: The
TypeError
should not be thrown and reboot cause should be detected.Additional information you deem important (e.g. issue happens only occasionally):