The following output shows a misbehaving chip that fails to train up, which causes tt-flash to hang. It is currently not obvious which board/chip is the problem.
ansible@e14cs04:~$ tt-flash --fw-tar /opt/tenstorrent/tt-tools/tt-firmware/fw_pack-80.10.0.0.fwbundle --force
Stage: SETUP
Searching for default sys-config path
Checking /etc/tenstorrent/config.json: not found
Checking ~/.config/tenstorrent/config.json: not found
Could not find config in default search locations, if you need it, either pass it in explicity or generate one
Warning: continuing without sys-config, galaxy systems will not be reset
Stage: DETECT
Detected Chips: 5
(17/300) ARC: another message is queued (0x2c)
(17/300) [0/4] DRAM: Waiting for ARC
(17/900) [0/16] ETH: Waiting for ARC
^CTraceback (most recent call last):
File "/opt/tenstorrent/tt-tools/tt-flash/.venv/bin/tt-flash", line 8, in <module>
sys.exit(main())
File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/main.py", line 232, in main
devices = detect_local_chips(ignore_ethernet=True)
File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/chip.py", line 210, in detect_local_chips
for device in luwen_detect_chips_fallible(
File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/chip.py", line 188, in chip_detect_callback
if sys.stdout.isatty():
KeyboardInterrupt
ansible@e14cs04:~$ tt-smi
Detected C^C
Traceback (most recent call last):
File "/opt/tenstorrent/tt-tools/tt-smi/.venv/bin/tt-smi", line 8, in <module>
sys.exit(main())
File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 773, in main
devices = detect_chips_with_callback(local_only=args.local)
File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 334, in detect_chips_with_callback
for device in detect_chips_fallible(
File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 281, in chip_detect_callback
print(
KeyboardInterrupt
Can there be more verbose logs that print the board ID / bus ID / slot number of the board that is hanging?
This would be useful for all tools if it's part of shared chip detection code.
That's a good idea! It probably doesn't need to be an option either. I could get away with just adding another line above the detection progress to indicate which chip we are looking at.
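A rough sketch of what that extra line could look like. None of this is the actual luwen or tt_tools_common API: `ChipIdentity`, `format_chip_header` and `print_progress` are hypothetical names, and it assumes the detection callback can be handed at least the chip index and PCI bus ID, even when the board ID cannot be read because the chip is hung.

```python
import sys
from dataclasses import dataclass
from typing import Iterable, Optional, TextIO


@dataclass
class ChipIdentity:
    """Hypothetical container for whatever the detection code knows about a chip."""
    index: int                      # zero-based position in the detected list
    pci_bdf: Optional[str] = None   # PCI bus ID, e.g. "0000:41:00.0"
    board_id: Optional[str] = None  # may be unavailable if the chip is hung


def format_chip_header(chip: ChipIdentity, total: int) -> str:
    """Build the one-line header printed above the detection progress."""
    parts = [f"Detecting chip {chip.index + 1}/{total}"]
    if chip.pci_bdf:
        parts.append(f"bus {chip.pci_bdf}")
    if chip.board_id:
        parts.append(f"board {chip.board_id}")
    else:
        parts.append("board id unavailable")
    return ", ".join(parts)


def print_progress(
    chip: ChipIdentity,
    total: int,
    status_lines: Iterable[str],
    out: Optional[TextIO] = None,
) -> None:
    """Print the header, then the existing per-component status lines."""
    out = out if out is not None else sys.stdout
    print(format_chip_header(chip, total), file=out)
    for line in status_lines:
        print(f"  {line}", file=out)


if __name__ == "__main__":
    # Example values only; the bus ID here is made up.
    hung = ChipIdentity(index=2, pci_bdf="0000:41:00.0")
    print_progress(hung, 5, ["ARC: another message is queued (0x2c)"])
```

If the header is produced in the shared chip detection callback, tt-flash and tt-smi would both pick it up without any per-tool changes.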