tenstorrent / tt-flash

Tenstorrent Firmware Update Utility
Apache License 2.0

Add verbose logging to tt-flash / tt-smi to identify faulty or hung chips #14

Open hmohiuddinTT opened 1 month ago

hmohiuddinTT commented 1 month ago

The output below shows a chip misbehaving and failing to train, which causes tt-flash to hang during detection. It is currently not obvious which board or chip is at fault.

ansible@e14cs04:~$ tt-flash --fw-tar /opt/tenstorrent/tt-tools/tt-firmware/fw_pack-80.10.0.0.fwbundle --force
Stage: SETUP
        Searching for default sys-config path
        Checking /etc/tenstorrent/config.json: not found
        Checking ~/.config/tenstorrent/config.json: not found

        Could not find config in default search locations, if you need it, either pass it in explicity or generate one
        Warning: continuing without sys-config, galaxy systems will not be reset
Stage: DETECT
Detected Chips: 5
(17/300) ARC: another message is queued (0x2c)
(17/300) [0/4] DRAM: Waiting for ARC
(17/900) [0/16] ETH: Waiting for ARC
^CTraceback (most recent call last):
  File "/opt/tenstorrent/tt-tools/tt-flash/.venv/bin/tt-flash", line 8, in <module>
    sys.exit(main())
  File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/main.py", line 232, in main
    devices = detect_local_chips(ignore_ethernet=True)
  File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/chip.py", line 210, in detect_local_chips
    for device in luwen_detect_chips_fallible(
  File "/opt/tenstorrent/tt-tools/tt-flash/.venv/lib/python3.8/site-packages/tt_flash/chip.py", line 188, in chip_detect_callback
    if sys.stdout.isatty():
KeyboardInterrupt

ansible@e14cs04:~$ tt-smi
 Detected C^C
Traceback (most recent call last):
  File "/opt/tenstorrent/tt-tools/tt-smi/.venv/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 773, in main
    devices = detect_chips_with_callback(local_only=args.local)
  File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 334, in detect_chips_with_callback
    for device in detect_chips_fallible(
  File "/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 281, in chip_detect_callback
    print(
KeyboardInterrupt

Could the tools print more verbose logs that include the board ID, bus ID, or slot number of the board that is hanging?

If this lived in the shared chip detection code, every tool would benefit.

TTDRosen commented 1 month ago

That's a good idea! It probably doesn't need to be an option either. I could get away with just adding another line above the detection progress to indicate which chip we are looking at.
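A rough sketch of what that extra line could look like, for discussion only. It does not use the real luwen or tt-flash types: `ChipInfo`, `format_chip_header`, and `report_progress` are hypothetical names, and it assumes the detection callback can be handed a chip index plus whatever bus or board identifier is already known at that point.

```python
import sys
from dataclasses import dataclass
from typing import Iterable, Optional, TextIO


@dataclass
class ChipInfo:
    """Hypothetical identifying fields; not the real luwen/tt-flash types."""

    index: int                      # zero-based position in the detection order
    total: int                      # number of chips detected
    bus_id: Optional[str] = None    # e.g. a PCI address, if known
    board_id: Optional[str] = None  # e.g. a board identifier, if known


def format_chip_header(chip: ChipInfo) -> str:
    """Build the one-line header naming the chip currently being initialized."""
    parts = [f"Chip {chip.index + 1}/{chip.total}"]
    if chip.bus_id:
        parts.append(f"bus {chip.bus_id}")
    if chip.board_id:
        parts.append(f"board {chip.board_id}")
    return "Detecting " + ", ".join(parts)


def report_progress(
    chip: ChipInfo, status_lines: Iterable[str], out: TextIO = sys.stdout
) -> None:
    """Write the header line, then the existing per-component status lines."""
    out.write(format_chip_header(chip) + "\n")
    for line in status_lines:
        out.write(line + "\n")
    out.flush()


if __name__ == "__main__":
    # Placeholder bus ID for illustration; not taken from the report above.
    chip = ChipInfo(index=2, total=5, bus_id="0000:41:00.0")
    report_progress(chip, ["(17/300) [0/4] DRAM: Waiting for ARC"])
```

Because the header is written before the status lines, a hang followed by Ctrl-C would still leave the identifying line on screen. Fields that are unknown are simply omitted rather than printed as placeholders.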