tenstorrent / tt-smi

Tenstorrent console based hardware information program
Apache License 2.0

tt-smi should enable RUST_BACKTRACE=1 to make it easier to debug luwen failures #9

Closed: olofj closed this issue 4 months ago

olofj commented 4 months ago

From a tt-metal CI job:

Run '/opt/tt_metal_infra/scripts/ci/grayskull/cleanup.sh'
Current date / time is Wed Mar 13 14:00:38 UTC 2024
thread '<unnamed>' panicked at crates/kmdif/src/lib.rs:125:9:
DMA buffer allocation on device 0 failed (4194304 bytes) with error ENOMEM: Out of memory
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/opt/tt_metal_infra/provisioning/provisioning_env/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 672, in main
    pci_board_reset(args.reset, reinit=True)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 561, in pci_board_reset
    chip = PciChip(pci_interface=pci_idx)
pyo3_runtime.PanicException: DMA buffer allocation on device 0 failed (4194304 bytes) with error ENOMEM: Out of memory

There's no backtrace from the Rust side, which makes this harder to debug. It might be easiest to have the tt-smi Python code set RUST_BACKTRACE itself, early enough that it takes effect everywhere tt-smi runs, so that nobody needs to set it outside of the scripts.
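A minimal sketch of what that could look like on the Python side. This is an assumption about one possible approach, not the change that actually landed in tt-smi: the helper name `enable_rust_backtrace` is hypothetical, and the only thing it relies on is that writing to `os.environ` updates the process environment that the Rust extension later reads.

```python
import os


def enable_rust_backtrace(default: str = "1") -> str:
    """Set RUST_BACKTRACE unless the caller already chose a value.

    Hypothetical helper for illustration; not taken from the tt-smi source.
    Returns the value that is in effect after the call.
    """
    # setdefault keeps an explicit user setting such as "0" or "full".
    return os.environ.setdefault("RUST_BACKTRACE", default)


# Call this at the top of the entry point, before importing any
# Rust-backed extension module, so the variable is already in the
# process environment by the time a panic could occur.
enable_rust_backtrace()
```

Using `setdefault` rather than a plain assignment means anyone who exports `RUST_BACKTRACE=0` or `RUST_BACKTRACE=full` in their shell still gets the behaviour they asked for.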

Cc: @TTDRosen @sbansalTT @tt-rkim

tt-rkim commented 4 months ago

~@TT-billteng @ttmchiou~

~Wondering if we should set RUST_BACKTRACE on our github runner.~

Never mind, per below.

olofj commented 4 months ago

@tt-rkim the whole idea behind this issue is to not pollute your environment with it, and to just set it in tt-smi instead.

sbansalTT commented 4 months ago

Yup, on it! Will be part of the upcoming SMI release.

sbansalTT commented 4 months ago

Fixed as of v2.2.0.