tenstorrent / tt-smi

Tenstorrent console based hardware information program
Apache License 2.0

TT-SMI reset uses PCI Device ID but only shows chip index starting from 0 #22

Open hmohiuddinTT opened 3 months ago

hmohiuddinTT commented 3 months ago

Summary

TT-SMI shows devices zero-indexed regardless of the device ID under /dev/tenstorrent. This is slightly confusing, since users wouldn't know the PCI index of a card unless they run ls /dev/tenstorrent in a single-card container. I don't think there's any way to figure out that mapping for multi-card devices.

I think we should add a PCI ID / Device ID field to TT-SMI separate from the chip index. The reset command already accepts the PCI index, so this would just make it easier for the users to figure out which card they want to reset.
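
To illustrate the kind of mapping I mean, here is a minimal sketch in Python. It is not taken from the TT-SMI code: the helper names `list_pci_device_ids` and `index_to_device_id` are hypothetical, and it assumes that the zero-based chip index follows the ascending order of the numeric entries under /dev/tenstorrent, which I have not verified.

```python
import os
from typing import Dict, List


def list_pci_device_ids(dev_dir: str = "/dev/tenstorrent") -> List[int]:
    """Return the numeric device IDs found under dev_dir, sorted ascending.

    Hypothetical helper: non-numeric entries are ignored, and a missing
    directory is treated as "no devices".
    """
    try:
        names = os.listdir(dev_dir)
    except FileNotFoundError:
        return []
    return sorted(int(name) for name in names if name.isdecimal())


def index_to_device_id(dev_dir: str = "/dev/tenstorrent") -> Dict[int, int]:
    """Map the zero-based chip index to the device ID.

    Assumption: chips are displayed in ascending device-ID order.
    """
    return dict(enumerate(list_pci_device_ids(dev_dir)))


if __name__ == "__main__":
    for chip_index, device_id in index_to_device_id().items():
        print(f"chip index {chip_index} -> /dev/tenstorrent/{device_id}")
```

If that ordering assumption holds, the second column is the value users would pass to tt-smi -r, and it is the value a new PCI ID / Device ID field could display.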

Screenshots

(python_env) user@asrinivasan-test-822bbc87-deployment-64b7856dd4-r2gd2:~$ tt-smi -r 0
thread '<unnamed>' panicked at crates/pyluwen/src/lib.rs:521:70:
called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 0, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }
stack backtrace:
   0:     0x7fa57b289f5b - std::backtrace_rs::backtrace::libunwind::trace::h3926e05c1d1f3b6d
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x7fa57b289f5b - std::backtrace_rs::backtrace::trace_unsynchronized::h9f5691494ac25ae6
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7fa57b289f5b - std::sys_common::backtrace::_print_fmt::h7e6bb7b81bf214f4
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x7fa57b289f5b - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hcf688c88e28c91b4
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7fa57b2bdab0 - core::fmt::rt::Argument::fmt::h59a542682908b618
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/fmt/rt.rs:142:9
   5:     0x7fa57b2bdab0 - core::fmt::write::hce91e70849a27dee
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/fmt/mod.rs:1120:17
   6:     0x7fa57b2802bd - std::io::Write::write_fmt::h0bba58d3b1b495e9
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/io/mod.rs:1762:15
   7:     0x7fa57b289d44 - std::sys_common::backtrace::_print::hf3a4f110a22f16df
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7fa57b289d44 - std::sys_common::backtrace::print::h0450d1fd5fc83f73
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7fa57b2a6a6a - std::panicking::default_hook::{{closure}}::hee7ec73fab21a529
  10:     0x7fa57b2a670d - std::panicking::default_hook::he65be6b11b67d1e4
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:292:9
  11:     0x7fa57b2a6da8 - std::panicking::rust_panic_with_hook::h9e4f07a5a69c9caf
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:779:13
  12:     0x7fa57b28a33e - std::panicking::begin_panic_handler::{{closure}}::h69a9732dd2e7007d
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:657:13
  13:     0x7fa57b28a176 - std::sys_common::backtrace::__rust_end_short_backtrace::hf159dc40d4738bc4
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x7fa57b2a6ad2 - rust_begin_unwind
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:645:5
  15:     0x7fa57b2009f5 - core::panicking::panic_fmt::hf38ef33e65607e17
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/panicking.rs:72:14
  16:     0x7fa57b201103 - core::result::unwrap_failed::h93afb55b612add5a
                               at /build/rustc-60UC9b/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/result.rs:1653:5
  17:     0x7fa57b20821a - pyluwen::PciChip::new::h6568fd9db638c898
  18:     0x7fa57b226c7f - pyluwen::_::_::__INVENTORY::trampoline::h05d38c449707a19e
  19:           0x5d5f53 - _PyObject_MakeTpCall
  20:           0x54d44a - _PyEval_EvalFrameDefault
  21:           0x54552a - _PyEval_EvalCodeWithName
  22:           0x5d5a23 - _PyFunction_Vectorcall
  23:           0x5483b6 - _PyEval_EvalFrameDefault
  24:           0x5d5846 - _PyFunction_Vectorcall
  25:           0x547265 - _PyEval_EvalFrameDefault
  26:           0x54552a - _PyEval_EvalCodeWithName
  27:           0x684327 - PyEval_EvalCode
  28:           0x673a41 - <unknown>
  29:           0x673abb - <unknown>
  30:           0x673b61 - <unknown>
  31:           0x6747e7 - PyRun_SimpleFileExFlags
  32:           0x6b4072 - Py_RunMain
  33:           0x6b43fd - Py_BytesMain
  34:     0x7fa57d81e083 - __libc_start_main
  35:           0x5da67e - _start
  36:                0x0 - <unknown>
Traceback (most recent call last):
  File "/usr/local/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/tt_smi/tt_smi.py", line 733, in main
    pci_board_reset(args.reset, reinit=True)
  File "/usr/local/lib/python3.8/dist-packages/tt_smi/tt_smi_backend.py", line 523, in pci_board_reset
    chip = PciChip(pci_interface=pci_idx)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 0, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }

Also, if the wrong device ID is given, we should fail more gracefully.
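
One possible shape for that, as a rough sketch and not the actual TT-SMI implementation: validate the requested IDs against /dev/tenstorrent before the device is opened, and report the IDs that do exist. `DeviceNotFoundError` and `check_reset_targets` are hypothetical names introduced here for illustration. Checking up front also avoids relying on catching the panic from pyluwen; as far as I know, PyO3's `PanicException` derives from `BaseException`, so a plain `except Exception` would not catch it.

```python
import os
from typing import Iterable, List


class DeviceNotFoundError(Exception):
    """Raised when a requested device ID has no entry under the device directory."""


def check_reset_targets(
    requested: Iterable[int], dev_dir: str = "/dev/tenstorrent"
) -> List[int]:
    """Return the requested IDs unchanged if every one exists under dev_dir.

    Hypothetical helper: raises DeviceNotFoundError with a readable message
    listing the IDs that are actually present.
    """
    requested = list(requested)
    missing = [
        dev_id
        for dev_id in requested
        if not os.path.exists(os.path.join(dev_dir, str(dev_id)))
    ]
    if missing:
        try:
            available = sorted(
                int(name) for name in os.listdir(dev_dir) if name.isdecimal()
            )
        except FileNotFoundError:
            available = []
        raise DeviceNotFoundError(
            f"No device with ID(s) {missing} under {dev_dir}; "
            f"available IDs: {available if available else 'none'}"
        )
    return requested
```

A caller could run this on the IDs passed to -r before reaching `pci_board_reset`, then print the error message and exit with a non-zero status in place of the Rust backtrace above.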