tenstorrent / tt-kmd

Tenstorrent Kernel Module
GNU General Public License v2.0

ARC FW not yet booted after resume from standby #23

Closed afaerber closed 1 month ago

afaerber commented 3 months ago

With a Grayskull e75 card physically installed but no kernel driver installed yet, my workstation refused to go into standby mode.

With the tt-kmd 1.28 kernel module (package) installed on openSUSE Tumbleweed (currently kernel 6.9.5), it does go into standby mode again. However, after resuming, tt-smi 2.2.1 fails with an error:

$ sudo tt-smi
 Detected Chips: 1
 Detecting ARC: -
 Detecting DRAM: |
 [] ETH: |
Gathering Information ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--
Traceback (most recent call last):
  File "/usr/bin/tt-smi", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.11/site-packages/tt_smi/tt_smi.py", line 788, in main
    backend = TTSMIBackend(devices)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/tt_smi/tt_smi_backend.py", line 79, in __init__
    self.smbus_telem_info.append(self.get_smbus_board_info(i))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/tt_smi/tt_smi_backend.py", line 173, in get_smbus_board_info
    telem_struct = pylewen_chip.get_telemetry()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: It is not currently safe to communicate with ARC because, ARC FW has not yet booted
   0: luwen_if::chip::grayskull::Grayskull::check_arc_msg_safe
   1: luwen_if::chip::grayskull::Grayskull::get_telemetry_offset
   2: once_cell::imp::OnceCell<T>::initialize::{{closure}}
   3: once_cell::imp::initialize_or_wait
   4: once_cell::imp::OnceCell<T>::initialize
   5: <luwen_if::chip::grayskull::Grayskull as luwen_if::chip::ChipImpl>::get_telemetry
   6: <luwen_if::chip::Chip as luwen_if::chip::ChipImpl>::get_telemetry
   7: pyluwen::_::<impl pyluwen::PciChip>::__pymethod_get_telemetry__
   8: pyo3::impl_::trampoline::trampoline
   9: <unknown>
  10: PyObject_Vectorcall
  11: _PyEval_EvalFrameDefault
  12: <unknown>
  13: _PyObject_FastCallDictTstate
  14: <unknown>
  15: <unknown>
  16: _PyObject_MakeTpCall
  17: _PyEval_EvalFrameDefault
  18: <unknown>
  19: PyEval_EvalCode
  20: <unknown>
  21: <unknown>
  22: <unknown>
  23: _PyRun_SimpleFileObject
  24: _PyRun_AnyFileObject
  25: Py_RunMain
  26: Py_BytesMain
  27: __libc_start_call_main
  28: __libc_start_main@GLIBC_2.2.5
  29: _start

Standby matters to me because the blower on the idle e75 is much louder than any other fan in my workstation.

As a side note, I had already encountered a similar ARC FW error earlier, during the initial tt-flash run. Flashing worked fine once I added the --force argument, though.

alewycky-tenstorrent commented 3 months ago

Grayskull HW doesn't automatically boot firmware, so we need to do that after resume. I'm looking into whether there is any additional state that should be saved & restored.

afaerber commented 3 months ago

I've found that rmmod tenstorrent; modprobe tenstorrent works around the issue. So it appears to be a pure software issue.
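
Until the driver handles resume itself, the reload workaround could be automated with a systemd system-sleep hook. What follows is only a minimal sketch, not something from tt-kmd: it assumes the system suspends through systemd (which runs executables in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument), the file name tenstorrent-reload and the RUN dry-run variable are made up for illustration, and it assumes nothing is holding the device open when the module is removed.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/systemd/system-sleep/tenstorrent-reload
# systemd-sleep calls it as: <hook> pre|post suspend|hibernate|...
# Set RUN=echo to print the commands instead of executing them (dry run).

reload_on_resume() {
    phase="$1"
    # Only act after wake-up; do nothing before going to sleep.
    [ "$phase" = "post" ] || return 0
    # Same commands as the manual workaround; modprobe runs only if rmmod succeeded.
    ${RUN:-} rmmod tenstorrent && ${RUN:-} modprobe tenstorrent
}

reload_on_resume "$@"
```

The script must be executable and run as root (systemd does the latter). Running it by hand as RUN=echo ./tenstorrent-reload post suspend prints the two commands without touching the module.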