tenstorrent / tt-smi

Tenstorrent console based hardware information program
Apache License 2.0

Resetting N300 board sporadically fails #26

Closed TT-billteng closed 4 months ago

TT-billteng commented 4 months ago

I'm working on an N300 VM instance and running some repeatability tests. My script starts by resetting the board, but the reset occasionally fails, which stops my tests.

Just running tt-smi in a loop doesn't work reliably:

ubuntu@tt-metal-dev-andrew-1:~$ for i in {1..10}; do tt-smi-metal -r 0; done
 Starting pci link reset on WH devices at pci indices: 0 
 Finishing pci link reset on WH devices at pci indices: 0 
 Re-initializing boards after reset.... 
 Detected Chips: 2
 Starting pci link reset on WH devices at pci indices: 0 
 Finishing pci link reset on WH devices at pci indices: 0 
 Re-initializing boards after reset.... 
 Detected Chips: 2
 Error when re-initializing chips!
 Chip initialization failed
   0: luwen_if::chip::init::wait_for_init
   1: luwen_if::detect_chips::detect_chips
   2: pyluwen::detect_chips_fallible
   3: pyluwen::_::__pyfunction_detect_chips_fallible
   4: pyo3::impl_::trampoline::trampoline
   5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
   6: <unknown>
   7: _PyEval_EvalFrameDefault
   8: _PyEval_EvalCodeWithName
   9: _PyFunction_Vectorcall
  10: _PyEval_EvalFrameDefault
  11: _PyEval_EvalCodeWithName
  12: _PyFunction_Vectorcall
  13: _PyEval_EvalFrameDefault
  14: _PyFunction_Vectorcall
  15: _PyEval_EvalFrameDefault
  16: _PyEval_EvalCodeWithName
  17: PyEval_EvalCode
  18: <unknown>
  19: <unknown>
  20: <unknown>
  21: PyRun_SimpleFileExFlags
  22: Py_RunMain
  23: Py_BytesMain
  24: __libc_start_main
             at /build/glibc-e2p3jK/glibc-2.31/csu/../csu/libc-start.c:308:16
  25: _start

This fails roughly half the time on my VM.
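For test scripts that should not stop on a transient failure, one possible workaround is to retry the reset a few times. The following is a minimal sketch in Python, assuming that a failed reset either exits with a non-zero status or prints the `Error when re-initializing chips!` line seen in the log above. Whether tt-smi actually returns a non-zero exit code in this case has not been verified here, and the function name `reset_with_retries`, the attempt count, and the delay are hypothetical choices.

```python
import subprocess
import time

# Marker string copied from the failure log above.
FAILURE_MARKER = "Error when re-initializing chips!"


def reset_with_retries(cmd, attempts=3, delay_s=5.0):
    """Run `cmd` until it succeeds or `attempts` runs have been made.

    Returns the 1-based number of the attempt that succeeded and
    raises RuntimeError if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        # Treat either signal as a failure (both are assumptions, see above).
        failed = result.returncode != 0 or FAILURE_MARKER in result.stdout
        if not failed:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)
    raise RuntimeError("reset failed after %d attempts" % attempts)
```

With the command from the loop above, the call would be `reset_with_retries(["tt-smi-metal", "-r", "0"])`. This only hides the failure from the test script; it does not address the cause.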

{
    "time": "2024-05-18T01:43:38.093652",
    "host_info": {
        "OS": "Linux",
        "Distro": "Ubuntu 20.04.3 LTS",
        "Kernel": "5.4.0-182-generic",
        "Hostname": "tt-metal-dev-andrew-1",
        "Platform": "x86_64",
        "Python": "3.8.10",
        "Memory": "47.14 GB",
        "Driver": "TTKMD 1.26"
    },
    "device_info": [
        {
            "smbus_telem": {
                "BOARD_ID": "0x10001451170c10d",
                "SMBUS_TX_ENUM_VERSION": "0xba5e0001",
                "SMBUS_TX_DEVICE_ID": "0x401e1e52",
                "SMBUS_TX_ASIC_RO": "0xd82",
                "SMBUS_TX_ASIC_IDD": "0x2f49e",
                "SMBUS_TX_BOARD_ID_HIGH": "0x1000145",
                "SMBUS_TX_BOARD_ID_LOW": "0x1170c10d",
                "SMBUS_TX_ARC0_FW_VERSION": "0x20f0000",
                "SMBUS_TX_ARC1_FW_VERSION": "0x20f0000",
                "SMBUS_TX_ARC2_FW_VERSION": null,
                "SMBUS_TX_ARC3_FW_VERSION": "0x20f0000",
                "SMBUS_TX_SPIBOOTROM_FW_VERSION": "0x3070001",
                "SMBUS_TX_ETH_FW_VERSION": "0x63000",
                "SMBUS_TX_M3_BL_FW_VERSION": "0x81020000",
                "SMBUS_TX_M3_APP_FW_VERSION": "0x5060000",
                "SMBUS_TX_DDR_SPEED": null,
                "SMBUS_TX_DDR_STATUS": "0x2222222",
                "SMBUS_TX_ETH_STATUS0": "0x11111111",
                "SMBUS_TX_ETH_STATUS1": "0x11111133",
                "SMBUS_TX_PCIE_STATUS": "0x11040000",
                "SMBUS_TX_FAULTS": null,
                "SMBUS_TX_ARC0_HEALTH": "0x66f79",
                "SMBUS_TX_ARC1_HEALTH": "0x2f7fe",
                "SMBUS_TX_ARC2_HEALTH": null,
                "SMBUS_TX_ARC3_HEALTH": "0x54f",
                "SMBUS_TX_FAN_SPEED": "0xffffffff",
                "SMBUS_TX_AICLK": "0x32001f4",
                "SMBUS_TX_AXICLK": "0x384",
                "SMBUS_TX_ARCCLK": "0x21c",
                "SMBUS_TX_THROTTLER": null,
                "SMBUS_TX_VCORE": "0x2d0",
                "SMBUS_TX_ASIC_TEMPERATURE": "0x32902e2",
                "SMBUS_TX_VREG_TEMPERATURE": "0x31002a",
                "SMBUS_TX_BOARD_TEMPERATURE": "0x292928",
                "SMBUS_TX_TDP": "0x550009",
                "SMBUS_TX_TDC": "0xa0000e",
                "SMBUS_TX_VDD_LIMITS": "0x3b602d0",
                "SMBUS_TX_THM_LIMITS": "0x53004b",
                "SMBUS_TX_WH_FW_DATE": "0x38080c2d",
                "SMBUS_TX_ASIC_TMON0": "0x2f262e32",
                "SMBUS_TX_ASIC_TMON1": "0x342c",
                "SMBUS_TX_MVDDQ_POWER": "0x193bfa",
                "SMBUS_TX_GDDR_TRAIN_TEMP0": "0x2426262e",
                "SMBUS_TX_GDDR_TRAIN_TEMP1": "0x2222",
                "SMBUS_TX_BOOT_DATE": "0xda7e",
                "SMBUS_TX_RT_SECONDS": "0x89",
                "SMBUS_TX_AUX_STATUS": null,
                "SMBUS_TX_ETH_DEBUG_STATUS0": "0x11111111",
                "SMBUS_TX_ETH_DEBUG_STATUS1": "0x11111100",
                "SMBUS_TX_TT_FLASH_VERSION": "0x70d0000"
            },
            "board_info": {
                "bus_id": "0000:07:00.0",
                "board_type": "n300 L",
                "board_id": "010001451170c10d",
                "coords": "(0, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": 4,
                "pcie_width": 16
            },
            "telemetry": {
                "voltage": "0.72",
                "current": " 14.0",
                "power": "  9.0",
                "aiclk": " 500",
                "asic_temperature": "46.1"
            },
            "firmwares": {
                "arc_fw": "2.15.0.0",
                "arc_fw_date": "2023-08-08",
                "eth_fw": "6.3.0",
                "m3_bl_fw": "129.2.0.0",
                "m3_app_fw": "5.6.0.0",
                "tt_flash_version": "7.13.0.0"
            },
            "limits": {
                "vdd_min": "0.72",
                "vdd_max": "0.95",
                "tdp_limit": " 85",
                "tdc_limit": "160",
                "asic_fmax": " 800",
                "therm_trip_l1_limit": "83",
                "thm_limit": "75",
                "bus_peak_limit": null
            }
        },
        {
            "smbus_telem": {
                "BOARD_ID": "0x10001451170c10d",
                "SMBUS_TX_ENUM_VERSION": "0xba5e0001",
                "SMBUS_TX_DEVICE_ID": null,
                "SMBUS_TX_ASIC_RO": "0xb19",
                "SMBUS_TX_ASIC_IDD": "0x2ea3d",
                "SMBUS_TX_BOARD_ID_HIGH": "0x1000145",
                "SMBUS_TX_BOARD_ID_LOW": "0x1170c10d",
                "SMBUS_TX_ARC0_FW_VERSION": "0x20f0000",
                "SMBUS_TX_ARC1_FW_VERSION": "0x20f0000",
                "SMBUS_TX_ARC2_FW_VERSION": null,
                "SMBUS_TX_ARC3_FW_VERSION": "0x20f0000",
                "SMBUS_TX_SPIBOOTROM_FW_VERSION": "0x3070001",
                "SMBUS_TX_ETH_FW_VERSION": "0x63000",
                "SMBUS_TX_M3_BL_FW_VERSION": "0x81020000",
                "SMBUS_TX_M3_APP_FW_VERSION": "0x5060000",
                "SMBUS_TX_DDR_SPEED": null,
                "SMBUS_TX_DDR_STATUS": "0x2222222",
                "SMBUS_TX_ETH_STATUS0": "0x11111122",
                "SMBUS_TX_ETH_STATUS1": "0x11111111",
                "SMBUS_TX_PCIE_STATUS": null,
                "SMBUS_TX_FAULTS": null,
                "SMBUS_TX_ARC0_HEALTH": "0x714fe",
                "SMBUS_TX_ARC1_HEALTH": "0x2f7b6",
                "SMBUS_TX_ARC2_HEALTH": null,
                "SMBUS_TX_ARC3_HEALTH": "0x54f",
                "SMBUS_TX_FAN_SPEED": "0xffffffff",
                "SMBUS_TX_AICLK": "0x32001f4",
                "SMBUS_TX_AXICLK": "0x384",
                "SMBUS_TX_ARCCLK": "0x21c",
                "SMBUS_TX_THROTTLER": null,
                "SMBUS_TX_VCORE": "0x2d0",
                "SMBUS_TX_ASIC_TEMPERATURE": "0x2980265",
                "SMBUS_TX_VREG_TEMPERATURE": "0x2e0028",
                "SMBUS_TX_BOARD_TEMPERATURE": "0x292928",
                "SMBUS_TX_TDP": "0x550006",
                "SMBUS_TX_TDC": "0xa00009",
                "SMBUS_TX_VDD_LIMITS": "0x3b602d0",
                "SMBUS_TX_THM_LIMITS": "0x53004b",
                "SMBUS_TX_WH_FW_DATE": "0x38080c2d",
                "SMBUS_TX_ASIC_TMON0": "0x242b2223",
                "SMBUS_TX_ASIC_TMON1": "0x2a28",
                "SMBUS_TX_MVDDQ_POWER": "0x1988d8",
                "SMBUS_TX_GDDR_TRAIN_TEMP0": "0x20262624",
                "SMBUS_TX_GDDR_TRAIN_TEMP1": "0x2624",
                "SMBUS_TX_BOOT_DATE": "0xda7e",
                "SMBUS_TX_RT_SECONDS": "0x89",
                "SMBUS_TX_AUX_STATUS": null,
                "SMBUS_TX_ETH_DEBUG_STATUS0": "0x11111100",
                "SMBUS_TX_ETH_DEBUG_STATUS1": "0x11111111",
                "SMBUS_TX_TT_FLASH_VERSION": "0x70d0000"
            },
            "board_info": {
                "bus_id": "N/A",
                "board_type": "n300 R",
                "board_id": "010001451170c10d",
                "coords": "(1, 0, 0, 0)",
                "dram_status": true,
                "dram_speed": "12G",
                "pcie_speed": "N/A",
                "pcie_width": "N/A"
            },
            "telemetry": {
                "voltage": "0.72",
                "current": "  9.0",
                "power": "  6.0",
                "aiclk": " 500",
                "asic_temperature": "38.3"
            },
            "firmwares": {
                "arc_fw": "2.15.0.0",
                "arc_fw_date": "2023-08-08",
                "eth_fw": "6.3.0",
                "m3_bl_fw": "129.2.0.0",
                "m3_app_fw": "5.6.0.0",
                "tt_flash_version": "7.13.0.0"
            },
            "limits": {
                "vdd_min": "0.72",
                "vdd_max": "0.95",
                "tdp_limit": " 85",
                "tdc_limit": "160",
                "asic_fmax": " 800",
                "therm_trip_l1_limit": "83",
                "thm_limit": "75",
                "bus_peak_limit": null
            }
        }
    ]
}

sbansalTT commented 4 months ago

Hey Bill, thanks for filing this issue. I think I might know what is causing it. It looks like your board is set up to be part of a bigger mesh; it cannot find the remainder of the chips, and so it fails. You can tell because the coords are expected to be (0,0,0,0) and (0,1,0,0) for a standalone nb300, but yours are (0,0,0,0) and (1,0,0,0). If possible, I would recommend flashing your board again with tt-flash and trying your experiment again.
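The coordinate check described above can be run against a saved snapshot. This is a rough sketch, assuming the snapshot has the same JSON layout as the one posted in this issue (`device_info` entries with a `board_info.coords` string). The expected values are the ones quoted in the comment above, and the helper names are hypothetical.

```python
import ast
import json

# Expected coordinates for a standalone board, as quoted in the comment above.
STANDALONE_COORDS = {(0, 0, 0, 0), (0, 1, 0, 0)}


def snapshot_coords(snapshot_text):
    """Return the set of coords tuples found in a snapshot JSON string."""
    snapshot = json.loads(snapshot_text)
    return {
        # coords is stored as a string such as "(1, 0, 0, 0)".
        ast.literal_eval(dev["board_info"]["coords"])
        for dev in snapshot["device_info"]
    }


def is_standalone_layout(snapshot_text):
    """True if the snapshot reports exactly the standalone coordinates."""
    return snapshot_coords(snapshot_text) == STANDALONE_COORDS


# Trimmed-down version of the snapshot posted in this issue.
example = """
{"device_info": [
  {"board_info": {"board_type": "n300 L", "coords": "(0, 0, 0, 0)"}},
  {"board_info": {"board_type": "n300 R", "coords": "(1, 0, 0, 0)"}}
]}
"""
print(is_standalone_layout(example))  # False
```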

TT-billteng commented 4 months ago

Thanks @sbansalTT, what do the 4 coordinates represent, and how do you think the board could have reached this state? I'll try flashing again.

sbansalTT commented 4 months ago

The coordinates represent the position of the board in the mesh: (x, y, rack, shelf). You can use the tool tt-topology to program these coordinates depending on which multichip setup you would like to use. Most likely the board you were using was previously part of a multichip system and was not flashed back to its original state before being repurposed as your dev board.
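To make the field order concrete, here is a small sketch that names the four fields and reports which ones differ between the coordinates in the snapshot and the standalone values quoted earlier in the thread. The `MeshCoord` type and the `differing_fields` helper are hypothetical and not part of tt-smi or tt-topology.

```python
from typing import NamedTuple


class MeshCoord(NamedTuple):
    # Field order as described in the comment above.
    x: int
    y: int
    rack: int
    shelf: int


def differing_fields(actual, expected):
    """Names of the fields where two coordinates disagree."""
    return [
        name
        for name in MeshCoord._fields
        if getattr(actual, name) != getattr(expected, name)
    ]


reported = MeshCoord(1, 0, 0, 0)  # "n300 R" chip in the posted snapshot
expected = MeshCoord(0, 1, 0, 0)  # standalone value quoted earlier in the thread
print(differing_fields(reported, expected))  # ['x', 'y']
```

Under this reading, the second chip sits at x=1 rather than y=1, while rack and shelf match.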

TT-billteng commented 4 months ago

OK, I upgraded the FW. The coordinates still aren't fixed, but I can reset reliably now 🤔. I guess the coordinates aren't important to the reset process anymore?