tenstorrent / tt-smi

Tenstorrent console based hardware information program
Apache License 2.0

tt-smi -g hangs when remote chips inaccessible #25

Open hmohiuddinTT opened 6 months ago

hmohiuddinTT commented 6 months ago

Summary

This is a chicken and egg problem:

  1. User sets up TG and tries to reset the system to get it into a good state.
  2. In order to reset they need a reset_config.json.
  3. In order to generate a reset_config.json they run tt-smi -g.
  4. Since boards are not able to talk to remote chips yet, this will hang.

TL;DR: Cannot generate the reset_config.json needed for resetting without first resetting.

hmohiuddinTT commented 6 months ago

Ok nvm, it looks like I just need to wait for the training to time out:

ansible@g14cs03:~$ tt-smi -g
 Detected Chips: 4
 Generated sample reset config file for this host: /home/ansible/.config/tenstorrent/reset_config.json 
 Update the generated file and use it as an input for the -r/--reset option. 
ansible@g14cs03:~$ cat ~/.config/tenstorrent/reset_config.json 
{
    "time": "2024-05-15T23:17:08.833710",
    "host_name": "g14cs03",
    "gs_tensix_reset": {
        "pci_index": []
    },
    "wh_link_reset": {
        "pci_index": [
            0,
            1,
            2,
            3
        ]
    },
    "re_init_devices": true,
    "wh_mobo_reset": [
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        },
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        }
    ]
}
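
The generated file still contains template values such as "<MOBO NAME>" and "<group id>:<credo id>" that have to be filled in before it is passed to -r/--reset. A minimal sketch for listing the fields that still hold a placeholder, assuming placeholders are always written in angle brackets as in the output above (find_placeholders is a hypothetical helper, not part of tt-smi):

```python
import json
import re
from pathlib import Path

# Assumption: unfilled template values look like "<MOBO NAME>" or "<group id>:<credo id>".
PLACEHOLDER = re.compile(r"<[^<>]+>")


def find_placeholders(node, path=""):
    """Return (json_path, value) pairs for every string that still holds a placeholder."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_placeholders(value, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        hits.append((path, node))
    return hits


if __name__ == "__main__":
    # Path taken from the tt-smi -g output above.
    config_path = Path.home() / ".config" / "tenstorrent" / "reset_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        for json_path, value in find_placeholders(config):
            print(f"{json_path}: {value}")
    else:
        print(f"{config_path} not found; run tt-smi -g first")
```

An empty result only means no angle-bracket placeholders remain; it does not check that the mobo names or credo ids are valid for the system.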
hmohiuddinTT commented 6 months ago

Might still be useful to have a countdown or timeout in tt-smi while the training is ongoing.
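
Until something like that exists in tt-smi itself, the idea can be approximated from the outside. A rough sketch, assuming a plain wrapper script is acceptable: run_with_countdown, timeout_s and poll_s are hypothetical names, and the 300 second default is an arbitrary guess, since the thread does not say how long the training timeout is.

```python
import subprocess
import sys
import time


def run_with_countdown(cmd, timeout_s=300.0, poll_s=1.0, out=sys.stderr):
    """Run cmd, printing the time left until it exits or timeout_s elapses.

    Returns the exit code, or None if the command was killed at the deadline.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Deadline reached: stop the command and reap it.
            proc.kill()
            proc.wait()
            print(file=out)
            return None
        print(f"\rwaiting for {cmd[0]}: {remaining:.0f}s left ", end="", file=out, flush=True)
        try:
            # wait() returns as soon as the process exits, so a fast command is not delayed.
            code = proc.wait(timeout=min(poll_s, remaining))
        except subprocess.TimeoutExpired:
            continue
        print(file=out)
        return code


# Example (not executed here): run_with_countdown(["tt-smi", "-g"], timeout_s=600)
```

Since tt-smi -g does finish once the training times out, the deadline should be set generously; killing it early just leaves you without a reset_config.json.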

sbansalTT commented 6 months ago

Yeah, thanks for raising this. I don't think generating this config file should involve any eth training detection; I'll take a look and see if I can separate the two. I'll also add some kind of indicator telling users to wait.