Open hmohiuddinTT opened 6 months ago
Ok nvm, it looks like I just need to wait for the training to time out:
ansible@g14cs03:~$ tt-smi -g
Detected Chips: 4
Generated sample reset config file for this host: /home/ansible/.config/tenstorrent/reset_config.json
Update the generated file and use it as an input for the -r/--reset option.
ansible@g14cs03:~$ cat ~/.config/tenstorrent/reset_config.json
{
    "time": "2024-05-15T23:17:08.833710",
    "host_name": "g14cs03",
    "gs_tensix_reset": {
        "pci_index": []
    },
    "wh_link_reset": {
        "pci_index": [
            0,
            1,
            2,
            3
        ]
    },
    "re_init_devices": true,
    "wh_mobo_reset": [
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        },
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        }
    ]
}
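Since the generated file still contains template values such as "<MOBO NAME>", a quick check for unfilled fields before passing it to -r/--reset might look like the sketch below. This is not part of tt-smi; `find_placeholders` and the trimmed `sample` config are hypothetical names used only for illustration, and the check assumes placeholders are any string containing both "<" and ">".

```python
import json


def find_placeholders(node, path=""):
    """Return the paths of string values that still look like <PLACEHOLDER> text.

    Hypothetical helper, not a tt-smi API. It walks the parsed JSON recursively.
    """
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits += find_placeholders(value, f"{path}[{i}]")
    elif isinstance(node, str) and "<" in node and ">" in node:
        hits.append(path)
    return hits


# Trimmed-down version of the generated config shown above (assumed shape).
sample = json.loads("""
{
    "wh_link_reset": {"pci_index": [0, 1, 2, 3]},
    "re_init_devices": true,
    "wh_mobo_reset": [
        {
            "nb_host_pci_idx": [0, 1, 2, 3],
            "mobo": "<MOBO NAME>",
            "credo": ["<group id>:<credo id>"]
        }
    ]
}
""")

unfilled = find_placeholders(sample)
```

For the real file, the same function could be applied to the result of `json.load` on `~/.config/tenstorrent/reset_config.json`.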
Might still be useful to have a countdown or timeout in tt-smi while the training is ongoing.
Yeah, thanks for raising this. I don't think generating this config file should involve any eth training detection, so I'll take a look and see if I can separate the two. I'll also add some kind of indicator telling users to wait.
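A rough sketch of what such a wait indicator could look like is below. It is only an illustration under stated assumptions: `wait_with_countdown` and the `is_ready` callback are hypothetical, and how tt-smi actually detects that ethernet training has finished is not shown in this thread.

```python
import time


def wait_with_countdown(is_ready, timeout_s, poll_s=1.0, report=print,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Reports the remaining time on every poll so the user can see that the
    tool is waiting rather than hung. Returns True if ready, False on timeout.
    is_ready is a hypothetical callback standing in for the training check.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        report(f"Waiting for ethernet training... {remaining:.0f}s left")
        # Never sleep past the deadline.
        sleep(min(poll_s, remaining))
```

The clock and sleep functions are passed in as parameters so the loop can be exercised in tests without actually waiting.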
Summary
This is a chicken-and-egg problem:
TLDR: Cannot generate the reset_config needed for resetting without first resetting.