tenstorrent / tt-smi

Tenstorrent console based hardware information program
Apache License 2.0

Add quality of life feature to reset Nebula-only machine configs without specifying a JSON or MMIO IDs #5

Closed tt-rkim closed 3 months ago

tt-rkim commented 7 months ago

You have to specify a JSON config or MMIO IDs, even when resetting a machine without an attached Galaxy.

For example, for two Nebulas connected by PCIe, you currently have to run

tt-smi -r 0,1

as opposed to something like

tt-smi -r

From the impression I got from internal convos with SysEng people, this should be doable and not that complicated for Nebula-only setups on a single host. I hope I'm not wrong.

sbansalTT commented 7 months ago

So if I'm understanding correctly, you are looking for both ways to perform a reset: with a JSON config file, and without a config file by directly providing the PCIe indices of the boards. Please correct me if I misunderstood the question. As of release 2.0.0 you can do both! Link to instructions: https://github.com/tenstorrent/tt-smi?tab=readme-ov-file#resets. Do let me know if this answers your question.

tt-rkim commented 7 months ago

Hey, in this issue, I'm specifically asking for a 3rd option: without a JSON config file and without providing the PCIe indices of the boards, by defaulting to all PCIe devices found on host.
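To make the request concrete, here is a rough sketch of the kind of wrapper our automation would otherwise need. It is only an illustration, not how tt-smi works internally. It assumes the kernel driver exposes one node per PCIe device as `/dev/tenstorrent/<index>` (an assumption about the host setup), and the helper names `pci_indices_from_names`, `build_reset_command`, `discover_indices`, and `reset_all` are made up for this example. The only tt-smi invocation it relies on is the `tt-smi -r 0,1` form shown above.

```python
import re
import subprocess
from pathlib import Path

# Assumption: one device node per PCIe device, named by its index,
# e.g. /dev/tenstorrent/0 and /dev/tenstorrent/1. Adjust for your host.
DEVICE_DIR = Path("/dev/tenstorrent")


def pci_indices_from_names(names):
    """Return the sorted integer indices among purely numeric node names."""
    return sorted(int(n) for n in names if re.fullmatch(r"[0-9]+", n))


def build_reset_command(indices):
    """Build a `tt-smi -r 0,1`-style command for the given PCIe indices."""
    if not indices:
        raise ValueError("no Tenstorrent PCIe devices found")
    return ["tt-smi", "-r", ",".join(str(i) for i in indices)]


def discover_indices(device_dir=DEVICE_DIR):
    """List PCIe indices from the device directory (empty if it is missing)."""
    if not device_dir.is_dir():
        return []
    return pci_indices_from_names(p.name for p in device_dir.iterdir())


def reset_all(device_dir=DEVICE_DIR):
    """Hypothetical 'reset everything on this host' helper."""
    cmd = build_reset_command(discover_indices(device_dir))
    # Raises CalledProcessError if tt-smi exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `reset_all()` on a host with two Nebulas would run the same `tt-smi -r 0,1` command as above, without the caller needing to know the indices up front.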

sbansalTT commented 7 months ago

Hmm, I see. I am not fully convinced of this 3rd use case. What I can do is have the default -r option look for a JSON reset config file in the parent folder, without needing one provided, and use that to perform the reset. I want users to be as explicit with the reset as possible, since we are using it to support varied use cases and there shouldn't be ambiguity about what is expected from smi.

tt-rkim commented 7 months ago

That's fair - then I'll go ahead and close this issue soon. It was more so to make our automation a little more convenient, but I think as part of provisioning we can:

Let me know if that's a flow that doesn't really make sense.

tt-rkim commented 3 months ago

@TT-billteng @vtangTT

Here is the old issue about having an all reset

TT-billteng commented 3 months ago

I feel like resetting all the available TT boards in a system should be a valid option. We've had to maintain separate scripts depending on machine configuration just to ensure all boards are in a good state.