tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Add handling for Galaxy reset in pytest xdist timeout infrastructure #10711

Open tt-rkim opened 1 month ago

tt-rkim commented 1 month ago

Currently, the pytest timeout infrastructure, when using xdist in the non-post-commit pipeline, uses tt-smi-metal -r {device_ids} to perform its reset.

For Galaxy, this is insufficient. We need to supply tt-smi with the full reset JSON spec of the host + Galaxy in order to properly reset the entire thing.

We are currently doing this properly in the reset and cleanup scripts in CI, but not in conftest.py, where we have the tt-smi reset infrastructure. We first need a way of knowing whether the current system is a Galaxy-type system, and then a way of getting the reset JSON from somewhere.
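A minimal sketch of how the conftest.py side might choose between the two reset forms, for discussion only. The function name `build_reset_command` and the rule "a reset JSON file that exists means use the full-spec reset" are assumptions made for illustration, not existing tt-metal code. Joining the device IDs with commas is also an assumption about how {device_ids} is formatted.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_reset_command(
    device_ids: Sequence[int], reset_json: Optional[str] = None
) -> List[str]:
    """Return the argv for a reset (hypothetical helper).

    If a reset JSON file is supplied and exists, pass it to tt-smi-metal;
    this is the case a Galaxy-type system would need. Otherwise fall back
    to the current per-device form.
    """
    if reset_json is not None and Path(reset_json).is_file():
        return ["tt-smi-metal", "-r", str(reset_json)]
    # Assumption: device IDs are passed as a comma-separated list.
    return ["tt-smi-metal", "-r", ",".join(str(d) for d in device_ids)]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. How a runner would advertise that it is a Galaxy, and where its reset JSON lives, is still the open question here.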

cc: @vtangTT @TT-billteng @ttmchiou @cglagovichTT @aliuTT

ttmchiou commented 1 month ago

a short term patch could be to call the reset.json?

/opt/tt_metal_infra/scripts

tt-smi should support the same reset.json input for single-chip cards, so I think we can add a reset.json for single-chip cards to make this reset mechanism uniform across all CI runners.

Ideally we should try and bring up tt-smi-metal -r all as we previously had though...

TT-billteng commented 1 month ago

I was thinking we could dynamically generate reset.json even for single chip machines and just call tt-smi-metal -r reset.json everywhere
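Roughly, the "generate it everywhere" idea could look like the sketch below. The helper names `write_reset_spec` and `uniform_reset_command` are hypothetical, and the contents of the spec are deliberately left to the caller, since the reset JSON schema is not described in this thread.

```python
import json
import tempfile
from typing import Any, Dict, List


def write_reset_spec(spec: Dict[str, Any]) -> str:
    """Write a caller-supplied reset spec to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", prefix="reset_", suffix=".json", delete=False
    ) as f:
        json.dump(spec, f, indent=2)
        return f.name


def uniform_reset_command(spec: Dict[str, Any]) -> List[str]:
    """Build the same tt-smi-metal -r <file> invocation for any machine type."""
    return ["tt-smi-metal", "-r", write_reset_spec(spec)]
```

With this shape, single-chip and Galaxy runners would differ only in the spec they produce, not in how conftest.py invokes the reset.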