Open abhullar-tt opened 4 weeks ago
@nvelickovicTT fyi, but I'll debug this once I have some time because it is related to my commit.
Update for this issue:
Last commit where this can be consistently reproduced is https://github.com/tenstorrent/tt-metal/commit/8aa197abdc30ff052709bd58cc54df7427dd9053.
With the next commit (https://github.com/tenstorrent/tt-metal/commit/0679c1988385f6ac0a78008f14be57b9298d40bd) it starts only intermittently hanging.
And with current main (https://github.com/tenstorrent/tt-metal/commit/0d5f4889313d3ef8f86c1555fc378ac9ee81454f) I wasn't able to reproduce the hang.
Also the Watcher+DPrint combination didn't have any effect while I tested. Might be card related.
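Since the hang goes from consistent to intermittent between those commits, it can help to put a number on it per commit. A rough sketch, assuming the test can be launched as a single command and that a run exceeding the timeout counts as a hang; `count_hangs`, the run count, and the timeout are my own placeholders, not anything from tt-metal. It also does not reset the card between runs, which may be needed after a real hang.

```python
import subprocess
import sys


def count_hangs(cmd, runs=20, timeout_s=120):
    """Run `cmd` `runs` times and count the runs that exceed `timeout_s` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises TimeoutExpired on timeout.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Usage: python count_hangs.py <test command and arguments>
    cmd = sys.argv[1:]
    if cmd:
        runs = 20
        print(f"{count_hangs(cmd, runs=runs)}/{runs} runs timed out")
```

Running this once per commit gives a hang count out of 20, which is easier to compare across commits and machines than "intermittent".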
On the commits where it was hanging, were you able to reproduce the hang on different machines?
Yes, I tried on 3 different BH machines, and got the same result.
After https://github.com/tenstorrent/tt-metal/commit/fc8d313510daefc1bb221fb4a6d922799e1a35b7
CommonFixture.MatmulLargeBlock
hangs when running test config "RM input, RM output" only when it is preceded by "Tilized input, RM output". This can be reproduced on main:
The hang manifests as the ncrisc FW not getting the go signal to launch the ncrisc kernel, i.e. the hang is in
The test passes without Watcher. It also passes with Watcher + DPrint with dprints in the compute kernel. It fails with Watcher + DPrint with dprints in the ncrisc FW.