tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

CommonFixture.MatmulLargeBlock hangs on BH with Watcher enabled #12666

Open abhullar-tt opened 4 weeks ago

abhullar-tt commented 4 weeks ago

After https://github.com/tenstorrent/tt-metal/commit/fc8d313510daefc1bb221fb4a6d922799e1a35b7, CommonFixture.MatmulLargeBlock hangs when running the test config "RM input, RM output", but only when that config is preceded by "Tilized input, RM output".

This can be reproduced on main:

TT_METAL_WATCHER=5 ./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter=CommonFixture.MatmulLargeBlock

The hang manifests as the ncrisc FW not getting the go signal to launch the ncrisc kernel, i.e. the hang is in

ncrisc.cc:

while (*ncrisc_run != RUN_SYNC_MSG_GO);

The test passes without Watcher. It also passes with Watcher + DPrint when the dprints are in the compute kernel. It fails with Watcher + DPrint when the dprints are in the ncrisc FW.

rtawfik01 commented 3 weeks ago

@nvelickovicTT FYI, but I'll debug this once I have some time, because it is related to my commit.

nvelickovicTT commented 1 week ago

Update for this issue:

Last commit where this can be consistently reproduced is https://github.com/tenstorrent/tt-metal/commit/8aa197abdc30ff052709bd58cc54df7427dd9053.

With the next commit (https://github.com/tenstorrent/tt-metal/commit/0679c1988385f6ac0a78008f14be57b9298d40bd) it starts only intermittently hanging.

And with current main (https://github.com/tenstorrent/tt-metal/commit/0d5f4889313d3ef8f86c1555fc378ac9ee81454f) I wasn't able to reproduce the hang.

Also, the Watcher + DPrint combination didn't have any effect in my testing. This might be card related.
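Since the hang became intermittent on some commits, one way to quantify it is to run the test repeatedly under a time limit and count the runs that time out. The following is only a minimal sketch: `count_hangs` is a hypothetical helper (not part of tt-metal), it assumes GNU coreutils `timeout` is available, and it treats any run that exceeds the limit as a hang.

```shell
# Hypothetical helper (not part of tt-metal): run a command several times
# under a time limit and report how many runs timed out.
# Usage: count_hangs <runs> <timeout_seconds> <command> [args...]
count_hangs() {
    ch_runs=$1
    ch_limit=$2
    shift 2
    ch_hangs=0
    ch_i=0
    while [ "$ch_i" -lt "$ch_runs" ]; do
        # Discard the command's output; only the exit status matters here.
        timeout "$ch_limit" "$@" >/dev/null 2>&1
        # GNU timeout exits with status 124 when the command ran past the limit.
        if [ "$?" -eq 124 ]; then
            ch_hangs=$((ch_hangs + 1))
        fi
        ch_i=$((ch_i + 1))
    done
    echo "$ch_hangs/$ch_runs runs timed out"
}
```

With the reproduction command from this issue, it could be called as `count_hangs 20 300 env TT_METAL_WATCHER=5 ./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter=CommonFixture.MatmulLargeBlock`. The run count (20) and the limit (300 seconds) are arbitrary placeholders; the limit needs to be comfortably longer than a passing run on the machine in question.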

abhullar-tt commented 1 week ago

0d5f488

On the commits where it was hanging, were you able to reproduce the hang on different machines?

nvelickovicTT commented 1 week ago

Yes, I tried on 3 different BH machines, and got the same result.