tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Blackhole] ND hangs on BH CI #11623

Closed by abhullar-tt 1 month ago

abhullar-tt commented 2 months ago

Opening this issue to track non-deterministic (ND) hangs in Blackhole (BH) CI:

e.g. run: https://github.com/tenstorrent/tt-metal/actions/runs/10446932508

The hanging tests pass 10x on IRD (yyzo-bh-26):

TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast --gtest_repeat=10

./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter=CommonFixture.MatmulLargeBlock --gtest_repeat=10
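
A local repeat like the two commands above reports pass or fail, but a hang simply blocks the shell. As a minimal sketch (not part of tt-metal; the helper name `repeat_with_timeout` and the 600-second limit are hypothetical choices for illustration), the loop below wraps each iteration in coreutils `timeout`, so a hung run is counted instead of stalling the whole session:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command N times, each under a wall-clock limit,
# and count the iterations that timed out (treated here as hangs).
repeat_with_timeout() {
    local iterations="$1" limit="$2"
    shift 2
    local hangs=0 failures=0 i status
    for ((i = 1; i <= iterations; i++)); do
        # coreutils `timeout` exits with status 124 when the limit is reached.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
        elif [ "$status" -ne 0 ]; then
            failures=$((failures + 1))
        fi
    done
    echo "runs=$iterations hangs=$hangs failures=$failures"
    # Succeed only if every iteration finished cleanly within the limit.
    [ "$hangs" -eq 0 ] && [ "$failures" -eq 0 ]
}

# Example invocations (assumed 600 s limit per iteration; adjust as needed):
#   repeat_with_timeout 10 600 env TT_METAL_SLOW_DISPATCH_MODE=1 \
#       ./build/test/tt_metal/unit_tests \
#       --gtest_filter=CommonFixture.MatmulMultiCoreMultiDRAMIn0MCastIn1MCast
#   repeat_with_timeout 10 600 ./build/test/tt_metal/unit_tests_fast_dispatch \
#       --gtest_filter=CommonFixture.MatmulLargeBlock
```

Because the loop itself repeats the test, `--gtest_repeat` is dropped in the example invocations. Each iteration then starts a fresh process, which differs from repeating inside a single gtest process and may or may not matter for reproducing the hang.
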
abhullar-tt commented 2 months ago

@ttmchiou also saw the gh01 and gh02 runners behave differently: tests that pass on gh01 hang on gh02.

ttmchiou commented 2 months ago

For reference, we should also explore whether this ND behavior is caused by test cases that ran earlier. Note that gh02 was not running in CI when this ND bug was filed.

abhullar-tt commented 2 months ago

> For reference, we should also explore whether this ND behavior is caused by test cases that ran earlier. Note that gh02 was not running in CI when this ND bug was filed.

Ran test_cpp_unit_tests.sh for 10 iterations in both slow dispatch and fast dispatch modes on bh-26 without issue.

abhullar-tt commented 2 months ago

List of ND hangs:

https://github.com/tenstorrent/tt-metal/actions/runs/10484423109/job/29039018451
https://github.com/tenstorrent/tt-metal/actions/runs/10546030345/job/29217161896
https://github.com/tenstorrent/tt-metal/actions/runs/10570544383/job/29285410261

abhullar-tt commented 2 months ago

Seeing the same hang on machine gh02: https://github.com/tenstorrent/tt-metal/actions/runs/10637163877/job/29490937662

abhullar-tt commented 1 month ago

Closing this because we haven't seen ND behaviour beyond #12187, which was addressed by enabling the cmd buffer FIFO.