tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Blackhole bringup] Hang in transpose tests #9163

Closed: abhullar-tt closed this 1 week ago

rtawfik01 commented 2 weeks ago

This test:

TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/test_transpose_wh

passes after these commits: 9d208cc81f & 5212c7d862

but this test:

TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/test_transpose_hc

actually hits a runtime error:

                 Always | FATAL    | Watcher detected NOC error and stopped device: bad alignment in NOC transaction.
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/debug/watcher_server.cpp:367: tt::exception
info:
Watcher detected NOC error and stopped device: bad alignment in NOC transaction.

@abhullar-tt @aliuTT let me know if you can take a look at the second one
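
For anyone unfamiliar with what "bad alignment" means here, a generic sketch of an address alignment check follows. This is not the Watcher's implementation, and `ASSUMED_ALIGNMENT_BYTES`, `is_aligned`, and `align_up` are hypothetical names; the 16-byte value is a placeholder, since the actual NOC alignment requirement on Blackhole is not stated in this thread.

```python
# Placeholder value for illustration only. The real NOC alignment requirement
# depends on the architecture and memory type and is not given in this thread.
ASSUMED_ALIGNMENT_BYTES = 16


def _check_power_of_two(alignment):
    """Reject alignments that are not a positive power of two."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError(f"alignment must be a power of two, got {alignment}")


def is_aligned(addr, alignment=ASSUMED_ALIGNMENT_BYTES):
    """Return True if addr is a multiple of alignment."""
    _check_power_of_two(alignment)
    return addr & (alignment - 1) == 0


def align_up(addr, alignment=ASSUMED_ALIGNMENT_BYTES):
    """Round addr up to the next multiple of alignment."""
    _check_power_of_two(alignment)
    return (addr + alignment - 1) & ~(alignment - 1)
```

Under this assumed 16-byte alignment, `is_aligned(0x1004)` is false and `align_up(0x1004)` gives `0x1010`. A transaction whose address fails a check of this kind is the sort of thing the error message points at.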

rtawfik01 commented 2 weeks ago

These other 2 transpose tests:

TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_transpose_wh_single_core 
TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_transpose_wh_multi_core 

are related to the hang tracked here: #9941

abhullar-tt commented 2 weeks ago

TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_transpose_wh_single_core
TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=all TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_transpose_wh_multi_core

These tests pass after applying the fix on abhullar/noc-header.
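
To re-check all four transpose tests from this thread in one run, a small wrapper like the one below might help. It is only a sketch: the environment variables and binary paths are copied from the commands above, while `run_one`, the 300-second timeout, and the "hang" classification are assumptions made for illustration and are not part of tt-metal.

```python
import os
import subprocess
import sys

# Environment variables copied from the commands in this thread.
WATCHER_ENV = {
    "TT_METAL_WATCHER": "5",
    "TT_METAL_DPRINT_CORES": "all",
    "TT_METAL_SLOW_DISPATCH_MODE": "1",
}

# Test binaries named in this thread, relative to the repository root.
TRANSPOSE_TESTS = [
    "./build/test/tt_metal/test_transpose_wh",
    "./build/test/tt_metal/test_transpose_hc",
    "./build/test/tt_eager/ops/test_transpose_wh_single_core",
    "./build/test/tt_eager/ops/test_transpose_wh_multi_core",
]


def run_one(cmd, timeout_s=300.0):
    """Run one command with the Watcher environment and classify the outcome.

    The timeout is an assumed value; a run that exceeds it is reported as a
    "hang", which is a guess about the cause rather than a diagnosis.
    """
    env = {**os.environ, **WATCHER_ENV}
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    except FileNotFoundError:
        return "missing"
    # A non-zero or signal exit (e.g. an uncaught exception) counts as "fail".
    return "pass" if result.returncode == 0 else "fail"


def main(tests):
    """Run each test in turn and print a one-line status per binary."""
    for test in tests:
        print(f"{test}: {run_one([test])}", flush=True)


if __name__ == "__main__":
    # Optional: pass binary paths on the command line to override the list.
    main(sys.argv[1:] or TRANSPOSE_TESTS)
```

Run it from the repository root after building. A test that aborts with the Watcher's `TT_THROW` shows up as "fail", and one that never returns shows up as "hang" once the timeout expires.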