tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Blackhole bringup] Dram interleaved hangs #9941

Open rtawfik01 opened 3 days ago

rtawfik01 commented 3 days ago

The following tests all hang in the same way on Blackhole:

TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_eltwise_binary_op 
TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_eltwise_unary_op 
TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_bcast_op 
TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_transpose_op 
TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_reduce_op 
TT_METAL_DPRINT_CORES=0,0 TT_METAL_WATCHER=5 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_eager/ops/test_bmm_op 

They all use interleaved DRAM accesses, and they all hang while attempting to read the first tile in the dataflow commands:

Dump #2 at 12.770s
Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): CWFW,NRBW,   R,   D,   D  rmsg:H1G|BNT smsg:GGGG k_ids:2|1|3

Branch to reproduce: rtawfik/bh-fix
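
For anyone reproducing this, the six commands above can be wrapped in a loop with a time limit, so that a hang is reported instead of blocking the shell. This is only a minimal sketch: `classify_run` and `run_all` are hypothetical helper names, the 120 second default is an arbitrary assumption, and it relies on GNU coreutils `timeout`, which exits with status 124 when the limit expires.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tt-metal): run a command under a time
# limit and print PASS, HANG, or FAIL(<exit status>).
classify_run() {
    local limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local status=$?
    case "$status" in
        0)   echo "PASS" ;;
        124) echo "HANG" ;;            # timeout(1) exits 124 when the limit expires
        *)   echo "FAIL($status)" ;;
    esac
}

# Test binaries listed in this issue.
tests=(
    test_eltwise_binary_op
    test_eltwise_unary_op
    test_bcast_op
    test_transpose_op
    test_reduce_op
    test_bmm_op
)

# Hypothetical driver: run every test with the environment used in this
# issue. The 120 second default limit is an assumption, not a measured value.
run_all() {
    local limit="${1:-120}"
    local t bin
    for t in "${tests[@]}"; do
        bin="./build/test/tt_eager/ops/$t"
        if [[ ! -x "$bin" ]]; then
            echo "$t: SKIP (binary not built)"
            continue
        fi
        echo "$t: $(classify_run "$limit" env \
            TT_METAL_DPRINT_CORES=0,0 \
            TT_METAL_WATCHER=5 \
            TT_METAL_SLOW_DISPATCH_MODE=1 \
            "$bin")"
    done
}
```

Sourcing the file and calling `run_all 120` from the repository root prints one line per test. A process that ignores the default SIGTERM would still need to be killed by hand.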

abhullar-tt commented 3 days ago

Fix is on branch abhullar/noc-header

abhullar-tt commented 3 days ago

Passing:
./build/test/tt_eager/ops/test_eltwise_binary_op 
./build/test/tt_eager/ops/test_bcast_op 
./build/test/tt_eager/ops/test_transpose_op 
./build/test/tt_eager/ops/test_reduce_op 
./build/test/tt_eager/ops/test_bmm_op 

Allclose check failing:
./build/test/tt_eager/ops/test_eltwise_unary_op