tt-asaigal opened this issue 5 days ago
Pushed a change to yugao/hang. With the barrier it can pass now, because the number of destinations set in the mcast matches the actual number of cores sent to. The reason is that when receiving responses from the NOC, I purposely set the number of cores to be less than the actual number of destinations (I was under the impression this could save some cycles). Fixing this does not solve the hang, as it shouldn't affect the following ops (the SW counter for the NOC is reset for each op).
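For context, here's a minimal sketch of the accounting being discussed (names and values are illustrative, not the actual kernel code): the destination count passed to the multicast write feeds the software ack counter, and if it doesn't match the number of cores that actually acknowledge the write, a later noc_async_write_barrier can spin forever waiting for the counts to line up.

```cpp
// Illustrative sketch only -- not the real reader kernel code.
// The num_dests argument is what the SW counter expects to see acknowledged;
// if it is set lower than the number of cores the mcast actually reaches,
// the HW ack count and the SW expectation disagree and a later barrier hangs.
uint32_t num_dests = in0_mcast_num_cores;  // must match the real receiver count (illustrative name)
uint64_t mcast_dest_addr = get_noc_multicast_addr(
    mcast_start_x, mcast_start_y, mcast_end_x, mcast_end_y, dest_l1_addr);  // receiver grid (placeholders)
noc_async_write_multicast(src_l1_addr, mcast_dest_addr, block_size_bytes, num_dests);
noc_async_write_barrier();  // waits until all expected write acks have arrived
```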
I'm seeing the same hang even with https://github.com/tenstorrent/tt-metal/commit/f3291fb52e6931b28ebd93e1e4571bb456b5b63c. The difference is that I've got the noc_async_write_barrier at the end of the kernel rather than in the middle, so it's possible the error is happening after the other barrier, or on a different code path.
@jbaumanTT Is the hang for the unit test or for llama? If it's the DRAM sharded unit test, could you please share your code changes so I can repro?
it's for llama (though it might affect the unit test as well). My change is https://github.com/tenstorrent/tt-metal/commit/455961e1101beaa03df0cbe7d6e3b72a37a63be3
@jbaumanTT I pushed a fix for the barrier error: the number of cores set for one type of mcast sender was wrong (it needs to be reduced by 1); a sketch of the shape of the fix is at the end of this comment.
However, while the tested kernel now passes, the dispatcher is stuck at noc_async_atomic_barrier, and the test hangs at waypoint "NABW".
I confirm that all the kernels for MM now have the barriers at the very end, and they all pass the barriers.
Could you also try on your end, to see whether it's a dispatcher issue or an interaction with MM? Thanks.
I pushed to yugao/hang
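Roughly, the shape of the fix is as follows (a sketch with illustrative names, not the actual diff): for the sender variant in question, the destination count has to exclude one core, so it is one less than the grid size used elsewhere.

```cpp
// Sketch of the kind of adjustment described above (illustrative, not the real diff).
uint32_t num_dests = in0_mcast_num_cores;   // total cores in the mcast grid (placeholder)
if (sender_excluded_from_acks) {            // hypothetical flag for the affected sender type
    num_dests -= 1;                         // this sender sees one ack fewer, so count one less dest
}
noc_async_write_multicast(src_l1_addr, mcast_dest_addr, block_size_bytes, num_dests);
```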
That NABW hang in the dispatcher seems to be what we're seeing in #15018. Given that the barriers at the end of this kernel appear to run correctly, it looks like your new patch fixes this DRAM sharded barrier issue, and the barrier problem in this kernel was probably unrelated to the overall hang we're seeing.
Describe the bug
Inserting a barrier inside matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_dram_sharded.cpp after a noc_async_write_multicast_loopback_src causes the DRAM Sharded Matmul tests to hang consistently. There is a possibility that this is causing LLAMA to hang (tracked here: https://github.com/tenstorrent/tt-metal/issues/15018).
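The change that triggers the hang looks roughly like this (a sketch of where the barrier goes; the argument names are placeholders rather than the kernel's real variables):

```cpp
// Sketch of the modification inside reader_bmm_tile_layout_in0_sender_dram_sharded.cpp.
noc_async_write_multicast_loopback_src(
    src_l1_addr,            // data staged in the sender's L1 (placeholder)
    mcast_dest_noc_addr,    // NOC multicast address covering the receiver grid (placeholder)
    block_size_bytes,       // bytes per multicast write (placeholder)
    num_mcast_dests);       // destination count for the multicast (placeholder)
noc_async_write_barrier();  // <-- newly inserted barrier; the NCRISC gets stuck waiting here
```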
To Reproduce
Please check out asaigal/dram_sharded_reader_hang and run:
pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_in1_dram_sharded_with_mm_chain
The test should hang immediately. NOC should report that at least one of the worker NCRISC cores is stuck at waypoint BWW, which corresponds to the barrier issued in the reader kernel.
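For reference, the waypoint string identifies which busy-wait the core is parked in: each barrier brackets its wait loop with watcher waypoint markers, so a core reported at a given waypoint never saw that wait condition become true. A rough illustration of the pattern (the flush-check helper and the exact strings here are approximations, not the actual tt-metal source):

```cpp
// Rough illustration of how a barrier surfaces a waypoint to the watcher.
// The helper name and waypoint strings are approximations.
inline void example_write_barrier() {
    WAYPOINT("BWW");                     // marker visible while the core is still spinning here
    while (!all_write_acks_received()) {
        // spin until the HW ack count matches the SW expectation
    }
    WAYPOINT("BWD");                     // marker recorded once the wait completes
}
```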