tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

[Bug Report] Write Barrier in DRAM Sharded MM Reader causing a hang #15279

Open tt-asaigal opened 5 days ago

tt-asaigal commented 5 days ago

Describe the bug
Inserting a barrier inside matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_dram_sharded.cpp after a noc_async_write_multicast_loopback_src causes the DRAM Sharded Matmul tests to hang consistently.
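
For orientation, here is a minimal sketch of the pattern being described, assuming the standard TT-Metalium dataflow API; the runtime-arg layout, variable names, and grid coordinates are placeholders, not the actual kernel's:

```cpp
#include "dataflow_api.h"

void kernel_main() {
    // Placeholder runtime args; the real kernel derives these differently.
    uint32_t in0_block_addr = get_arg_val<uint32_t>(0);  // L1 address of the in0 block
    uint32_t block_size     = get_arg_val<uint32_t>(1);  // bytes to multicast
    uint32_t num_dests      = get_arg_val<uint32_t>(2);  // receiver count the NOC is told about

    // Multicast destination range over the worker grid (placeholder coordinates).
    uint64_t mcast_dst_addr = get_noc_multicast_addr(1, 1, 4, 4, in0_block_addr);

    // Multicast the in0 block; the loopback_src variant also writes the
    // sender's own core.
    noc_async_write_multicast_loopback_src(
        in0_block_addr, mcast_dst_addr, block_size, num_dests);

    // The barrier in question: it spins (watcher waypoint BWW) until the
    // write-ack accounting settles. As discussed below, if num_dests does not
    // match the number of cores the NOC actually wrote, it never completes.
    noc_async_write_barrier();
}
```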

There is a possibility that this is causing LLAMA to hang (tracked here: https://github.com/tenstorrent/tt-metal/issues/15018).

To Reproduce
Please check out the asaigal/dram_sharded_reader_hang branch and run:

pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_in1_dram_sharded_with_mm_chain

The test should hang immediately. NOC should report that at least one of the worker NCRISC cores is stuck at waypoint BWW, which corresponds to the barrier issued in the reader kernel.

yugaoTT commented 4 days ago

Pushed a change to yugao/hang. With the barrier it can now pass, because the number of destinations set in the mcast matches the actual number of cores written. The reason is that when receiving responses from the NOC, I had purposely set the number of cores to be less than the actual number of destinations (I was under the impression that this could save some cycles). Fixing this does not solve the larger hang, though, as it shouldn't affect the following ops (the software counter for the NOC is reset for each op).
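
As an illustrative sketch of the mismatch described above (the numbers are made up, and the mechanism assumes the barrier compares acks received against a software-maintained expectation; the real kernel derives these counts from the core grid):

```cpp
// Say the mcast rectangle actually covers 8 receiver cores.
constexpr uint32_t num_cores_written = 8;

// Buggy: deliberately undercounting destinations to "save cycles". The NOC
// still returns one write ack per core actually written, so the hardware ack
// count overshoots what the software counter expects, and the equality check
// inside noc_async_write_barrier can never be satisfied (waypoint BWW).
uint32_t num_dests_buggy = num_cores_written - 2;

// Fixed: the dest count handed to the mcast matches the cores written.
uint32_t num_dests_fixed = num_cores_written;
```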

jbaumanTT commented 4 days ago

I'm seeing the same hang even with https://github.com/tenstorrent/tt-metal/commit/f3291fb52e6931b28ebd93e1e4571bb456b5b63c. The difference is that I've got the noc_async_write_barrier at the end of the kernel, not in the middle, so it's possible the error is happening after the other barrier, or on a different code path.

yugaoTT commented 4 days ago

@jbaumanTT Is the hang in the unit test or in LLAMA? If it's the DRAM sharded unit test, could you please share your code changes so I can repro?

jbaumanTT commented 4 days ago

It's for LLAMA (though it might affect the unit test as well). My change is https://github.com/tenstorrent/tt-metal/commit/455961e1101beaa03df0cbe7d6e3b72a37a63be3.

yugaoTT commented 4 days ago

@jbaumanTT I pushed a fix for the barrier error: the number of cores set for one type of mcast sender was wrong (it should be reduced by 1).
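
One plausible reading of the off-by-one, sketched with invented names (the actual fix is in yugao/hang):

```cpp
// If a sender uses the non-loopback mcast (its own core sits inside the mcast
// rectangle but is not written), the destination count must exclude it so the
// ack accounting matches the cores actually written.
uint32_t num_cores_in_grid = 8;              // cores covered by the mcast rectangle
uint32_t num_dests = num_cores_in_grid - 1;  // exclude the sender itself
```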

However, the tested kernel is now passing, but the dispatcher is stuck at noc_async_atomic_barrier, and the test hangs at waypoint "NABW".
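
For reference, NABW is the waypoint for noc_async_atomic_barrier. A minimal sketch of the kind of sequence that would spin there, with placeholder coordinates and semaphore address:

```cpp
// A remote semaphore increment is issued as a nonposted atomic; the barrier
// then waits for the atomic's ack. If the ack accounting is off, the RISC
// spins at waypoint NABW.
uint64_t sem_noc_addr = get_noc_addr(2, 3, 0x1000);  // (noc_x, noc_y, L1 addr) placeholders
noc_semaphore_inc(sem_noc_addr, 1);
noc_async_atomic_barrier();
```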

I confirmed that all the MM kernels now have three barriers at the very end, and they all get past them.

Could you also try this on your end, to see whether it's a dispatcher issue or an interaction with MM? Thanks.

I pushed the changes to yugao/hang.

jbaumanTT commented 3 days ago

That NABW hang in the dispatcher seems to be what we're seeing in #15018. Given that the barriers at the end of this kernel seem to be running correctly, it looks like your new patch fixes this DRAM sharded barrier issue, and the barrier problem in this kernel was probably unrelated to the overall hang we're seeing.