tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low-level kernel programming model.
Apache License 2.0

ND behavior with `test_matmul_1d_2d.py::test_multi_core_matmul_2d` #7168

Open TT-billteng opened 5 months ago

TT-billteng commented 5 months ago

The non-deterministic (ND) behavior occurs with the configuration M=1792, K=2048, N=2048, and isn't specific to either N150 or N300. The test either hangs or fails with a bad PCC. To reproduce:

pip install pytest-repeat
pytest --count=1000 --repeat-scope=session --timeout=120 -xv tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_1d_2d.py::test_multi_core_matmul_2d -k '[False-True-True-1792-2048-2048'
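For readers unfamiliar with the "bad PCC" failure mode mentioned above: PCC here is taken to mean the Pearson correlation coefficient between the device output and a golden reference. The following is a minimal pure-Python sketch of such a check, not the helper the test actually uses; the function names `pcc` and `passes_pcc`, the 0.99 threshold, and the sample data are all assumptions made for illustration.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(expected) != len(actual) or len(expected) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_e = sum(expected) / len(expected)
    mean_a = sum(actual) / len(actual)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_e * var_a)


def passes_pcc(expected, actual, threshold=0.99):
    """Hypothetical pass/fail check; the threshold value is an assumption."""
    return pcc(expected, actual) >= threshold


# Illustrative data only: a golden result, a close result, and a corrupted one.
golden = [1.0, 2.0, 3.0, 4.0]
good = [1.01, 1.98, 3.02, 4.0]
corrupted = [1.0, -7.0, 3.0, 0.5]

print(passes_pcc(golden, good))
print(passes_pcc(golden, corrupted))
```

A run that completes but produces corrupted output would fail a check like this, whereas a hang would instead be caught by the `--timeout=120` option in the command above.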

Some failing runs in CI:

https://github.com/tenstorrent-metal/tt-metal/actions/runs/8566991717/job/23477858954
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8572995474/job/23497194353
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8572833559/job/23496468681
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8566128457/job/23475414075

TT-billteng commented 4 months ago

This variant is still failing: https://github.com/tenstorrent/tt-metal/actions/runs/8789897729/job/24120880686

tt-rkim commented 4 months ago

We are skipping the device perf profiling test for this until it is fixed.

cc: @yugaoTT @davorchap

bbradelTT commented 2 months ago

@yugaoTT what's the current status of this issue?

yugaoTT commented 2 months ago

@bbradelTT The issue will persist as long as the di/dt problem hasn't been fixed.