tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low-level kernel programming model.
Apache License 2.0

Investigate ttnn resnet 50 batch_size 8 failure on fast dispatch nightly #8555

Open tt-rkim opened 1 month ago

tt-rkim commented 1 month ago

This has been failing for a couple of days:

FAILED tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[batch_size=8-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_l1_small_size=24576] - IndexError: _Map_base::at

Skipping this test for now.

@mywoodstock and @nsmithtt are in the know

tt-rkim commented 1 month ago

Re-running on a branch with @nsmithtt's re-enable commit, because I'm seeing this locally on my GS machine: https://github.com/tenstorrent/tt-metal/actions/runs/9131122193

nsmithtt commented 1 month ago

The error is coming from resharding from height to block, which is related to https://github.com/tenstorrent/tt-metal/issues/8260. With this fix, along with another fix similar to Vraj's (https://github.com/tenstorrent/tt-metal/issues/8462), the issue goes away.
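
For readers unfamiliar with the terms, here is a minimal pure-Python sketch of what going from a height-sharded layout to a block-sharded one involves. It is only an illustration under simplifying assumptions: it does not use ttnn, the helper names (`height_shards`, `block_shards`, `reshard_plan`) are hypothetical, and it does not reproduce the actual bug or either fix. Height sharding is modeled as one full-width band of rows per core, and block sharding as a 2D grid of tiles keyed by `(y, x)` core coordinates.

```python
from math import ceil


def height_shards(rows, cols, num_cores):
    """Hypothetical helper: one full-width band of rows per core.

    Returns {core_index: (row_start, row_end, col_start, col_end)}.
    """
    band = ceil(rows / num_cores)
    return {
        core: (core * band, min((core + 1) * band, rows), 0, cols)
        for core in range(num_cores)
        if core * band < rows
    }


def block_shards(rows, cols, grid_rows, grid_cols):
    """Hypothetical helper: a 2D grid of blocks, one per (y, x) core."""
    bh = ceil(rows / grid_rows)
    bw = ceil(cols / grid_cols)
    return {
        (y, x): (y * bh, min((y + 1) * bh, rows), x * bw, min((x + 1) * bw, cols))
        for y in range(grid_rows)
        for x in range(grid_cols)
        if y * bh < rows and x * bw < cols
    }


def reshard_plan(src, dst):
    """For each destination core, list the source cores whose shard overlaps it."""
    plan = {}
    for dst_core, (dr0, dr1, dc0, dc1) in dst.items():
        plan[dst_core] = [
            src_core
            for src_core, (sr0, sr1, sc0, sc1) in src.items()
            if sr0 < dr1 and dr0 < sr1 and sc0 < dc1 and dc0 < sc1
        ]
    return plan


# An 8x4 tensor: height-sharded over 4 cores, then block-sharded on a 2x2 grid.
src = height_shards(rows=8, cols=4, num_cores=4)
dst = block_shards(rows=8, cols=4, grid_rows=2, grid_cols=2)
plan = reshard_plan(src, dst)

for dst_core, src_cores in plan.items():
    print(dst_core, "reads from height shards", src_cores)
```

In this toy case every block-sharded core has to read from two height-sharded cores, so the reshard step needs a correct mapping between the two sets of core coordinates.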

tt-rkim commented 1 month ago

Sounds great. Let's close this issue once we get those two diffs in and a green pass.