tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Matmul 2d hang when subblock h and w > 1 on N300 #8665

Open s-jovic opened 3 months ago

s-jovic commented 3 months ago

There still seem to be hangs with matmuls with 8,8 grid size on N300 (the initial issue: https://github.com/tenstorrent/tt-metal/issues/7066), although much less frequently. Hang is not reproducible on N150.

The problematic matmul is used for FF1 in Falcon7b MLP.

Repro steps:

The test runs 5000 iterations of matmul. On a bgd-lab machine, the hang appears around the 2000th iteration; on a cloud machine, the hang appears earlier (around the 300th).

To run the test 5000 times with subblock h/w = 1 (to validate it passes), run the following command

mtairum commented 3 months ago

Added branch 8665-mixtral-ff2-hang with a similar matmul test that also hangs after many iterations.

This one specifically matches the shapes for FF2 in the Mixtral8x7b code with subblock_w=2. When running the full model, we have seen hangs with this config on the 2nd iteration. When running just the unit test below, the hang appears after more than 7000 iterations.

Program cache is enabled in the test. T3k mesh config is used.

Repro steps:

git checkout 8665-mixtral-ff2-hang
./build_metal.sh
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-hang]

The test runs 10000 iterations of matmul. On latest main on sjc-lab-t3002, I'm seeing the hang around iteration 7226.
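For anyone who wants to record the hanging iteration automatically rather than watch the log, here is a minimal sketch of a watchdog loop. It is not the repro test itself and does not call any tt-metal API: `first_hang_iteration` and `fake_matmul` are hypothetical names, and `fake_matmul` is only a stand-in that stalls on one chosen iteration so the loop can be exercised without hardware.

```python
import threading
import time


def first_hang_iteration(op, iterations, timeout_s):
    """Run op(i) for each i in range(iterations).

    Returns the first index whose call does not finish within timeout_s
    seconds, or None if every call finishes in time.
    """
    for i in range(iterations):
        # Daemon thread, so a call that never returns cannot block exit.
        worker = threading.Thread(target=op, args=(i,), daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            return i
    return None


def fake_matmul(i, hang_at=7):
    # Hypothetical stand-in for the device matmul: stalls on one iteration.
    if i == hang_at:
        time.sleep(1.0)


if __name__ == "__main__":
    print(first_hang_iteration(fake_matmul, iterations=20, timeout_s=0.2))
```

In a real run the `op` argument would wrap the actual matmul call, and the timeout would need to be well above the normal per-iteration time so that slow iterations are not reported as hangs.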

To run the test with subblock_w = 1, just use

pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-pass]

cglagovichTT commented 3 months ago

On sjc-snva-t3010, @mtairum's repro hangs on iterations 129, 130, and 129 when I run it three times in a row.

The test which passes on his machine,

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-pass]

fails on sjc-snva-t3010 after 1719 iterations.

sjc-snva-t3010 is a machine that we have seen reliably fail on tests that pass on other machines. fyi @davorchap

s-jovic commented 2 months ago

Another repro, already posted in the Slack channel, but posting here to track everything in one place:

$ git checkout sjovic/didt-with-stagger
$ pytest tests/didt/test_sharded_ff1.py

It's a single matmul loop that usually hangs on the first op. The hang is avoided if we apply a delay on each matmul block; delaying just the first block doesn't help.

bbradelTT commented 1 month ago

@s-jovic what is the status of inserting the 12k cycle delay for half the grid? And does that fix this issue?

s-jovic commented 1 month ago

We expect it to resolve the issues in combination with the upcoming firmware patch; however, this needs to be tested once the firmware is ready. Some more info here: https://github.com/tenstorrent/tt-metal/issues/9857.

bbradelTT commented 1 month ago

> We expect it to resolve the issues in combination with the upcoming firmware patch; however, this needs to be tested once the firmware is ready. Some more info here: #9857.

Thanks @s-jovic

abhullar-tt commented 1 month ago

> Another repro, already posted in the Slack channel, but posting here to track everything in one place:
>
> $ git checkout sjovic/didt-with-stagger
> $ pytest tests/didt/test_sharded_ff1.py
>
> It's a single matmul loop that usually hangs on the first op. The hang is avoided if we apply a delay on each matmul block; delaying just the first block doesn't help.

@s-jovic is the hang evident using the default loop count in the test?

s-jovic commented 1 month ago

@abhullar-tt yes, it is on problematic N300 chips; I'm adding 'problematic' since not all N300 boards exhibit this behavior.