Open s-jovic opened 3 months ago
Added branch 8665-mixtral-ff2-hang with a similar matmul test that also hangs after many iterations.
This one specifically matches the shapes for FF2 in the Mixtral8x7b code with subblock_w=2. When running the full model, we have seen hangs with this config on the 2nd iteration. When running just the unit test below, the hang appears after more than 7000 iterations.
Program cache is enabled in the test. T3k mesh config is used.
Repro steps:
git checkout 8665-mixtral-ff2-hang
./build_metal.sh
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-hang]
The test runs 10000 iterations of matmul. On latest main on sjc-lab-t3002, I'm seeing the hang around iteration 7226.
To run the test with subblock_w = 1, just use
pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-pass]
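For anyone who wants to record the exact iteration at which a run stalls, here is a minimal sketch of a watchdog loop. It is not the code in the test file: `run_until_hang` and `fake_matmul` are hypothetical names, and `fake_matmul` only simulates a blocked op with a sleep instead of calling the device.

```python
import threading
import time


def run_until_hang(op, iterations, timeout_s):
    """Run op(i) for i in range(iterations).

    Returns the first iteration that does not finish within timeout_s,
    or None if every iteration completes in time.
    """
    for i in range(iterations):
        # Run each iteration in a daemon thread so a stuck op cannot
        # keep the process alive after we give up on it.
        worker = threading.Thread(target=op, args=(i,), daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            return i  # op(i) did not return in time: treat it as a hang
    return None


def fake_matmul(i, hang_at=5):
    # Hypothetical stand-in for the device matmul: blocks on iteration hang_at.
    if i == hang_at:
        time.sleep(2.0)


if __name__ == "__main__":
    print(run_until_hang(fake_matmul, iterations=10, timeout_s=0.2))
```

Running the harness several times in a row and comparing the returned iteration gives the kind of per-machine numbers quoted in this thread.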
On sjc-snva-t3010, @mtairum's repro hangs on iterations 129, 130, and 129 when I run it three times in a row.
The test which passes on his machine,
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest models/demos/t3000/mixtral8x7b/tests/test_di_dt_mixtral_mlp_hang.py::test_mixtral_mlp_hang[wormhole_b0-True-ff2-pass]
fails on sjc-snva-t3010 after 1719 iterations.
sjc-snva-t3010 is a machine that we have seen reliably fail on tests which pass on other machines. FYI @davorchap
Another repro, already posted in the Slack channel, but posting here to track everything in one place:
$ git checkout sjovic/didt-with-stagger
$ pytest tests/didt/test_sharded_ff1.py
It's a single matmul loop that usually hangs on the first op. The hang is avoided if we apply a delay on each matmul block; delaying just the first block doesn't help.
@s-jovic what is the status of inserting the 12k cycle delay for half the grid? And does that fix this issue?
We expect it to resolve the issues in combination with the upcoming firmware patch, however this needs to be tested once the firmware is ready. Some more info here: https://github.com/tenstorrent/tt-metal/issues/9857.
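To make the "delay for half the grid" idea concrete, here is a rough sketch of building a per-core delay map. The thread does not say which half of the cores is delayed, so delaying the odd rows is purely an assumption, as are the 8x8 grid in the example and the helper name `stagger_delays`. The real delay is applied inside the matmul kernel, not in Python.

```python
def stagger_delays(grid_rows, grid_cols, delay_cycles):
    """Return {(row, col): delay} with delay_cycles on half of the cores."""
    delays = {}
    for r in range(grid_rows):
        for c in range(grid_cols):
            # Assumption: odd rows are delayed; the thread does not say which half.
            delays[(r, c)] = delay_cycles if r % 2 == 1 else 0
    return delays


# Example: an 8x8 grid with a 12k cycle delay on half of the cores.
delays = stagger_delays(8, 8, 12_000)
delayed_cores = sum(1 for d in delays.values() if d > 0)
```

With these inputs, 32 of the 64 cores start late, so the cores do not all begin compute in the same cycle.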
Thanks @s-jovic
@s-jovic is the hang evident using the default loop count in the test?
@abhullar-tt yes, it is on problematic n300 chips; adding 'problematic' since not all n300 boards exhibit this behavior.
There still seem to be hangs in matmuls with an 8,8 grid size on N300 (the initial issue: https://github.com/tenstorrent/tt-metal/issues/7066), although much less frequently. The hang is not reproducible on N150.
The problematic matmul is used for FF1 in Falcon7b MLP.
Repro steps:
sjovic/matmul-hang-repro
The test runs 5000 iterations of matmul. On a bgd-lab machine, the hang appears around the 2000th iteration; on a cloud machine, the hang appears earlier (around the 300th).
To run the test 5000 times with subblock h/w = 1 (to validate it passes), run the following command