mikevin920 opened 2 months ago
@aliuTT @ubcheema do you guys know what's going on with this one?
Can you check if the system has the correct Galaxy FW on it?
@mikevin920 can you post a github pipeline job with this error?
https://github.com/tenstorrent/tt-metal/actions/runs/10021434594/job/27700225744. Is this galaxy fw correct?
It doesn't print out the firmware there. Looks like it's using an older version of the reset script
@ttmchiou we should update the TGG, TG, and T3K reset scripts to print out SMI JSON information.
Otherwise, for this current work, someone will have to log in and check SMI. It's tough because the machine is in use all the time and we need to be careful with resets.
This is the local machine firmware I'm seeing consistent issues with:
This is the last stable Galaxy Fw release:
/mnt/motor/syseng/bin/tt-flash/wh/mobo/galaxy_fw_7.14.C.0_2024-07-04-00b2b9f7.tar.gz
It runs at 800 MHz with additional voltage margin to account for di/dt.
Tyler Colaco is working on a new release that will run Galaxy at 900 MHz.
@tt-rkim I am on vacation till 08/03. Please sync with Tyler on where you pick the release from.
It's better to just reprogram the machine that @mikevin920 is working on to make sure it has the correct FW package on it.
Do we know if this is a sharded DRAM matmul issue or a model/concat issue?
@mikevin920 did you guys want to try all the galaxy tests (including this one) on an upgraded galaxy in the corporate network (aus glx etc.)?
Yes we will try all unit tests on aus-glx
My local machine (aus-glx-07) has this tt-smi output:
I'm able to repro ND PCC on a branch based off main @ 374510603bfe4c52afbf34581b62c37da50e1576. My test was a Llama MLP unit test on TG. Running matmul_1d for 1000 iterations gave consistently good PCC. Replacing matmul_1d with the DRAM sharded matmul caused ~1/2 of the iterations to output bad PCC (close to 0).
I also reproed by running this test in a loop. The result was bad PCC every few iterations.
pytest -svv tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
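For reference, a minimal sketch of the loop (the pytest target is the one above; the iteration count and failure accounting are illustrative, not the exact script used):

```python
# Loop-repro sketch: re-invoke the pytest target repeatedly and count
# failing runs. Iteration count and reporting are illustrative only.
import subprocess

TEST = ("tests/ttnn/multichip_unit_tests/test_multidevice_TG.py"
        "::test_galaxy_matmul_2d_fracture_dram_sharded")

failures = 0
runs = 100
for i in range(runs):
    result = subprocess.run(["pytest", "-svv", TEST])
    if result.returncode != 0:
        failures += 1
        print(f"run {i}: bad PCC (pytest exit code {result.returncode})")
print(f"{failures}/{runs} runs failed")
```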
fyi @yugaoTT
Is bad PCC reproducible on N300/T3000 using the same DRAM sharded MM shape/config? Wondering if this is specific to Galaxy or a matmul bug?
Tried to repro on N300 single chip; saw deterministic PCC for 1000/1000 iterations.
Can I get a repro procedure? I'll run on my side to see what happens. @cglagovichTT is it only happening at iteration 1000?
^ I only tested for 1000 iterations on single-chip and all iterations passed
So far we have only seen ND on TG systems with 32 chips running the matmul.
Repro: Get a galaxy from IRD. aus-glx-06 is available now and has the right FW
pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
You'll have to remove the pytest.skip on that test locally - we disabled it in CI because of ND
What about running on 8 chips, on a T3K machine? Does that also pass?
I just ran and confirmed that running on 8 chips on a T3K machine passes
Is TG on 1, 2, 4, 8, 16 devices passing?
For first-level triage, a data collection step helps expedite debug across the teams. Can you fill this out:
https://docs.google.com/spreadsheets/d/1xzNCdWPc9-N3_RiEpSEkcGMFOapqBch46fxyoly7XnM/edit?usp=sharing
I will collect these repro steps together and post a branch. As you can see, any number of devices on Galaxy running a DRAM sharded matmul with my shapes has ND PCC.
Ran on 2 T3K machines and 1 N150 today, over 1 million iterations on each; no sign of non-determinism.
Could it be caused by DRAM sharding of the weight tensor? test_galaxy_matmul_2d_fracture_dram_sharded calls the DRAM sharded matmul version, but you can also try tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_2d_in1_dram_sharded, which calls the normal matmul_2d with a sharded in1 tensor.
@yugaoTT the ND for this matmul shape has only shown up with DRAM sharded weight tensors. We have seen no problems using matmul_1d for this matmul.
So we have two versions of matmul that both support DRAM sharded weights: the dedicated DRAM sharded matmul (MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig) and matmul_2d (MatmulMultiCoreReuseMultiCastProgramConfig).
I think your test is exercising the first one, right? The second one might also have problems if DRAM sharding is the cause.
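For concreteness, a sketch of how the two variants are selected through the program config passed to ttnn.matmul. Field names follow the ttnn Python API, but all values below are placeholders rather than the failing test's actual config:

```python
import ttnn

# Variant 1: dedicated DRAM sharded matmul. All values are placeholders.
dram_sharded_config = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=4,
    per_core_M=1,
    per_core_N=4,
    fused_activation=None,
)

# Variant 2: general matmul_2d, which can also take a DRAM sharded in1 tensor.
matmul_2d_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 8),
    in0_block_w=4,
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,
    per_core_N=4,
    transpose_mcast=False,
    fused_activation=None,
)

# Both go through the same entry point:
# out = ttnn.matmul(activations, weights, program_config=dram_sharded_config)
```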
Yes we are using the first program config in our tests
On TG:
I reproed ND PCC with the following configs:
I confirmed that the matmul used the expected kernel bmm_large_block_zm_fused_bias_activation.cpp, so this is a different issue than #10936. This is the Tracy capture for the failing matmul:
ops_perf_results_2024_08_14_17_17_25.csv
Running on smaller clusters than 4x8 makes the ND PCC less frequent, probably because the op is executed fewer times in total.
Example of ND PCC for config 2
Repro:
pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
Updated the branch to automatically sweep 4 types of matmuls:
I sweep these matmuls with two different input shapes which are used in Llama. 7/8 tests fail due to non-determinism.
Repro:
This repro will run these 8 combinations for 2000 iterations each. I expect 7/8 tests to fail due to ND PCC.
branch cglagovich/10673_repro
pytest -svv tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
Ran these same exact matmul shapes and configs on T3K (8 chips) for 1000 iterations each. All passed.
Update: I re-ran each config on T3K for 100k iterations each; everything was deterministic.
@TT-BrianLiu @yugaoTT I'm assigning to you for further triage/debug. LMK if you need more information from me
Retried on aus-glx-01 because we believed ND could be a machine-specific issue.
Set the clock to 500 MHz by removing the set_power_state call from tt_silicon_driver::deassert_resets_and_set_power_state. I re-ran the 4 previously failing tests:
500 MHz clock works around the ND PCC issues on aus-glx-01.
This sounds similar to an issue I hit a long time ago when debugging the di/dt problem: sometimes the PCC is garbage even when there is no hang.
Running some experiments (back at 900 MHz) to see if a specific device consistently produces ND output.
I modified the inputs such that activations are all ones and weights are random. Weights and activations are replicated across all 32 chips, so I can tell exactly which chip produced ND outputs by comparing to ground truth.
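Roughly, the per-chip check looks like this. A self-contained torch-only sketch: the shapes are illustrative, and the per-chip outputs (gathered from the devices in the real test) are simulated here by corrupting one weight element on one chip:

```python
import torch

torch.manual_seed(0)
act = torch.ones(32, 2048)           # activations: all ones
weights = torch.randn(2048, 4096)    # weights: random, replicated to every chip
expected = act @ weights             # ground truth, identical for every chip

# Stand-in for the outputs gathered back from the 32 chips; one corrupted
# chip is simulated by perturbing a single weight element before the matmul.
per_chip_outputs = {(r, c): expected.clone() for r in range(8) for c in range(4)}
bad_w = weights.clone()
bad_w[7, 123] += 0.25                # simulated single-element corruption
per_chip_outputs[(0, 1)] = act @ bad_w

for chip, out in per_chip_outputs.items():
    diff = out - expected
    bad_cols = torch.nonzero(diff.abs().amax(dim=0) > 0).flatten().tolist()
    for col in bad_cols:
        # With all-ones activations, every element of an affected output column
        # shifts by the same amount, so one (Index, Diff) pair summarizes it.
        print(f"Chip {chip} Index: (0, {col}), Diff: {diff[0, col].item()}")
```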
On another aus-glx-01 run @ 900 MHz, these are the stats I've collected across 1000 runs for the failing configs:
DRAM sharded matmul FF2
----------
Chip (0, 1) failed 4 times
Columns that failed:
Index: (0, 123), Diff: 0.25 failed 3 times
Index: (0, 155), Diff: 0.25 failed 1 times
----------
Chip (5, 0) failed 8 times
Columns that failed:
Index: (0, 3), Diff: 0.5 failed 5 times
Index: (0, 263), Diff: 0.25 failed 3 times
----------
Chip (4, 1) failed 10 times
Columns that failed:
Index: (0, 263), Diff: 0.25 failed 9 times
Index: (0, 3), Diff: 0.5 failed 1 times
----------
Chip (4, 3) failed 28 times
Columns that failed:
Index: (0, 591), Diff: 0.25 failed 14 times
Index: (0, 279), Diff: 0.25 failed 7 times
Index: (0, 263), Diff: 0.25 failed 4 times
Index: (0, 615), Diff: 0.5 failed 2 times
Index: (0, 455), Diff: 0.5 failed 1 times
----------
Chip (2, 0) failed 2 times
Columns that failed:
Index: (0, 455), Diff: 0.5 failed 2 times
----------
Chip (6, 1) failed 8 times
Columns that failed:
Index: (0, 3), Diff: 0.5 failed 6 times
Index: (0, 547), Diff: 0.5 failed 2 times
----------
Chip (7, 1) failed 1 times
Columns that failed:
Index: (0, 175), Diff: -0.5 failed 1 times
----------
Chip (0, 0) failed 4 times
Columns that failed:
Index: (0, 163), Diff: -inf failed 2 times
Index: (0, 171), Diff: -inf failed 1 times
Index: (0, 459), Diff: -inf failed 1 times
----------
Chip (6, 3) failed 1 times
Columns that failed:
Index: (0, 479), Diff: 0.5 failed 1 times
----------
Chip (4, 0) failed 1 times
Columns that failed:
Index: (0, 83), Diff: 0.5 failed 1 times
----------
FF1
----------
Chip (5, 0) failed 8 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 5 times
Index: (0, 71), Diff: 1.0 failed 2 times
Index: (0, 59), Diff: 0.5 failed 1 times
----------
Chip (4, 3) failed 130 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 130 times
----------
Chip (7, 1) failed 7 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 7 times
----------
Chip (0, 0) failed 2 times
Columns that failed:
Index: (0, 73), Diff: -0.25 failed 1 times
Index: (0, 77), Diff: -0.5 failed 1 times
----------
Chip (4, 1) failed 3 times
Columns that failed:
Index: (0, 71), Diff: 1.0 failed 2 times
Index: (0, 79), Diff: 0.5 failed 1 times
----------
Chip (5, 1) failed 4 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 4 times
----------
Chip (6, 1) failed 1 times
Columns that failed:
Index: (0, 81), Diff: 0.5 failed 1 times
----------
Another run for 2000 iterations:
dram sharded FF1
----------
Chip (4, 1) failed 43 times
Columns that failed:
Index: (0, 527), Diff: 0.5 failed 24 times
Index: (0, 167), Diff: 0.5 failed 18 times
Index: (0, 27), Diff: 0.25 failed 1 times
----------
----------
Chip (5, 0) failed 13 times
Columns that failed:
Index: (0, 527), Diff: 0.5 failed 10 times
Index: (0, 167), Diff: 0.5 failed 3 times
----------
----------
Chip (0, 1) failed 12 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 7 times
Index: (0, 459), Diff: 0.5 failed 4 times
Index: (0, 527), Diff: 0.5 failed 1 times
----------
----------
Chip (2, 0) failed 1 times
Columns that failed:
Index: (0, 595), Diff: 0.5 failed 1 times
----------
----------
Chip (4, 3) failed 4 times
Columns that failed:
Index: (0, 167), Diff: 0.5 failed 3 times
Index: (0, 559), Diff: 0.5 failed 1 times
----------
----------
Chip (6, 1) failed 5 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 2 times
Index: (0, 459), Diff: 0.5 failed 2 times
Index: (0, 387), Diff: 0.5 failed 1 times
----------
----------
Chip (0, 0) failed 10 times
Columns that failed:
Index: (0, 243), Diff: -2.0 failed 4 times
Index: (0, 203), Diff: -inf failed 4 times
Index: (0, 51), Diff: -inf failed 1 times
Index: (0, 107), Diff: -inf failed 1 times
----------
----------
Chip (4, 2) failed 2 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 2 times
----------
----------
Chip (7, 1) failed 1 times
Columns that failed:
Index: (0, 167), Diff: 0.5 failed 1 times
----------
----------
Chip (5, 1) failed 1 times
Columns that failed:
Index: (0, 459), Diff: 0.5 failed 1 times
----------
dram sharded FF2
----------
Chip (0, 0) failed 12 times
Columns that failed:
Index: (0, 17), Diff: -inf failed 6 times
Index: (0, 73), Diff: -1.0 failed 3 times
Index: (0, 81), Diff: -1.0 failed 1 times
Index: (0, 45), Diff: 4.0 failed 1 times
Index: (0, 33), Diff: -0.5 failed 1 times
----------
----------
Chip (6, 1) failed 3 times
Columns that failed:
Index: (0, 33), Diff: 0.5 failed 3 times
----------
----------
Chip (2, 1) failed 1 times
Columns that failed:
Index: (0, 107), Diff: 1.5 failed 1 times
----------
@yugaoTT and I are starting to believe that this ND PCC is caused by a bit flip during transmission of a tile from DRAM -> L1.
Evidence:
- ones @ rand produces an output where each column of the output is the sum of a column of the weights
- when expected and out differ (i.e., when we have ND), they differ on one or more columns, where every element of out[:, i] differs from expected[:, i] by the same amount
- the diff between expected and out is always a power of 2 or a sum of powers of 2

If some pins failed between DRAM and the chip (I assume a failed pin would always read 0), and the weights are positive numbers, then wouldn't the computed value always be smaller than the original?
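The power-of-2 deltas fit the bit-flip theory: flipping a single mantissa bit of a bfloat16 weight shifts its value by a signed power of 2 (scaled by the weight's exponent), and with all-ones activations that shift propagates unchanged to the whole output column. A quick self-contained sketch, treating bfloat16 as the top 16 bits of a float32:

```python
import struct

def flip_bf16_bit(x: float, bit: int) -> float:
    """Flip one bit of the bfloat16 encoding of x (bit 0 = mantissa LSB)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits ^= 1 << (16 + bit)   # bfloat16 bit N is float32 bit N + 16
    return struct.unpack(">f", struct.pack(">I", bits))[0]

w = 1.5
for bit in range(7):          # the 7 bfloat16 mantissa bits
    corrupted = flip_bf16_bit(w, bit)
    print(f"bit {bit}: {w} -> {corrupted}, diff = {corrupted - w}")
# Each diff is a signed power of 2; exponent-bit corruption produces larger
# jumps and the +/-inf deltas seen in the logs above.
```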
This is the data on the number of failures per chip for each of the 8 configurations I ran. There's a pattern where chips (3, 0) and (7, 0) are the most likely to fail across all cases. For DRAM sharded matmuls, there are 7 chips which fail >40% of the time. chip_failures.txt
I re-ran on only chip (0, 0) and got 2154 failures in 10k iterations. This failure does not only occur in 32-chip workloads...
Discussion continued in slack https://tenstorrent.slack.com/archives/C07384DMYJC/p1724331544384049
TL;DR: banks 0 and 1 (GDDR0) are the only banks we have seen fail. We separated compute from data movement and showed that the weights still get corrupted.
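The data-movement-only check amounts to round-tripping the weight tensor through device DRAM with no compute in between. A hedged sketch using the standard ttnn host APIs; the shape, device id, iteration count, and the DRAM sharding details (omitted here) are all assumptions:

```python
import torch
import ttnn

# Compute-free corruption check: write the weights to device DRAM, read them
# straight back, and diff against the host copy. The real check used the
# DRAM sharded layout; sharding config is omitted in this sketch.
device = ttnn.open_device(device_id=0)

host_weights = torch.randn(2048, 4096, dtype=torch.bfloat16)
for i in range(1000):
    dev_weights = ttnn.from_torch(
        host_weights, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
    )
    readback = ttnn.to_torch(dev_weights)
    if not torch.equal(readback, host_weights):
        bad = torch.nonzero(readback != host_weights)
        print(f"iteration {i}: {bad.shape[0]} corrupted elements")

ttnn.close_device(device)
```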
What's the current status of this issue? Has this been fixed in the meantime?
Alex Buck has been looking into this on the hardware side. Last I heard, he was able to repro data corruption in their own DRAM tests, and he has been able to repro my test as well. No workaround yet.
Latest info from Alex:
It looks like it might be a voltage-margin-related issue. Boosting one of the DRAM-related voltages mitigates the issue. I'm trying to test on our larger systems to make sure we don't break anything else prior to releasing a FW update.
When running DRAM-sharded matmuls on all 32 devices and then concatenating the results back on host, ND will produce garbage outputs.