mikevin920 opened 2 months ago
@aliuTT @ubcheema do you guys know what's going on with this one?
Can you check if the system has the correct Galaxy FW on it?
@mikevin920 can you post a github pipeline job with this error?
https://github.com/tenstorrent/tt-metal/actions/runs/10021434594/job/27700225744. Is this galaxy fw correct?
It doesn't print out the firmware there. Looks like it's using an older version of the reset script
@ttmchiou we should update the TGG, TG, and T3K reset scripts to print out SMI JSON information.
Otherwise, for this current work, someone will have to log in and check SMI. It's tough because the machine is in use all the time and we need to be careful with resets.
This is the local machine firmware I'm seeing consistent issues with:
This is the last stable Galaxy Fw release:
/mnt/motor/syseng/bin/tt-flash/wh/mobo/galaxy_fw_7.14.C.0_2024-07-04-00b2b9f7.tar.gz
It runs at 800 MHz with additional voltage margin to account for di/dt.
Tyler Colaco is working on a new release that will run Galaxy at 900 MHz.
@tt-rkim I am on vacation till 08/03. Please sync with Tyler on where you pick the release from.
It's better to just reprogram the machine that @mikevin920 is working on to make sure it has the correct FW package on it.
Do we know if this is a sharded DRAM matmul issue or a model/concat issue?
@mikevin920 did you guys want to try all the galaxy tests (including this one) on an upgraded galaxy in the corporate network (aus glx etc.)?
Yes we will try all unit tests on aus-glx
My local machine (aus-glx-07) has this tt-smi output:
I'm able to repro ND PCC on a branch based off main @ 374510603bfe4c52afbf34581b62c37da50e1576. My test was a Llama MLP unit test on TG. Running matmul_1d for 1000 iterations gave consistently good PCC. Replacing matmul_1d with the DRAM sharded matmul caused ~1/2 of the iterations to output bad PCC (close to 0).
I also reproed by running this test in a loop. The result was bad PCC every few iterations.
pytest -svv tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
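For reference, a minimal sketch of the loop (the pytest target is the one above; the iteration count and failure accounting are illustrative, not the exact script used):

```python
# Loop-repro sketch: re-invoke the pytest target repeatedly and count
# failing runs. Iteration count and reporting are illustrative only.
import subprocess

TEST = ("tests/ttnn/multichip_unit_tests/test_multidevice_TG.py"
        "::test_galaxy_matmul_2d_fracture_dram_sharded")

failures = 0
runs = 100
for i in range(runs):
    result = subprocess.run(["pytest", "-svv", TEST])
    if result.returncode != 0:
        failures += 1
        print(f"run {i}: bad PCC (pytest exit code {result.returncode})")
print(f"{failures}/{runs} runs failed")
```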
fyi @yugaoTT
Is bad PCC reproducible on N300/T3000 using the same DRAM sharded MM shape/config? Wondering if this is specific to Galaxy or a matmul bug?
Tried to repro on N300 single chip; saw deterministic PCC for 1000/1000 iterations.
Can I get a repro procedure? I'll run on my side to see what happens. @cglagovichTT is it only happening at iteration 1000?
^ I only tested for 1000 iterations on single-chip and all iterations passed
So far we have only seen ND on TG systems with 32 chips running the matmul.
Repro: Get a galaxy from IRD. aus-glx-06 is available now and has the right FW
pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
You'll have to remove the pytest.skip on that test locally - we disabled it in CI because of ND
What about running on 8 chips, on a T3K machine? Does that also pass?
I just ran and confirmed that running on 8 chips on a T3K machine passes
Is TG on 1, 2, 4, 8, 16 devices passing?
For first-level triage, a data collection step helps expedite debug across the teams. Can you fill this out:
https://docs.google.com/spreadsheets/d/1xzNCdWPc9-N3_RiEpSEkcGMFOapqBch46fxyoly7XnM/edit?usp=sharing
I will collect these repro steps together and post a branch. As you can see, any number of devices on Galaxy running a DRAM sharded matmul with my shapes has ND PCC.
Ran on 2 T3K machines and 1 N150 today, over 1 million iterations on each; no sign of non-determinism.
Could it be caused by DRAM sharding of the weight tensor? test_galaxy_matmul_2d_fracture_dram_sharded calls the DRAM sharded matmul version, but you can also try tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py::test_matmul_2d_in1_dram_sharded, which calls the normal matmul_2d with a sharded in1 tensor.
@yugaoTT the ND for this matmul shape has only shown up with DRAM sharded weight tensors. We have seen no problems using matmul_1d for this matmul.
So we have two versions of matmul that both support DRAM sharded weights: the dedicated DRAM sharded matmul (MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig) and matmul_2d (MatmulMultiCoreReuseMultiCastProgramConfig).
I think your test is exercising the first one, right? The second one might also have problems if DRAM sharding is the cause.
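For concreteness, a sketch of how the two variants are selected through the program config passed to ttnn.matmul. Field names follow the ttnn Python API, but all values below are placeholders rather than the failing test's actual config:

```python
import ttnn

# Variant 1: dedicated DRAM sharded matmul. All values are placeholders.
dram_sharded_config = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=4,
    per_core_M=1,
    per_core_N=4,
    fused_activation=None,
)

# Variant 2: general matmul_2d, which can also take a DRAM sharded in1 tensor.
matmul_2d_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 8),
    in0_block_w=4,
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,
    per_core_N=4,
    transpose_mcast=False,
    fused_activation=None,
)

# Both go through the same entry point:
# out = ttnn.matmul(activations, weights, program_config=dram_sharded_config)
```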
Yes we are using the first program config in our tests
On TG:
I reproed ND PCC with the following configs:
I confirmed that the matmul used the expected kernel bmm_large_block_zm_fused_bias_activation.cpp, so this is a different issue than #10936. This is the Tracy capture for the failing matmul:
ops_perf_results_2024_08_14_17_17_25.csv
Running on smaller clusters than 4x8 makes the ND PCC less frequent, probably because the op is executed fewer times in total.
Example of ND PCC for config 2
Repro:
pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
Updated the branch to automatically sweep 4 types of matmuls:
I sweep these matmuls with two different input shapes which are used in Llama. 7/8 tests fail due to non-determinism.
Repro:
This repro will run these 8 combinations for 2000 iterations each. I expect 7/8 tests to fail due to ND PCC.
branch cglagovich/10673_repro
pytest -svv tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture_dram_sharded
Ran these same exact matmul shapes and configs on T3K (8 chips) for 1000 iterations each. All passed.
Update: I re-ran each config on T3K for 100k iterations each; everything was deterministic.
@TT-BrianLiu @yugaoTT I'm assigning to you for further triage/debug. LMK if you need more information from me
Retried on aus-glx-01 because we believed ND could be a machine-specific issue.
Set the clock to 500 MHz by removing the set_power_state call from tt_silicon_driver::deassert_resets_and_set_power_state. I re-ran the 4 previously failing tests:
500 MHz clock works around the ND PCC issues on aus-glx-01.
This sounds similar to an issue I hit a long time ago when debugging the di/dt problem: sometimes the PCC is garbage even when there is no hang.
Running some experiments (back at 900 MHz) to see if a specific device consistently produces ND output.
I modified the inputs such that activations are all ones and weights are random. Weights and activations are replicated across all 32 chips, so I can tell exactly which chip produced ND outputs by comparing to ground truth.
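Roughly, the per-chip check looks like this. A self-contained torch-only sketch: the shapes are illustrative, and the per-chip outputs (gathered from the devices in the real test) are simulated here by corrupting one weight element on one chip:

```python
import torch

torch.manual_seed(0)
act = torch.ones(32, 2048)           # activations: all ones
weights = torch.randn(2048, 4096)    # weights: random, replicated to every chip
expected = act @ weights             # ground truth, identical for every chip

# Stand-in for the outputs gathered back from the 32 chips; one corrupted
# chip is simulated by perturbing a single weight element before the matmul.
per_chip_outputs = {(r, c): expected.clone() for r in range(8) for c in range(4)}
bad_w = weights.clone()
bad_w[7, 123] += 0.25                # simulated single-element corruption
per_chip_outputs[(0, 1)] = act @ bad_w

for chip, out in per_chip_outputs.items():
    diff = out - expected
    bad_cols = torch.nonzero(diff.abs().amax(dim=0) > 0).flatten().tolist()
    for col in bad_cols:
        # With all-ones activations, every element of an affected output column
        # shifts by the same amount, so one (Index, Diff) pair summarizes it.
        print(f"Chip {chip} Index: (0, {col}), Diff: {diff[0, col].item()}")
```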
On another aus-glx-01 run @ 900 MHz, these are the stats I've collected across 1000 runs for the failing configs:
DRAM sharded matmul FF2
----------
Chip (0, 1) failed 4 times
Columns that failed:
Index: (0, 123), Diff: 0.25 failed 3 times
Index: (0, 155), Diff: 0.25 failed 1 times
----------
Chip (5, 0) failed 8 times
Columns that failed:
Index: (0, 3), Diff: 0.5 failed 5 times
Index: (0, 263), Diff: 0.25 failed 3 times
----------
Chip (4, 1) failed 10 times
Columns that failed:
Index: (0, 263), Diff: 0.25 failed 9 times
Index: (0, 3), Diff: 0.5 failed 1 times
----------
Chip (4, 3) failed 28 times
Columns that failed:
Index: (0, 591), Diff: 0.25 failed 14 times
Index: (0, 279), Diff: 0.25 failed 7 times
Index: (0, 263), Diff: 0.25 failed 4 times
Index: (0, 615), Diff: 0.5 failed 2 times
Index: (0, 455), Diff: 0.5 failed 1 times
----------
Chip (2, 0) failed 2 times
Columns that failed:
Index: (0, 455), Diff: 0.5 failed 2 times
----------
Chip (6, 1) failed 8 times
Columns that failed:
Index: (0, 3), Diff: 0.5 failed 6 times
Index: (0, 547), Diff: 0.5 failed 2 times
----------
Chip (7, 1) failed 1 times
Columns that failed:
Index: (0, 175), Diff: -0.5 failed 1 times
----------
Chip (0, 0) failed 4 times
Columns that failed:
Index: (0, 163), Diff: -inf failed 2 times
Index: (0, 171), Diff: -inf failed 1 times
Index: (0, 459), Diff: -inf failed 1 times
----------
Chip (6, 3) failed 1 times
Columns that failed:
Index: (0, 479), Diff: 0.5 failed 1 times
----------
Chip (4, 0) failed 1 times
Columns that failed:
Index: (0, 83), Diff: 0.5 failed 1 times
----------
FF1
----------
Chip (5, 0) failed 8 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 5 times
Index: (0, 71), Diff: 1.0 failed 2 times
Index: (0, 59), Diff: 0.5 failed 1 times
----------
Chip (4, 3) failed 130 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 130 times
----------
Chip (7, 1) failed 7 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 7 times
----------
Chip (0, 0) failed 2 times
Columns that failed:
Index: (0, 73), Diff: -0.25 failed 1 times
Index: (0, 77), Diff: -0.5 failed 1 times
----------
Chip (4, 1) failed 3 times
Columns that failed:
Index: (0, 71), Diff: 1.0 failed 2 times
Index: (0, 79), Diff: 0.5 failed 1 times
----------
Chip (5, 1) failed 4 times
Columns that failed:
Index: (0, 79), Diff: 0.5 failed 4 times
----------
Chip (6, 1) failed 1 times
Columns that failed:
Index: (0, 81), Diff: 0.5 failed 1 times
----------
Another run for 2000 iterations:
dram sharded FF1
----------
Chip (4, 1) failed 43 times
Columns that failed:
Index: (0, 527), Diff: 0.5 failed 24 times
Index: (0, 167), Diff: 0.5 failed 18 times
Index: (0, 27), Diff: 0.25 failed 1 times
----------
----------
Chip (5, 0) failed 13 times
Columns that failed:
Index: (0, 527), Diff: 0.5 failed 10 times
Index: (0, 167), Diff: 0.5 failed 3 times
----------
----------
Chip (0, 1) failed 12 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 7 times
Index: (0, 459), Diff: 0.5 failed 4 times
Index: (0, 527), Diff: 0.5 failed 1 times
----------
----------
Chip (2, 0) failed 1 times
Columns that failed:
Index: (0, 595), Diff: 0.5 failed 1 times
----------
----------
Chip (4, 3) failed 4 times
Columns that failed:
Index: (0, 167), Diff: 0.5 failed 3 times
Index: (0, 559), Diff: 0.5 failed 1 times
----------
----------
Chip (6, 1) failed 5 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 2 times
Index: (0, 459), Diff: 0.5 failed 2 times
Index: (0, 387), Diff: 0.5 failed 1 times
----------
----------
Chip (0, 0) failed 10 times
Columns that failed:
Index: (0, 243), Diff: -2.0 failed 4 times
Index: (0, 203), Diff: -inf failed 4 times
Index: (0, 51), Diff: -inf failed 1 times
Index: (0, 107), Diff: -inf failed 1 times
----------
----------
Chip (4, 2) failed 2 times
Columns that failed:
Index: (0, 27), Diff: 0.25 failed 2 times
----------
----------
Chip (7, 1) failed 1 times
Columns that failed:
Index: (0, 167), Diff: 0.5 failed 1 times
----------
----------
Chip (5, 1) failed 1 times
Columns that failed:
Index: (0, 459), Diff: 0.5 failed 1 times
----------
dram sharded FF2
----------
Chip (0, 0) failed 12 times
Columns that failed:
Index: (0, 17), Diff: -inf failed 6 times
Index: (0, 73), Diff: -1.0 failed 3 times
Index: (0, 81), Diff: -1.0 failed 1 times
Index: (0, 45), Diff: 4.0 failed 1 times
Index: (0, 33), Diff: -0.5 failed 1 times
----------
----------
Chip (6, 1) failed 3 times
Columns that failed:
Index: (0, 33), Diff: 0.5 failed 3 times
----------
----------
Chip (2, 1) failed 1 times
Columns that failed:
Index: (0, 107), Diff: 1.5 failed 1 times
----------
@yugaoTT and I are starting to believe that this ND PCC is caused by a bit flip during transmission of a tile from DRAM -> L1.
Evidence:
- ones @ rand produces an output where each column of the output is the sum of a column of the weights
- when expected and out differ (i.e., when we have ND), they differ on one or more columns, where every element of out[:, i] differs from expected[:, i] by the same amount
- the diff between expected and out is always a power of 2 or a sum of powers of 2

If some pins failed between DRAM and the chip (I assume a failed pin would always read 0), and the weights are positive numbers, then wouldn't the computed value always be smaller than the original?
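The power-of-2 deltas fit the bit-flip theory: flipping a single mantissa bit of a bfloat16 weight shifts its value by a signed power of 2 (scaled by the weight's exponent), and with all-ones activations that shift propagates unchanged to the whole output column. A quick self-contained sketch, treating bfloat16 as the top 16 bits of a float32:

```python
import struct

def flip_bf16_bit(x: float, bit: int) -> float:
    """Flip one bit of the bfloat16 encoding of x (bit 0 = mantissa LSB)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits ^= 1 << (16 + bit)   # bfloat16 bit N is float32 bit N + 16
    return struct.unpack(">f", struct.pack(">I", bits))[0]

w = 1.5
for bit in range(7):          # the 7 bfloat16 mantissa bits
    corrupted = flip_bf16_bit(w, bit)
    print(f"bit {bit}: {w} -> {corrupted}, diff = {corrupted - w}")
# Each diff is a signed power of 2; exponent-bit corruption produces larger
# jumps and the +/-inf deltas seen in the logs above.
```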
This is the data on the number of failures per chip for each of the 8 configurations I ran. There's a pattern where chips (3, 0) and (7, 0) are the most likely to fail across all cases. For DRAM sharded matmuls, there are 7 chips which fail >40% of the time. chip_failures.txt
I re-ran on only chip (0, 0) and got 2154 failures in 10k iterations. This failure does not only occur in 32-chip workloads...
Discussion continued in slack https://tenstorrent.slack.com/archives/C07384DMYJC/p1724331544384049
TL;DR: banks 0 and 1 (GDDR0) are the only banks we have seen fail. We separated compute from data movement and showed that the weights still get corrupted.
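The data-movement-only check amounts to round-tripping the weight tensor through device DRAM with no compute in between. A hedged sketch using the standard ttnn host APIs; the shape, device id, iteration count, and the DRAM sharding details (omitted here) are all assumptions:

```python
import torch
import ttnn

# Compute-free corruption check: write the weights to device DRAM, read them
# straight back, and diff against the host copy. The real check used the
# DRAM sharded layout; sharding config is omitted in this sketch.
device = ttnn.open_device(device_id=0)

host_weights = torch.randn(2048, 4096, dtype=torch.bfloat16)
for i in range(1000):
    dev_weights = ttnn.from_torch(
        host_weights, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
    )
    readback = ttnn.to_torch(dev_weights)
    if not torch.equal(readback, host_weights):
        bad = torch.nonzero(readback != host_weights)
        print(f"iteration {i}: {bad.shape[0]} corrupted elements")

ttnn.close_device(device)
```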
What's the current status of this issue? Has this been fixed in the meantime?
Alex Buck has been looking into this on the hardware side. Last I heard, he was able to repro data corruption in their own DRAM tests, and he has been able to repro my test as well. No workaround yet.
Latest info from Alex:
It looks like it might be a voltage-margin-related issue. Boosting one of the DRAM-related voltages mitigates the issue. I'm trying to test on our larger systems to make sure we don't break anything else prior to releasing a FW update.
When running DRAM-sharded matmuls on all 32 devices and then concatenating the results back on host, ND will produce garbage outputs.