tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low-level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Test different grid sizes of matmuls #11289

Open ttmtrajkovic opened 3 months ago

ttmtrajkovic commented 3 months ago

Vary the grid sizes of the failing matmuls: 8x8, 7x8, 6x8, 5x8, 4x8, 2x8, 1x8. Make sure that, for every variation, each core performs the same amount of compute as in the baseline 8x8 case.

For every grid size, test:

  1. Test with latest FW 8.10.0.0 (1GHz, w/LL)
  2. Test with latest FW without LL, no margin

Milos
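
The request above amounts to a small test matrix: seven grid sizes crossed with two firmware configurations. A minimal sketch of enumerating it is shown here; the labels `fw_ll` and `fw_no_ll` are hypothetical names chosen for illustration and are not real firmware identifiers.

```python
from itertools import product

# Grid sizes listed in the request above.
GRID_SIZES = ["8x8", "7x8", "6x8", "5x8", "4x8", "2x8", "1x8"]

# Hypothetical labels for the two firmware configurations in the request.
FW_CONFIGS = {
    "fw_ll": "FW 8.10.0.0, 1GHz, with LL",
    "fw_no_ll": "latest FW without LL, no margin",
}


def build_test_matrix(grids=GRID_SIZES, fw_configs=FW_CONFIGS):
    """Return every (grid size, firmware label) pair to be tested."""
    return list(product(grids, fw_configs))


if __name__ == "__main__":
    for grid, fw in build_test_matrix():
        print(f"{grid:>4}  {fw}: {FW_CONFIGS[fw]}")
```

This gives 14 runs per failing matmul, which makes it easy to see whether any combination has been skipped.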

s-jovic commented 3 months ago

branch: sjovic/rebased_didt_tests_grid_size (based on ppopovic/rebased_didt_tests)

commands:

LM head: $ pytest models/demos/falcon7b/tests/test_falcon_hang.py -k "test_grid_size and 8x7 and 2chips"
FF1 no GELU: $ pytest models/experimental/falcon_7b/tests/test_reproduce_hang_matmul.py -k "test_grid_size and 8x7 and ff1-hang and 2chips"
FF1 with GELU: $ pytest tests/didt/test_sharded_ff1.py -k "test_grid_size and 8x7 and 2chips"
...

You can choose grid size from the following list: ["1x8", "2x8", "3x8", "4x8", "5x8", "6x8", "7x8", "8x8", "8x7", "8x6", "8x5", "8x4", "8x3", "8x2", "8x1"].

For the 1D matmul (LM head), reducing either grid dimension reduces the width of the weights accordingly. For the 2D matmul (FF1), reducing the x dimension reduces the width of the weights, while reducing the y dimension reduces the height of the activations.
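
The scaling rule described above can be sketched as follows. This is an illustration only: it assumes a grid string reads as `XxY`, that the baseline is the full 8x8 grid, and that the reduction is proportional so each core keeps the same share of work as in the baseline. The function names and example shapes are hypothetical and are not taken from the test code on the branch.

```python
BASE_X, BASE_Y = 8, 8  # baseline grid from the thread


def parse_grid(grid):
    """Split a grid string such as "8x7" into (x, y). Assumes "XxY" order."""
    x, y = grid.lower().split("x")
    return int(x), int(y)


def scale_1d(weights_width, grid):
    """1D matmul (LM head): weights width shrinks with the total core count."""
    x, y = parse_grid(grid)
    scaled = weights_width * x * y
    base_cores = BASE_X * BASE_Y
    if scaled % base_cores:
        raise ValueError("weights width does not divide evenly for this grid")
    return scaled // base_cores


def scale_2d(act_height, weights_width, grid):
    """2D matmul (FF1): x scales the weights width, y scales the activations height."""
    x, y = parse_grid(grid)
    if (weights_width * x) % BASE_X or (act_height * y) % BASE_Y:
        raise ValueError("shape does not divide evenly for this grid")
    return act_height * y // BASE_Y, weights_width * x // BASE_X


if __name__ == "__main__":
    # Hypothetical baseline shapes, chosen only to make the arithmetic visible.
    height, width = 1024, 4096
    for grid in ["8x8", "8x6", "6x8"]:
        x, y = parse_grid(grid)
        h, w = scale_2d(height, width, grid)
        # The per-core share (h / y) x (w / x) is the same for every grid.
        print(grid, "1D width:", scale_1d(width, grid),
              "2D shape:", (h, w), "per core:", (h // y, w // x))
```

With these example shapes, every grid leaves each core with the same 2D tile and the same slice of the 1D weights, which is the invariant the original request asks for.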

Results from t3k testing

s-jovic commented 2 months ago

Conclusions

TLDR: A grid of 48 cores (8x6 or 6x8) proved to be a candidate workaround for didt hangs, with one outlier that we are currently reluctant to classify as a didt hang. Once we get new firmware, we will repeat the experiments and investigate the source of that hang.

DETAILS

Experiment: run 1 million iterations of every repro matmul, sweeping the following grid sizes: [8x8, 8x7, 7x8, 8x6, 6x8, 8x5, 5x8]
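
A sweep like this could be driven by a small script that builds one `-k` filter per grid size, following the pattern of the commands posted earlier in this thread. This is only a sketch: it assumes the test IDs contain the grid string exactly as in those commands, uses the LM head test path as the example, and prints the commands rather than running them. How the iteration count is set is not shown in the thread, so it is left out.

```python
import shlex

# Grid sizes from the experiment description above.
SWEEP_GRIDS = ["8x8", "8x7", "7x8", "8x6", "6x8", "8x5", "5x8"]

# Test path copied from the LM head command earlier in the thread.
LM_HEAD_TEST = "models/demos/falcon7b/tests/test_falcon_hang.py"


def core_count(grid):
    """Number of cores in a grid string such as "8x6" (8 * 6 = 48)."""
    a, b = grid.split("x")
    return int(a) * int(b)


def pytest_command(test_path, grid, chips="2chips"):
    """Build the pytest invocation as an argument list; nothing is executed here."""
    k_filter = f"test_grid_size and {grid} and {chips}"
    return ["pytest", test_path, "-k", k_filter]


if __name__ == "__main__":
    for grid in SWEEP_GRIDS:
        cmd = pytest_command(LM_HEAD_TEST, grid)
        print(f"# {core_count(grid)} cores")
        print(shlex.join(cmd))
```

Printing the core count next to each command makes it easy to line up results for grids with the same number of cores, such as 8x6 and 6x8.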

Firmware: 80.10.3 vs 80.10.2. The newer version (80.10.3) handles some ARC readout bugs.

N150:

N300: