Open ttmtrajkovic opened 3 months ago
branch: sjovic/rebased_didt_tests_grid_size (based on ppopovic/rebased_didt_tests)
commands:
LM head: $ pytest models/demos/falcon7b/tests/test_falcon_hang.py -k "test_grid_size and 8x7 and 2chips"
FF1 no GELU: $ pytest models/experimental/falcon_7b/tests/test_reproduce_hang_matmul.py -k "test_grid_size and 8x7 and ff1-hang and 2chips"
FF1 with GELU: $ pytest tests/didt/test_sharded_ff1.py -k "test_grid_size and 8x7 and 2chips"
...
You can choose grid size from the following list: ["1x8", "2x8", "3x8", "4x8", "5x8", "6x8", "7x8", "8x8", "8x7", "8x6", "8x5", "8x4", "8x3", "8x2", "8x1"].
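To sweep several grid sizes without retyping the command, the `-k` expression can be generated from the list above. This is a minimal sketch that reuses the FF1 with GELU test path and the `test_grid_size and <grid> and 2chips` pattern from the commands in this issue; the helper names `core_count` and `pytest_command` are made up for illustration and are not part of the repo.

```python
# Grid sizes listed in this issue. "AxB" is parsed as two integers
# whose product is the number of cores in the grid.
GRID_SIZES = ["1x8", "2x8", "3x8", "4x8", "5x8", "6x8", "7x8", "8x8",
              "8x7", "8x6", "8x5", "8x4", "8x3", "8x2", "8x1"]

# Test file of the "FF1 with GELU" repro, copied from the commands above.
TEST_FILE = "tests/didt/test_sharded_ff1.py"


def core_count(grid: str) -> int:
    """Return the number of cores in a grid given as 'AxB'."""
    a, b = grid.split("x")
    return int(a) * int(b)


def pytest_command(grid: str, chips: str = "2chips") -> str:
    """Build the pytest invocation for one grid size (hypothetical helper)."""
    if grid not in GRID_SIZES:
        raise ValueError(f"unsupported grid size: {grid}")
    return f'pytest {TEST_FILE} -k "test_grid_size and {grid} and {chips}"'


if __name__ == "__main__":
    for grid in GRID_SIZES:
        print(f"{core_count(grid):2d} cores: {pytest_command(grid)}")
```

The printed core count makes it easy to see which grids are equivalent in size, e.g. 8x6 and 6x8 both use 48 cores.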
For the 1d matmul (LM head), reducing either grid dimension reduces the width of the weights accordingly. For the 2d matmul (FF1), reducing the x dimension reduces the width of the weights, while reducing the y dimension reduces the height of the activations.
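The scaling rule above can be sketched as follows. Several things here are assumptions and not taken from the tests: the baseline shape is a made-up placeholder, "AxB" is read as x=A and y=B (the issue does not state the order), and "accordingly" is taken to mean proportional to the core count for the 1d case.

```python
from dataclasses import dataclass

FULL_GRID = 8  # the baseline grid is 8x8


@dataclass(frozen=True)
class MatmulShape:
    m: int  # activation height
    k: int  # inner (reduction) dimension
    n: int  # weight width


# Hypothetical baseline shape for the 8x8 grid, for illustration only.
BASE = MatmulShape(m=1024, k=4096, n=16384)


def parse_grid(grid: str) -> tuple[int, int]:
    # Assumption: "AxB" means x=A, y=B.
    x, y = (int(v) for v in grid.split("x"))
    return x, y


def scale_1d(base: MatmulShape, grid: str) -> MatmulShape:
    """1d matmul (LM head): weight width shrinks with the total core count."""
    x, y = parse_grid(grid)
    return MatmulShape(base.m, base.k, base.n * x * y // (FULL_GRID * FULL_GRID))


def scale_2d(base: MatmulShape, grid: str) -> MatmulShape:
    """2d matmul (FF1): x scales weight width, y scales activation height."""
    x, y = parse_grid(grid)
    return MatmulShape(base.m * y // FULL_GRID, base.k, base.n * x // FULL_GRID)


if __name__ == "__main__":
    for grid in ["8x8", "8x6", "6x8"]:
        print(grid, scale_1d(BASE, grid), scale_2d(BASE, grid))
```

Under these assumptions, 8x6 and 6x8 give the same 1d shape but different 2d shapes, since the 2d case shrinks a different operand depending on which dimension is reduced.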
Conclusions
TLDR: Grid size 48 (8x6 or 6x8) appears to be a candidate workaround for didt hangs, with one outlier that we are currently reluctant to classify as a didt hang. Once we get new firmware, we will repeat the experiments and investigate the source of that hang.
DETAILS
Experiment: run 1 million iterations of all repro matmuls, sweeping the following grid sizes: [8x8, 8x7, 7x8, 8x6, 6x8, 8x5, 5x8]
Firmware 80.10.3 vs 80.10.2 - the newer one handles some ARC readout bugs.
N150:
N300
sjc-snva-t3005
- The experiment was run on all 8 devices simultaneously, and all tests up to and including grid size 48 (8x6 or 6x8) pass.
bgd-lab-25
- One board hung on the LM head 6x8 test, after which the remote chip was no longer visible. We suspect this is not a didt issue, and we will repeat the experiments on this machine with new firmware that fixes more ARC bugs (80.10.4).
- On the second board, all tests up to and including grid size 56 (8x7 or 7x8) pass.
Vary grid sizes of failing matmuls: 8x8, 7x8, 6x8, 5x8, 4x8, 2x8, 1x8. Make sure that with every variation, every core has an identical amount of compute compared to the baseline 8x8 case.
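One way to sanity-check the "identical amount of compute per core" requirement is to count multiply-accumulates per core for each grid. The sketch below does this for a 2d matmul under stated assumptions: the baseline dimensions are placeholders, "AxB" is read as x=A and y=B, and each core is assumed to own one output block and reduce over the full inner dimension.

```python
# Grid sizes from the plan above.
GRIDS = ["8x8", "7x8", "6x8", "5x8", "4x8", "2x8", "1x8"]

# Hypothetical baseline 8x8 problem: activations [M, K] times weights [K, N].
BASE_M, BASE_K, BASE_N = 1024, 4096, 16384


def per_core_macs_2d(grid: str) -> int:
    """Multiply-accumulates done by one core for the scaled 2d matmul."""
    # Assumption: x splits the weight width, y splits the activation height.
    x, y = (int(v) for v in grid.split("x"))
    m = BASE_M * y // 8  # activation height scaled with y
    n = BASE_N * x // 8  # weight width scaled with x
    # Each core owns an (m / y) by (n / x) output block over the full K.
    return (m // y) * BASE_K * (n // x)


baseline = per_core_macs_2d("8x8")
for grid in GRIDS:
    # Fails loudly if any grid changes the per-core workload.
    assert per_core_macs_2d(grid) == baseline, grid
```

Because the tensor dimension and the number of cores along that dimension shrink together, the per-core block stays the same size, which is the property the plan asks for.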
For every grid size, test:
Milos