bbradelTT opened 4 months ago
matmul/short/matmul.py
8 "PCC too low" failures
Will need to update the PCC threshold: lowering it from 0.999 to 0.9988 fixes the failures.
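For reference, PCC here is the Pearson correlation coefficient between the device result and a golden reference; a minimal sketch of such a check (the function name and threshold handling are illustrative, not the sweep harness's actual code):

```python
import numpy as np

def pcc(expected, actual):
    # Pearson correlation coefficient between two flattened tensors
    e = np.asarray(expected, dtype=np.float64).ravel()
    a = np.asarray(actual, dtype=np.float64).ravel()
    return float(np.corrcoef(e, a)[0, 1])

# An exact match gives PCC of (numerically) 1.0; the sweep compares
# the computed PCC against a threshold such as 0.9988.
x = np.arange(64, dtype=np.float64)
assert abs(pcc(x, x) - 1.0) < 1e-12
```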
I re-ran sweeps, including a small change I haven't merged yet. Things haven't moved much. I'll keep on looking at individual failures.
matmul/full/matmul_default_width_sharded.py|crashed|814
matmul/full/matmul_default_width_sharded.py|passed|2354
matmul/full/matmul_default_height_sharded.py|crashed|374
matmul/full/matmul_default_height_sharded.py|passed|2058
matmul/full/matmul_default_block_sharded.py|crashed|15834
matmul/full/matmul_default_block_sharded.py|passed|2118
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_width_sharded -k matmul_default_width_sharded.py-968-
used to crash with the exception
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp:1528: Kt % in0_block_w == 0
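The failing assertion requires that the inner dimension in tiles (Kt) be divisible by the chosen in0_block_w. A small sketch of that constraint (helper name is mine, not from the codebase):

```python
TILE_WIDTH = 32

def valid_in0_block_w(k_dim):
    # Kt is the inner dimension measured in tiles; the kernel asserts
    # Kt % in0_block_w == 0, so only divisors of Kt are legal choices.
    kt = k_dim // TILE_WIDTH
    return [w for w in range(1, kt + 1) if kt % w == 0]

# k_dim = 384 -> Kt = 12 -> legal block widths are the divisors of 12
print(valid_in0_block_w(384))  # [1, 2, 3, 4, 6, 12]
```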
With the newest changes, the test goes into an infinite loop on the device, and watcher does not report anything amiss. Will look into what is happening.
For my last comment, it turns out that noc_semaphore_set_multicast_loopback_src is a no-op if there is only one core in the grid and the core is the one doing the sending. It also turns out that noc_async_write_multicast hangs if called with 0 cores. Therefore these are not called in that scenario, and the semaphore is written directly on the core.
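A host-side model of the fixed control flow (a simplification of the device kernel; names and the dict-based semaphore are mine):

```python
def deliver_in0_semaphore(receiver_cores, sender_core, semaphores):
    # Model of the guard described above: the loopback-src multicast is
    # a no-op when the sender is the only core in the grid, and
    # noc_async_write_multicast hangs with 0 destination cores, so in
    # that case the semaphore is written directly on the sending core.
    remote = [c for c in receiver_cores if c != sender_core]
    if not remote:
        semaphores[sender_core] = 1  # direct local write, no multicast
        return "local"
    for c in receiver_cores:  # loopback mcast updates the sender too
        semaphores[c] = 1
    return "mcast"
```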
With https://github.com/tenstorrent/tt-metal/pull/9341 that is now fixed and at least another 200 width sharded sweep tests should pass. I did not run the other sweeps after this change.
After https://github.com/tenstorrent/tt-metal/pull/9387, which tries to set up simple conditions instead of failing with a fatal call, another ~180 width sharded tests should pass.
Also, for block sharded tests, the cumulative effect of these changes is another ~1200 tests passing.
Most up to date numbers:
Height: 2058/2432 passed (pretty stale)
Width: 2729/3168 passed
Block: at least 3208/17952 passed
The number of failing/crashing tests has dropped from over 17k to ~15.5k.
There are still 8 "no space for circular buffer" failures for matmul_default_interleaved.py. They are for test indices 1982-1989 and can be triggered via:
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1982-
...
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1989-
...
Many of the remaining failures are due to running out of memory/space.
After the latest change from @TT-BrianLiu (https://github.com/tenstorrent/tt-metal/pull/9388) the number of failing/crashing tests has decreased to ~14.5k:
interleaved: 1992 passed, 0 failed, 8 crashed
height: 2376 passed, 0 failed, 56 crashed
width: 2732 passed, 264 failed, 172 crashed
block: 3880 passed, 56 failed, 14016 crashed
For height sharded, there are two types of crashes, affecting 24 and 32 tests respectively:
matmul/full/matmul_default_height_sharded.py|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:942: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
matmul/full/matmul_default_height_sharded.py|2024-06-13_22-23-56|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1014: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
After https://github.com/tenstorrent/tt-metal/pull/9447 block: 9712 passed, 8240 not passed
That means that PR fixed ~4k tests.
Across the 4 full sweeps:
Interleaved: 2000 passed
Width: 2848 passed, 185 failed, 135 crashed (3168 total)
Height: 2392 passed, 40 crashed (2432 total)
Block: 9712 passed, 8240 crashed (17952 total)
The crashes for Block Sharded can be grouped into the following. The big issue that's left is "1D mcast for in0 or in1 is not implemented yet." which @TT-BrianLiu is looking into.
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")>1 group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1029: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::HEIGHT_SHARDED
backt|536
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1212: (input_tensor_a.get_legacy_shape()[-1] / TILE_WIDTH) % program_config.in0_block_w == 0
i|448
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:988: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
backtra|852
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1299: num_blocks_x <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1301: num_blocks_y <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1343: false
info:
1D m|4196
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1345: false
info:
Grid|242
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")<=1 group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |380
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|940
TT_THROW @ ../tt_metal/impl/program/program.cpp:505: tt::|546
For Width Sharded:
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")>1 group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1024: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|53
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")<=1 group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |7
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|75
For Height Sharded:
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_503190626 where status = "crashed" and instr(exception,"TT_FATAL @")>1 group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1060: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|40
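Grouping queries like the ones above can be reproduced with Python's sqlite3 module. This stand-in uses an in-memory table with synthetic rows (the real table names and exception strings differ), but the grouping expression matches the queries run against the sweep-results tables:

```python
import sqlite3

# Synthetic stand-in for a sweep-results table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (status TEXT, exception TEXT)")
con.executemany(
    "INSERT INTO results VALUES ('crashed', ?)",
    [
        ("Exception: TT_FATAL @ bmm_op.cpp:942: cond_a",),
        ("Exception: TT_FATAL @ bmm_op.cpp:942: cond_a",),
        ("Exception: TT_FATAL @ bmm_op.cpp:1014: cond_b",),
    ],
)
# Group crashes by the first 150 chars of the TT_FATAL message,
# most frequent first
rows = con.execute(
    "SELECT substr(exception, instr(exception, 'TT_FATAL @'), 150), count(*) "
    "FROM results "
    "WHERE status = 'crashed' AND instr(exception, 'TT_FATAL @') > 1 "
    "GROUP BY substr(exception, instr(exception, 'TT_FATAL @'), 150) "
    "ORDER BY 2 DESC"
).fetchall()
print(rows)
# [('TT_FATAL @ bmm_op.cpp:942: cond_a', 2),
#  ('TT_FATAL @ bmm_op.cpp:1014: cond_b', 1)]
```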
Starting to look at this issue again.
The sweeps have been moved to the new framework.
I ran the block sharded sweep on Grayskull while removing the double buffering.
One thing the new framework picked up is that the input generation produces duplicate test cases.
Therefore for block sharded there are 16896 test cases:
The tests showed that:
Will continue to look at the failures.
For the 992 other reasons:
I created https://github.com/tenstorrent/tt-metal/issues/11392 to track fixing those problems.
I ran width and height sharded as well.
width:
height:
For the failures with PCC > 0.95, I will look at which shapes cause them and add shape-size-specific PCCs. For the other ones, I'll need to investigate. I'll create a follow-up issue after dealing with the block sharded failures.
For width sharded, the failures with PCC > 0.95 seem to occur when input a and the output are 8-bit.
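A per-configuration threshold along those lines could look like the sketch below. The 0.95 and 0.9988 values come from this thread; the dtype names mirror ttnn's bfloat8_b, and the shape cutoff is an illustrative placeholder, not a measured value:

```python
def expected_pcc(dtype_a, dtype_out, k_dim):
    # 8-bit block-float input a / output get a looser bound, as
    # observed in the width sharded sweeps
    if dtype_a == "bfloat8_b" or dtype_out == "bfloat8_b":
        return 0.95
    # larger inner dimensions accumulate more rounding error;
    # the 2048 cutoff is a placeholder
    return 0.9988 if k_dim >= 2048 else 0.999
```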
https://github.com/tenstorrent/tt-metal/issues/11392 has addressed the block sharded failures, and fixed some of the width sharded failures.
Updated info:
block:
width:
Dealing with the CB allocation issue will be a larger project.
Next will be an investigation into the really bad PCCs. That is being done as part of https://github.com/tenstorrent/tt-metal/issues/10936 which independently saw bad PCCs in a specific setup of running a model.
For the 40 height sharded failures, it looks like the new sweeps framework is somehow set up in a way that skips calling validate(). The fix is probably to change the program config to MatmulMultiCoreReuseProgramConfig.
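A sketch of what that program config might look like. Treat the field names and value types as assumptions based on ttnn's matmul program configs around this time, and the concrete numbers as placeholders rather than values the sweep would compute:

```python
import ttnn

# Placeholder values; in the sweep these would be derived from the
# input shapes and the shard grid
program_config = ttnn.MatmulMultiCoreReuseProgramConfig(
    compute_with_storage_grid_size=(8, 4),
    in0_block_w=2,
    out_subblock_h=1,
    out_subblock_w=1,
    per_core_M=4,
    per_core_N=4,
)
```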
Describe the bug
Some matmul sweep tests are failing

Issue purpose

To Reproduce
Steps to reproduce the behavior: run matmul sweeps

Expected behavior
All sweeps pass
Please complete the following environment information:
Additional context
The sweep tests not listed below passed.
Tests run via e.g.
python tests/ttnn/sweep_tests/run_sweeps.py --include matmul/short/matmul.py
matmul/short/matmul.py
matmul/full/matmul_default_interleaved.py
matmul/full/matmul_default_height_sharded.py
- "Matmul program config could not be determined for given input shapes!" failures
matmul/full/matmul_default_width_sharded.py
- "Kt % in0_block_w == 0" assertion failures
matmul/full/matmul_default_block_sharded.py
- 8012 shape failures
- 990 no space for circular buffer failures
- 280 "per_core_N must be greater than 0!" failures
- 6160 "Num cores along x must match provided grid size!" failures
- 392 "Num total blocks must be smaller than num blocks y and num blocks x" failures
- [x] #9449