tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Some matmul sweep tests are failing #9059

Open bbradelTT opened 4 months ago

bbradelTT commented 4 months ago

Describe the bug
Some matmul sweep tests are failing.

Issue purpose

  1. Analyze errors
  2. Fix low hanging fruit
  3. Spawn off separate issues for harder to deal with problems

To Reproduce
Steps to reproduce the behavior: run the matmul sweeps.

Expected behavior
All sweeps pass.


Additional context

The sweep tests not listed below passed.

Tests were run via, e.g., `python tests/ttnn/sweep_tests/run_sweeps.py --include matmul/short/matmul.py`

matmul/short/matmul.py

matmul/full/matmul_default_interleaved.py

matmul/full/matmul_default_height_sharded.py

matmul/full/matmul_default_width_sharded.py

matmul/full/matmul_default_block_sharded.py

bbradelTT commented 4 months ago

matmul/short/matmul.py

8 tests fail with PCC too low.

The PCC threshold will need to be updated: relaxing it from 0.999 to 0.9988 fixes these failures.
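For context, the PCC metric these sweeps check is the Pearson correlation coefficient between the golden reference result and the device result. A minimal sketch of the check (the helper and the sample values are illustrative, not the actual tt-metal harness):

```python
# Minimal sketch of the PCC check: the Pearson correlation coefficient
# between the golden (reference) output and the device output.
# Helper names and sample data are illustrative, not tt-metal code.
import math

def pcc(golden, actual):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(golden)
    mg = sum(golden) / n
    ma = sum(actual) / n
    cov = sum((g - mg) * (a - ma) for g, a in zip(golden, actual))
    var_g = sum((g - mg) ** 2 for g in golden)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / math.sqrt(var_g * var_a)

golden = [1.0, 2.0, 3.0, 4.0]
noisy = [1.0, 2.01, 2.98, 4.02]  # small numerical error in the device result
assert pcc(golden, noisy) > 0.9988       # passes the relaxed threshold
assert abs(pcc(golden, golden) - 1.0) < 1e-9
```

A tiny amount of accumulated rounding error pushes the PCC just below a very tight threshold like 0.999, which is why a small relaxation can turn 8 failures into passes.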

bbradelTT commented 4 months ago

I re-ran sweeps, including a small change I haven't merged yet. Things haven't moved much. I'll keep on looking at individual failures.

| sweep | status | count |
|---|---|---|
| matmul/full/matmul_default_width_sharded.py | crashed | 814 |
| matmul/full/matmul_default_width_sharded.py | passed | 2354 |
| matmul/full/matmul_default_height_sharded.py | crashed | 374 |
| matmul/full/matmul_default_height_sharded.py | passed | 2058 |
| matmul/full/matmul_default_block_sharded.py | crashed | 15834 |
| matmul/full/matmul_default_block_sharded.py | passed | 2118 |

bbradelTT commented 4 months ago

`pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_width_sharded -k matmul_default_width_sharded.py-968-` used to crash with the exception:

```
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp:1528: Kt % in0_block_w == 0
```

With the newest changes, the test goes into an infinite loop on the device, and watcher does not report anything amiss. Will look into what is happening.
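The `Kt % in0_block_w == 0` assert requires the inner dimension, counted in tiles, to split evenly into `in0_block_w`-sized blocks. A hedged sketch of picking a config that satisfies it (the helper name and the `max_block_w` cap are illustrative, not tt-metal API):

```python
# Sketch of the divisibility constraint behind the TT_FATAL above:
# Kt (the K dimension counted in 32-wide tiles) must be a multiple of
# in0_block_w. Names and the max_block_w cap are illustrative only.
TILE_WIDTH = 32

def largest_valid_in0_block_w(K, max_block_w):
    """Largest in0_block_w <= max_block_w that divides Kt evenly."""
    Kt = K // TILE_WIDTH
    for bw in range(min(max_block_w, Kt), 0, -1):
        if Kt % bw == 0:
            return bw
    return 1

# K = 320 -> Kt = 10; the largest divisor of 10 not exceeding 4 is 2.
assert largest_valid_in0_block_w(320, 4) == 2
assert (320 // TILE_WIDTH) % largest_valid_in0_block_w(320, 4) == 0
```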

bbradelTT commented 4 months ago

For my last comment, it turns out that noc_semaphore_set_multicast_loopback_src is a no-op if there is only one core in the grid and the core is the one doing the sending. It also turns out that noc_async_write_multicast hangs if called with 0 cores. Therefore these are not called in that scenario, and the semaphore is written directly on the core.

With https://github.com/tenstorrent/tt-metal/pull/9341 that is now fixed and at least another 200 width sharded sweep tests should pass. I did not run the other sweeps after this change.
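The fix described above amounts to a guard around the multicast path. A control-flow sketch only; the function names stand in for the real device dataflow APIs and are not actual tt-metal signatures:

```python
# Control-flow sketch of the single-core guard: on a 1-core grid where the
# sender is the only destination, multicast is a no-op (and hangs with zero
# destination cores), so the semaphore is written locally instead.
def set_semaphore_mcast(sem, value, dest_cores, sender_core,
                        write_local, mcast_write):
    others = [c for c in dest_cores if c != sender_core]
    if not others:
        write_local(sem, value)          # direct write on the sending core
    else:
        mcast_write(sem, value, others)  # normal multicast path

local_writes = {}

def fail_mcast(sem, value, cores):
    raise RuntimeError("multicast with 0 destination cores would hang")

set_semaphore_mcast("sem0", 1, dest_cores=[(0, 0)], sender_core=(0, 0),
                    write_local=lambda s, v: local_writes.__setitem__(s, v),
                    mcast_write=fail_mcast)
assert local_writes == {"sem0": 1}  # the guard avoided the multicast path
```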

bbradelTT commented 4 months ago

After https://github.com/tenstorrent/tt-metal/pull/9387, which tries to set up simple conditions instead of failing with a fatal call, another ~180 width sharded tests should pass.

Also, for block sharded tests, the cumulative effect of the changes is another ~1200 tests passing.

Most up-to-date numbers:
- Height: 2058/2432 passed (pretty stale)
- Width: 2729/3168 passed
- Block: at least 3208/17952 passed

The number of failing/crashing tests has improved from over 17k to ~15.5k.

bbradelTT commented 4 months ago

There are still 8 "no space for circular buffer" failures for matmul_default_interleaved.py. They are for test indices 1982-1989 and can be triggered via:

```
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1982-
...
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1989-
```

Many of the remaining failures are due to running out of memory/space.

bbradelTT commented 3 months ago

After the latest change from @TT-BrianLiu (https://github.com/tenstorrent/tt-metal/pull/9388) the number of failing/crashing tests has decreased to ~14.5k:

- interleaved: 1992 passed, 0 failed, 8 crashed
- height: 2376 passed, 0 failed, 56 crashed
- width: 2732 passed, 264 failed, 172 crashed
- block: 3880 passed, 56 failed, 14016 crashed

bbradelTT commented 3 months ago

For height sharded, there are two types of crashes, affecting 24 and 32 tests respectively:

```
matmul/full/matmul_default_height_sharded.py|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:942: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
matmul/full/matmul_default_height_sharded.py|2024-06-13_22-23-56|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1014: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
```

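The second assert encodes a simple rule on output subblocks: they must either span the full per-core width or be a single tile tall. A sketch of the condition (values are illustrative):

```python
# Sketch of the program-config rule in the second TT_FATAL above: the output
# subblock must either span the full per-core width or be one tile tall.
def subblock_valid(out_subblock_w, out_subblock_h, per_core_N):
    return out_subblock_w == per_core_N or out_subblock_h == 1

assert subblock_valid(4, 2, 4)       # full-width subblock: allowed
assert subblock_valid(2, 1, 4)       # single-row subblock: allowed
assert not subblock_valid(2, 2, 4)   # neither condition: trips the assert
```
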
bbradelTT commented 3 months ago

After https://github.com/tenstorrent/tt-metal/pull/9447, block: 9712 passed, 8240 not passed.

That means the PR fixed ~4k tests.

bbradelTT commented 3 months ago

Across the 4 full sweeps:

- Interleaved: 2000 passed
- Width: 2848 passed, 185 failed, 135 crashed (3168 total)
- Height: 2392 passed, 40 crashed (2432 total)
- Block: 9712 passed, 8240 crashed (17952 total)

bbradelTT commented 3 months ago

The crashes for Block Sharded can be grouped into the following. The big issue that's left is "1D mcast for in0 or in1 is not implemented yet." which @TT-BrianLiu is looking into.

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1029: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::HEIGHT_SHARDED
backt|536
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1212: (input_tensor_a.get_legacy_shape()[-1] / TILE_WIDTH) % program_config.in0_block_w == 0
i|448
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:988: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
backtra|852
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1299: num_blocks_x <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1301: num_blocks_y <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1343: false
info:
1D m|4196
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1345: false
info:
Grid|242
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")<=1  group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |380
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|940
TT_THROW @ ../tt_metal/impl/program/program.cpp:505: tt::|546
```

For Width Sharded:

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1024: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|53
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")<=1  group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |7
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|75
```

For Height Sharded:

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_503190626 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1060: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|40
```
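The shell queries above can also be run from Python with the `sqlite3` module. A minimal sketch with a made-up in-memory table, since the real sweep-results database and the per-run table names (e.g. `table_503190626`) vary:

```python
# Reproduces the exception-grouping queries above with Python's sqlite3.
# The schema (status, exception) is implied by the queries; rows are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (status TEXT, exception TEXT)")
con.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("crashed", "Exception: TT_FATAL @ bmm_op.cpp:988: width sharded"),
     ("crashed", "Exception: TT_FATAL @ bmm_op.cpp:988: width sharded"),
     ("crashed", "Exception: TT_THROW @ allocator.cpp:118: out of memory"),
     ("passed", "")])

# Group crashed tests by the first 150 chars of the TT_FATAL message.
key = "substr(exception, instr(exception, 'TT_FATAL @'), 150)"
rows = con.execute(
    f"SELECT {key}, count(*) FROM results "
    f"WHERE status = 'crashed' AND instr(exception, 'TT_FATAL @') > 1 "
    f"GROUP BY {key}").fetchall()
assert rows == [("TT_FATAL @ bmm_op.cpp:988: width sharded", 2)]
```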
bbradelTT commented 1 month ago

Starting to look at this issue again.

The sweeps have been moved to the new framework.

I ran the block sharded sweep on Grayskull while removing the double buffering.

One thing that the new framework picked up is that there are duplicates in the way that the inputs are generated.

Therefore for block sharded there are 16896 test cases:

The tests showed that:

Will continue to look at the failures.

bbradelTT commented 1 month ago

For the 992 other reasons:

I created https://github.com/tenstorrent/tt-metal/issues/11392 to track fixing those problems.

bbradelTT commented 1 month ago

I ran width and height sharded as well.

width:

height:

For the PCC > 0.95 failures, I will look at which shapes cause them and add shape-size-specific PCC thresholds. For the other ones, I'll need to investigate. I'll create a follow-up issue after dealing with the block sharded failures.
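One way shape-size-specific thresholds could look, as a hypothetical sketch; the cutoffs and values below are illustrative, not numbers from the sweeps:

```python
# Hypothetical shape-dependent PCC thresholds: longer inner dimensions
# accumulate more rounding error, so they get a looser bound.
def pcc_threshold(k_dim):
    if k_dim <= 256:
        return 0.999
    if k_dim <= 1024:
        return 0.9988
    return 0.998

assert pcc_threshold(128) == 0.999
assert pcc_threshold(512) == 0.9988
assert pcc_threshold(4096) == 0.998
```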

bbradelTT commented 1 month ago

For width sharded, the PCCs > 0.95 seem to occur when input a and the output are 8-bit.

bbradelTT commented 1 month ago

https://github.com/tenstorrent/tt-metal/issues/11392 has addressed the block sharded failures, and fixed some of the width sharded failures.

Updated info:

block:

width:

Dealing with the CB allocation issue will be a larger project.

Next will be an investigation into the really bad PCCs. That is being done as part of https://github.com/tenstorrent/tt-metal/issues/10936 which independently saw bad PCCs in a specific setup of running a model.

bbradelTT commented 1 month ago

For the 40 height sharded failures, it looks like the new sweeps framework is somehow set up in a way that skips calling validate(). The fix is probably to change the program config to MatmulMultiCoreReuseProgramConfig.