tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Some matmul sweep tests are failing #9059

Open bbradelTT opened 4 months ago

bbradelTT commented 4 months ago

Describe the bug
Some matmul sweep tests are failing.

Issue purpose

  1. Analyze errors
  2. Fix low hanging fruit
  3. Spawn off separate issues for harder to deal with problems

To Reproduce
Steps to reproduce the behavior: run the matmul sweeps.

Expected behavior
All sweeps pass.


Additional context

The sweep tests not listed below passed.

Tests were run via, e.g., `python tests/ttnn/sweep_tests/run_sweeps.py --include matmul/short/matmul.py`

matmul/short/matmul.py

matmul/full/matmul_default_interleaved.py

matmul/full/matmul_default_height_sharded.py

matmul/full/matmul_default_width_sharded.py

matmul/full/matmul_default_block_sharded.py

bbradelTT commented 4 months ago

matmul/short/matmul.py

8 tests fail with PCC too low.

The PCC threshold will need to be updated: relaxing it from 0.999 to 0.9988 fixes these failures.
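For context, the PCC metric these sweeps check is the Pearson correlation coefficient between the golden reference result and the device result. A minimal sketch of the check (the helper and the sample values are illustrative, not the actual tt-metal harness):

```python
# Minimal sketch of the PCC check: the Pearson correlation coefficient
# between the golden (reference) output and the device output.
# Helper names and sample data are illustrative, not tt-metal code.
import math

def pcc(golden, actual):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(golden)
    mg = sum(golden) / n
    ma = sum(actual) / n
    cov = sum((g - mg) * (a - ma) for g, a in zip(golden, actual))
    var_g = sum((g - mg) ** 2 for g in golden)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / math.sqrt(var_g * var_a)

golden = [1.0, 2.0, 3.0, 4.0]
noisy = [1.0, 2.01, 2.98, 4.02]  # small numerical error in the device result
assert pcc(golden, noisy) > 0.9988       # passes the relaxed threshold
assert abs(pcc(golden, golden) - 1.0) < 1e-9
```

A tiny amount of accumulated rounding error pushes the PCC just below a very tight threshold like 0.999, which is why a small relaxation can turn 8 failures into passes.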

bbradelTT commented 4 months ago

I re-ran sweeps, including a small change I haven't merged yet. Things haven't moved much. I'll keep on looking at individual failures.

| sweep | status | count |
|---|---|---|
| matmul/full/matmul_default_width_sharded.py | crashed | 814 |
| matmul/full/matmul_default_width_sharded.py | passed | 2354 |
| matmul/full/matmul_default_height_sharded.py | crashed | 374 |
| matmul/full/matmul_default_height_sharded.py | passed | 2058 |
| matmul/full/matmul_default_block_sharded.py | crashed | 15834 |
| matmul/full/matmul_default_block_sharded.py | passed | 2118 |

bbradelTT commented 4 months ago

`pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_width_sharded -k matmul_default_width_sharded.py-968-` used to crash with the exception:

```
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp:1528: Kt % in0_block_w == 0
```

With the newest changes, the test goes into an infinite loop on the device, and watcher does not report anything amiss. Will look into what is happening.
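The `Kt % in0_block_w == 0` assert requires the inner dimension, counted in tiles, to split evenly into `in0_block_w`-sized blocks. A hedged sketch of picking a config that satisfies it (the helper name and the `max_block_w` cap are illustrative, not tt-metal API):

```python
# Sketch of the divisibility constraint behind the TT_FATAL above:
# Kt (the K dimension counted in 32-wide tiles) must be a multiple of
# in0_block_w. Names and the max_block_w cap are illustrative only.
TILE_WIDTH = 32

def largest_valid_in0_block_w(K, max_block_w):
    """Largest in0_block_w <= max_block_w that divides Kt evenly."""
    Kt = K // TILE_WIDTH
    for bw in range(min(max_block_w, Kt), 0, -1):
        if Kt % bw == 0:
            return bw
    return 1

# K = 320 -> Kt = 10; the largest divisor of 10 not exceeding 4 is 2.
assert largest_valid_in0_block_w(320, 4) == 2
assert (320 // TILE_WIDTH) % largest_valid_in0_block_w(320, 4) == 0
```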

bbradelTT commented 4 months ago

For my last comment, it turns out that noc_semaphore_set_multicast_loopback_src is a no-op if there is only one core in the grid and the core is the one doing the sending. It also turns out that noc_async_write_multicast hangs if called with 0 cores. Therefore these are not called in that scenario, and the semaphore is written directly on the core.

With https://github.com/tenstorrent/tt-metal/pull/9341 that is now fixed and at least another 200 width sharded sweep tests should pass. I did not run the other sweeps after this change.
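The fix described above amounts to a guard around the multicast path. A control-flow sketch only; the function names stand in for the real device dataflow APIs and are not actual tt-metal signatures:

```python
# Control-flow sketch of the single-core guard: on a 1-core grid where the
# sender is the only destination, multicast is a no-op (and hangs with zero
# destination cores), so the semaphore is written locally instead.
def set_semaphore_mcast(sem, value, dest_cores, sender_core,
                        write_local, mcast_write):
    others = [c for c in dest_cores if c != sender_core]
    if not others:
        write_local(sem, value)          # direct write on the sending core
    else:
        mcast_write(sem, value, others)  # normal multicast path

local_writes = {}

def fail_mcast(sem, value, cores):
    raise RuntimeError("multicast with 0 destination cores would hang")

set_semaphore_mcast("sem0", 1, dest_cores=[(0, 0)], sender_core=(0, 0),
                    write_local=lambda s, v: local_writes.__setitem__(s, v),
                    mcast_write=fail_mcast)
assert local_writes == {"sem0": 1}  # the guard avoided the multicast path
```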

bbradelTT commented 4 months ago

After https://github.com/tenstorrent/tt-metal/pull/9387, which tries to set up simple conditions instead of failing with a fatal call, another ~180 width sharded tests should pass.

Also, for block sharded tests, the cumulative effect of the changes is another ~1200 tests passing.

Most up-to-date numbers:
- Height: 2058/2432 passed (pretty stale)
- Width: 2729/3168 passed
- Block: at least 3208/17952 passed

The number of failing/crashing tests has improved from over 17k to ~15.5k.

bbradelTT commented 4 months ago

There are still 8 "no space for circular buffer" failures for matmul_default_interleaved.py. They are for test indices 1982-1989 and can be triggered via:

```
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1982-
...
pytest tests/ttnn/sweep_tests/test_sweeps.py::test_matmul_default_interleaved -k matmul_default_interleaved.py-1989-
```

Many of the remaining failures are due to running out of memory/space.

bbradelTT commented 3 months ago

After the latest change from @TT-BrianLiu (https://github.com/tenstorrent/tt-metal/pull/9388) the number of failing/crashing tests has decreased to ~14.5k:

- interleaved: 1992 passed, 0 failed, 8 crashed
- height: 2376 passed, 0 failed, 56 crashed
- width: 2732 passed, 264 failed, 172 crashed
- block: 3880 passed, 56 failed, 14016 crashed

bbradelTT commented 3 months ago

For height sharded, there are two types of crashes, affecting 24 and 32 tests respectively:

```
matmul/full/matmul_default_height_sharded.py|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:942: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
matmul/full/matmul_default_height_sharded.py|2024-06-13_22-23-56|crashed|Exception: TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1014: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
```

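The second assert encodes a simple rule on output subblocks: they must either span the full per-core width or be a single tile tall. A sketch of the condition (values are illustrative):

```python
# Sketch of the program-config rule in the second TT_FATAL above: the output
# subblock must either span the full per-core width or be one tile tall.
def subblock_valid(out_subblock_w, out_subblock_h, per_core_N):
    return out_subblock_w == per_core_N or out_subblock_h == 1

assert subblock_valid(4, 2, 4)       # full-width subblock: allowed
assert subblock_valid(2, 1, 4)       # single-row subblock: allowed
assert not subblock_valid(2, 2, 4)   # neither condition: trips the assert
```
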
bbradelTT commented 3 months ago

After https://github.com/tenstorrent/tt-metal/pull/9447, block: 9712 passed, 8240 not passed.

That means the PR fixed ~4k tests.

bbradelTT commented 3 months ago

Across the 4 full sweeps:

- Interleaved: 2000 passed
- Width: 2848 passed, 185 failed, 135 crashed (3168 total)
- Height: 2392 passed, 40 crashed (2432 total)
- Block: 9712 passed, 8240 crashed (17952 total)

bbradelTT commented 3 months ago

The crashes for Block Sharded can be grouped into the following. The big issue that's left is "1D mcast for in0 or in1 is not implemented yet." which @TT-BrianLiu is looking into.

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1029: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::HEIGHT_SHARDED
backt|536
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1212: (input_tensor_a.get_legacy_shape()[-1] / TILE_WIDTH) % program_config.in0_block_w == 0
i|448
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:988: input_tensor_a.memory_config().memory_layout == TensorMemoryLayout::WIDTH_SHARDED
backtra|852
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1299: num_blocks_x <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1301: num_blocks_y <= |50
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1343: false
info:
1D m|4196
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp:1345: false
info:
Grid|242
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_146740210 where status = "crashed" and instr(exception,"TT_FATAL @")<=1  group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |380
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|940
TT_THROW @ ../tt_metal/impl/program/program.cpp:505: tt::|546
```

For Width Sharded:

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1024: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|53
sqlite> select substr(exception,instr(exception,"TT_THROW @"),57), count(*) from table_170595335 where status = "crashed" and instr(exception,"TT_FATAL @")<=1  group by substr(exception,instr(exception,"TT_THROW @"),57);
TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:118: |7
TT_THROW @ ../tt_metal/impl/program/program.cpp:496: tt::|75
```

For Height Sharded:

```
sqlite> select substr(exception,instr(exception,"TT_FATAL @"),150), count(*) from table_503190626 where status = "crashed" and instr(exception,"TT_FATAL @")>1  group by substr(exception,instr(exception,"TT_FATAL @"),150);
TT_FATAL @ ../tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:1060: program_config.out_subblock_w == per_core_N || program_config.out_subblock_h == 1
backtr|40
```
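The shell queries above can also be run from Python with the `sqlite3` module. A minimal sketch with a made-up in-memory table, since the real sweep-results database and the per-run table names (e.g. `table_503190626`) vary:

```python
# Reproduces the exception-grouping queries above with Python's sqlite3.
# The schema (status, exception) is implied by the queries; rows are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (status TEXT, exception TEXT)")
con.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("crashed", "Exception: TT_FATAL @ bmm_op.cpp:988: width sharded"),
     ("crashed", "Exception: TT_FATAL @ bmm_op.cpp:988: width sharded"),
     ("crashed", "Exception: TT_THROW @ allocator.cpp:118: out of memory"),
     ("passed", "")])

# Group crashed tests by the first 150 chars of the TT_FATAL message.
key = "substr(exception, instr(exception, 'TT_FATAL @'), 150)"
rows = con.execute(
    f"SELECT {key}, count(*) FROM results "
    f"WHERE status = 'crashed' AND instr(exception, 'TT_FATAL @') > 1 "
    f"GROUP BY {key}").fetchall()
assert rows == [("TT_FATAL @ bmm_op.cpp:988: width sharded", 2)]
```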
bbradelTT commented 1 month ago

Starting to look at this issue again.

The sweeps have been moved to the new framework.

I ran the block sharded sweep on Grayskull while removing the double buffering.

One thing that the new framework picked up is that there are duplicates in the way that the inputs are generated.

Therefore for block sharded there are 16896 test cases:

The tests showed that:

Will continue to look at the failures.

bbradelTT commented 1 month ago

For the 992 other reasons:

I created https://github.com/tenstorrent/tt-metal/issues/11392 to track fixing those problems.

bbradelTT commented 1 month ago

I ran width and height sharded as well.

width:

height:

For the PCC > 0.95 failures, I will look at which shapes cause them and add shape-size-specific PCC thresholds. For the other ones, I'll need to investigate. I'll create a follow-up issue after dealing with the block sharded failures.
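One way shape-size-specific thresholds could look, as a hypothetical sketch; the cutoffs and values below are illustrative, not numbers from the sweeps:

```python
# Hypothetical shape-dependent PCC thresholds: longer inner dimensions
# accumulate more rounding error, so they get a looser bound.
def pcc_threshold(k_dim):
    if k_dim <= 256:
        return 0.999
    if k_dim <= 1024:
        return 0.9988
    return 0.998

assert pcc_threshold(128) == 0.999
assert pcc_threshold(512) == 0.9988
assert pcc_threshold(4096) == 0.998
```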

bbradelTT commented 1 month ago

For width sharded, the PCCs > 0.95 seem to occur when input a and the output are 8-bit.

bbradelTT commented 1 month ago

https://github.com/tenstorrent/tt-metal/issues/11392 has addressed the block sharded failures, and fixed some of the width sharded failures.

Updated info:

block:

width:

Dealing with the CB allocation issue will be a larger project.

Next will be an investigation into the really bad PCCs. That is being done as part of https://github.com/tenstorrent/tt-metal/issues/10936 which independently saw bad PCCs in a specific setup of running a model.

bbradelTT commented 1 month ago

For the 40 height sharded failures, it looks like the new sweeps framework is somehow set up in a way that skips calling validate(). The fix is probably to change the program config to MatmulMultiCoreReuseProgramConfig.