tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
[N300] TTNN Unit Test Failures: Compute Grid sizes #6984

Open cfjchu opened 5 months ago

cfjchu commented 5 months ago

Failure:

- RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/sharded/sharded_op.cpp:42: this->grid_size.x <= device_grid.x && this->grid_size.y <= device_grid.y

repro: There are a number of tests failing with this failure condition:

pytest -svv tests/ttnn/unit_tests/operations/test_group_norm_v2.py::test_group_norm_with_block_sharded_v2_8x8_grid[N=2-C=320-H=64-W=64-num_groups=32]

fyi @arakhmati @xanderchin @jliangTT

cfjchu commented 5 months ago

Full list:

FAILED tests/ttnn/unit_tests/operations/test_linear.py::test_linear_with_core_grid[core_grid=False-use_bias=True-n_size=1024-k_size=1024-m_size=384-batch_size=8] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_linear.py::test_linear_with_core_grid[core_grid=False-use_bias=False-n_size=1024-k_size=1024-m_size=384-batch_size=8] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=1024-k_size=640-n_size=2560-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=64-k_size=96-n_size=160-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=4096-k_size=320-n_size=1280-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=64-k_size=1280-n_size=5120-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=64-k_size=64-n_size=160-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=1024-k_size=640-n_size=768-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=96-k_size=160-n_size=96-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=1024-k_size=1024-n_size=96-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=96-k_size=768-n_size=1024-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=1-channel_b=1-m_size=32-k_size=1280-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=4096-k_size=96-n_size=64-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=64-k_size=5120-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=4096-k_size=64-n_size=96-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=1024-k_size=768-n_size=640-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=256-k_size=1280-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=1024-k_size=96-n_size=96-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=1024-k_size=640-n_size=2304-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=1-channel_b=1-m_size=32-k_size=1280-n_size=320-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=96-k_size=768-n_size=2560-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=4096-k_size=1280-n_size=320-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=1024-k_size=2560-n_size=640-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=256-k_size=1280-n_size=3840-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=1-channel_b=1-m_size=32-k_size=320-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=4096-k_size=512-n_size=320-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=64-k_size=1280-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=256-k_size=5120-n_size=1280-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=256-k_size=1280-n_size=1280-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=256-k_size=160-n_size=96-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=256-k_size=256-n_size=160-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=96-k_size=768-n_size=1536-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=64-k_size=1280-n_size=3840-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=256-k_size=96-n_size=160-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=64-k_size=1280-n_size=1280-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=1-channel_b=1-m_size=32-k_size=1280-n_size=640-has_bias=True] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=64-k_size=160-n_size=64-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=4096-k_size=320-n_size=1536-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=256-k_size=1280-n_size=5120-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=4096-k_size=4096-n_size=64-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=2-channel_a=8-channel_b=8-m_size=256-k_size=160-n_size=256-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul[dtype=DataType.BFLOAT8_B-batch_size=1-channel_a=2-channel_b=1-m_size=4096-k_size=320-n_size=512-has_bias=False] - RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:851: program_config.compute_with_storage_grid_size.y <= input_tensor_a.device()->compute_with_storage_grid_size().y

tt-nshanker commented 5 months ago

@yugaoTT This is a failure on WH for the group norm unit test. Can you please take a look at it? The other failures are not related to GN.

yugaoTT commented 5 months ago

@tt-nshanker I didn't see any failure on my local machine. @cfjchu, are you using a harvested machine?

cfjchu commented 5 months ago

@yugaoTT Yes, I've tagged it in the issue. N300 is two-row harvested and it's running in CI. Here's the run where it failed: https://github.com/tenstorrent-metal/tt-metal/actions/runs/8515181276/job/23322247246

Search for "test_group_norm_v2" and you'll see a number of failing variants.

yugaoTT commented 5 months ago

Yeah, the problem is that a harvested machine doesn't have an 8x8 grid, and the test is written for an 8x8 grid. Should we bypass the test if the grid size is not 8x8?

cfjchu commented 5 months ago

> yeah problem is harvested machine doesn't have 8x8 grid, and the test is for 8x8 test, should we bypass the test if grid size is not 8x8?

If it's only intended to work for 8x8 and we're okay losing the coverage on N300 then yes.

tt-nshanker commented 5 months ago

@yugaoTT Can you please add a pytest skip for WH machines that do not have 8 rows of worker cores, and keep this test enabled for WH machines that have 8? Something like: if wormhole and device.core_grid.y == 7: pytest.skip()

yugaoTT commented 5 months ago

Looks like it is already in main: if device.core_grid.y == 7: pytest.skip()
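
For reference, the skip discussed above could also be written against the grid a test needs rather than a hard-coded row count. This is only a minimal sketch, not the check that is in main: `Grid`, `grid_fits`, and `skip_reason` are hypothetical helper names, and reading `device.core_grid.x` alongside the `device.core_grid.y` quoted above is an assumption. The comparison mirrors the TT_FATAL condition from the original failure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grid:
    """Hypothetical stand-in for a core grid: x columns by y rows."""
    x: int
    y: int


def grid_fits(required: Grid, available: Grid) -> bool:
    """Same shape as the TT_FATAL check: the requested grid must fit in both dimensions."""
    return required.x <= available.x and required.y <= available.y


def skip_reason(required: Grid, available: Grid) -> Optional[str]:
    """Return a skip message when the required grid does not fit, otherwise None."""
    if grid_fits(required, available):
        return None
    return (
        f"test needs a {required.x}x{required.y} grid, "
        f"device has {available.x}x{available.y}"
    )


# Inside a test with a real `device` fixture, this might be used as
# (assumes device.core_grid exposes both x and y):
#     reason = skip_reason(Grid(8, 8), Grid(device.core_grid.x, device.core_grid.y))
#     if reason:
#         pytest.skip(reason)
```

With this shape, an 8x8 test would be skipped on a device reporting an 8x7 grid and would still run on one reporting 8x8 or larger, so coverage is kept wherever the grid fits.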

jliangTT commented 5 months ago

Have we tested this change, and can we close this?