tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
415 stars 53 forks source link

Fix BH test_tilize_matmul hang #12602

Open bbradelTT opened 2 weeks ago

bbradelTT commented 2 weeks ago

For BH misc/test_tilize_matmul.py hanging

bbradelTT commented 3 days ago

Uses MatmulMultiCoreProgramConfig

pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
...
                     Op | DEBUG    | Launching Device Operation: "Matmul"
                     Op | DEBUG    | Program Hash: 0
                     Op | DEBUG    | Program Cache Hit: false
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    program_config = std::nullopt
                     Op | DEBUG    |    bcast_batch = 1
                     Op | DEBUG    |    output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT16
                     Op | DEBUG    |    compute_kernel_config = GrayskullComputeKernelConfig(math_fidelity=HiFi2,math_approx_mode=1)
                     Op | DEBUG    |    untilize_out = false
                     Op | DEBUG    |    user_core_coord = std::nullopt
                     Op | DEBUG    |    user_fused_activation = std::nullopt
                     Op | DEBUG    |    user_run_batched = false
                     Op | DEBUG    |    transpose_a = false
                     Op | DEBUG    |    transpose_b = false
                     Op | DEBUG    |    output_tile = std::nullopt
                     Op | DEBUG    | Tensors Args:
                     Op | DEBUG    |    0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))
                     Op | DEBUG    |    1: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))
                     Op | DEBUG    | 
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreProgramConfig()
                     Op | DEBUG    | Auto generated program config: MatmulMultiCoreProgramConfig()
bbradelTT commented 3 days ago

The config is identical to the first test in https://github.com/tenstorrent/tt-metal/issues/12604#issuecomment-2374653308 and yet that one passes and this one fails.

From the comment:

2024-09-25 16:47:36.470 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:89 - matmul test passed with input shape [[1, 1, 32, 32], [1, 1, 32, 32]].
PASSED                  Metal | INFO     | Closing device 0
bbradelTT commented 3 days ago

Watcher output:

Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): CWFW,   W,   W,   D,   D  rmsg:D1G|BNT smsg:DDGG k_ids:7|6|8
Device 0 worker core(x= 1,y= 0) phys(x= 2,y= 2):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
...
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[6]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_8bank_output_tiles_partitioned.cpp
k_id[7]: ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[8]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm.cpp
bbradelTT commented 3 days ago

The test is not hanging. Just taking 20+ minutes:

bbradel@yyzo-bh-02-special-bbradel-for-reservation-3967926:/proj_sw/user_dev/bbradel/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
2024-09-25 18:37:16.997 | DEBUG    | ttnn:<module>:82 - Initial ttnn.CONFIG:
Config{cache_path=/home/bbradel/.cache/ttnn,model_cache_path=/home/bbradel/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
2024-09-25 18:37:17.237 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-09-25 18:37:17.240 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
2024-09-25 18:37:17.248 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 18:37:17.248 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reshape be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.squeeze be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_torch be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_torch be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.deallocate be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reallocate be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.load_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.dump_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.as_tensor be migrated to C++?
2024-09-25 18:37:17.255 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 18:37:17.263 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 18:37:17.263 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 18:37:17.264 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
=================================================== test session starts ===================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/user_dev/bbradel/tt-metal/python_env/bin/python3
cachedir: .pytest_cache
rootdir: /proj_sw/user_dev/bbradel/tt-metal, configfile: pytest.ini
plugins: dash-2.15.0, xdist-3.6.1, timeout-2.2.0, anyio-4.4.0, split-0.8.2
timeout: 300.0s
timeout method: signal
timeout func_only: False
collected 1 item                                                                                                          

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test                  Device | INFO     | Opening user mode device driver
2024-09-25 18:37:17.489 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
              LLRuntime | INFO     | Watcher log file: /proj_sw/user_dev/bbradel/tt-metal/generated/watcher/watcher.log
              LLRuntime | INFO     | Watcher attached device 0
              LLRuntime | INFO     | Watcher server initialized, disabled features: None
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
kkkkkk              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
              LLRuntime | INFO     | Watcher checking device 0
2024-09-25 19:03:20.064 | ERROR    | models.utility_functions:comp_pcc:602 - One tensor is all zero
2024-09-25 19:03:20.064 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:48 - Passing=False
2024-09-25 19:03:20.064 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:49 - Output pcc=0.0
FAILED                  Metal | INFO     | Closing device 0
              LLRuntime | INFO     | Watcher detached device 0
              LLRuntime | INFO     | Watcher thread stopped watching...
                  Metal | INFO     | Disabling and clearing program cache on device 0

======================================================== FAILURES =========================================================
_______________________________________________ test_run_tilize_matmul_test _______________________________________________

device = <ttnn._ttnn.device.Device object at 0x7fce2e435d70>

    def test_run_tilize_matmul_test(device):
>       run_tilize_matmul_test(32, 32, 32, device)

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:54: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

M = 32, K = 32, N = 32, device = <ttnn._ttnn.device.Device object at 0x7fce2e435d70>

    def run_tilize_matmul_test(M, K, N, device):
        a_shape = [1, 1, M, K]
        b_shape = [1, 1, K, N]

        A = torch.randn(a_shape)
        B = torch.randn(b_shape) - 0.95

        a = ttnn.Tensor(
            A.flatten().tolist(),
            a_shape,
            ttnn.bfloat16,
            ttnn.ROW_MAJOR_LAYOUT,
            device,
        )
        a_t = ttnn.tilize(a)
        b_t = ttnn.Tensor(
            tilize_to_list(B),
            b_shape,
            ttnn.bfloat16,
            ttnn.TILE_LAYOUT,
            device,
        )
        t2 = ttnn.matmul(a_t, b_t)
        assert list(t2.get_legacy_shape()) == [1, 1, M, N]
        tt_host_rm = t2.cpu().to_torch()
        pyt_got_back = tt_host_rm.reshape((1, 1, M, N))
        pyt_got_back_rm = untilize(pyt_got_back)

        ref_bmm = torch.matmul(A.reshape(1, M, K), B.reshape(1, K, N))
        ref_bmm = ref_bmm.reshape(1, 1, M, N)
        passing_pcc, output_pcc = comp_pcc(ref_bmm, pyt_got_back_rm, 0.99)
        logger.debug(f"Passing={passing_pcc}")
        logger.debug(f"Output pcc={output_pcc}")
>       assert passing_pcc
E       assert False

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:50: AssertionError
==================================================== warnings summary =====================================================
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
  tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:53: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
    def test_run_tilize_matmul_test(device):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
----------- generated xml file: /proj_sw/user_dev/bbradel/tt-metal/generated/test_reports/most_recent_tests.xml -----------
==================================================== slowest durations ====================================================
1555.74s call     tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
6.97s setup    tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
0.00s teardown tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
================================================= short test summary info =================================================
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test - assert False
======================================== 1 failed, 1 warning in 1562.92s (0:26:02) ========================================
                 Device | INFO     | Closing user mode device drivers
bbradelTT commented 2 days ago

I made some changes to try to profile the test, and the test passed once in a few seconds:

(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ export TT_METAL_THREADCOUNT=16
(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ echo $TT_METAL_THREADCOUNT
16
(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ ./tt_metal/tools/profiler/profile_this.py -c "pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py"
2024-09-25 21:17:01.167 | WARNING  | __main__:main:38 - profile_this.py is getting deprecated soon. Please use the tracy.py module with -r option to obtain op reports.
2024-09-25 21:17:01.167 | INFO     | __main__:main:42 - profile_this.py is running pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
2024-09-25 21:17:02.910 | INFO     | __main__:main:89 - Using port 8086
2024-09-25 21:17:02.910 | INFO     | tracy:run_report_setup:87 - Verifying tracy profiling tools
2024-09-25 21:17:02.967 | INFO     | tracy:run_report_setup:108 - Capture command: ('/proj_sw/user_dev/bbradel/tt-metal/build/tools/profiler/bin/capture-release -o /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_profile_log_host.tracy -f -p 8086',)
2024-09-25 21:17:02.975 | INFO     | __main__:main:146 - Test process started
Connecting to 127.0.0.1:8086...2024-09-25 21:17:04.511 | DEBUG    | ttnn:<module>:82 - Initial ttnn.CONFIG:
Config{cache_path=/home/bbradel/.cache/ttnn,model_cache_path=/home/bbradel/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}

Queue delay: 13 ns
Timer resolution: 0 ns
   6.98 Mbps /  9.8% =  71.21 Mbps | Tx: 1548.91 KB | 66.94 MB | 2.21 s2024-09-25 21:17:06.519 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-09-25 21:17:06.522 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
2024-09-25 21:17:06.533 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reshape be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.squeeze be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_torch be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_torch be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.deallocate be migrated to C++?
2024-09-25 21:17:06.534 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reallocate be migrated to C++?
2024-09-25 21:17:06.535 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.load_tensor be migrated to C++?
2024-09-25 21:17:06.535 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.dump_tensor be migrated to C++?
2024-09-25 21:17:06.535 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.as_tensor be migrated to C++?
2024-09-25 21:17:06.541 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 21:17:06.550 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 21:17:06.551 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 21:17:06.551 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
   5.46 Mbps / 12.3% =  44.27 Mbps | Tx: 2655.23 KB | 67.88 MB | 3.74 s=================================================== test session starts ===================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/user_dev/bbradel/tt-metal/python_env/bin/python
cachedir: .pytest_cache
rootdir: /proj_sw/user_dev/bbradel/tt-metal, configfile: pytest.ini
plugins: dash-2.15.0, xdist-3.6.1, timeout-2.2.0, anyio-4.4.0, split-0.8.2
timeout: 300.0s
timeout method: signal
timeout func_only: False
collected 1 item                                                                                                          

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test                  Device | INFO     | Opening user mode device driver
  10.77 Mbps / 12.5% =  86.08 Mbps | Tx: 2920.85 KB | 67.9 MB | 3.84 s2024-09-25 21:17:08.177 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
   5.22 Mbps / 14.1% =  36.88 Mbps | Tx: 3272.49 KB | 68.29 MB | 4.55 s                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
  13.30 Mbps / 10.5% = 126.76 Mbps | Tx: 13.68 MB | 140.4 MB | 16.94 s2024-09-25 21:17:21.273 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:48 - Passing=True
2024-09-25 21:17:21.273 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:49 - Output pcc=0.9999909687232241
PASSED                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
  13.30 Mbps / 10.5% = 126.76 Mbps | Tx: 13.68 MB | 140.42 MB | 17.04 s                  Metal | INFO     | Closing device 0
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | Disabling and clearing program cache on device 0

==================================================== warnings summary =====================================================
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
  tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:53: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
    def test_run_tilize_matmul_test(device):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================= PASSES ==========================================================
----------- generated xml file: /proj_sw/user_dev/bbradel/tt-metal/generated/test_reports/most_recent_tests.xml -----------
==================================================== slowest durations ====================================================
7.10s setup    tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
6.13s call     tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
0.10s teardown tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
================================================= short test summary info =================================================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
============================================== 1 passed, 1 warning in 13.36s ==============================================
   6.12 Mbps / 17.4% =  35.21 Mbps | Tx: 13.99 MB | 140.52 MB | 17.28 s                 Device | INFO     | Closing user mode device drivers
   6.12 Mbps / 17.4% =  35.21 Mbps | Tx: 13.99 MB | 140.62 MB | 17.39 s                 Always | WARNING  | For device 0, core 1,2 default frequency was used and its zones will be out of sync
   9.42 Mbps /  5.6% = 167.73 Mbps | Tx: 19.48 MB | 144.68 MB | 23.11 s
Frames: 2
Time span: 23.11 s
Zones: 1,280
Elapsed time: 29.83 s
Saving trace... done!
Trace size 32.34 MB (30.48% ratio)
2024-09-25 21:17:35.101 | INFO     | tracy:generate_report:153 - Host side ops time report generated at /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_ops_times.csv
2024-09-25 21:17:35.307 | INFO     | tracy:generate_report:164 - Host side ops data report generated at /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_ops_data.csv
2024-09-25 21:17:35.307 | INFO     | tt_metal.tools.profiler.process_ops_logs:import_tracy_op_logs:81 - Importing ops logs
2024-09-25 21:17:35.314 | INFO     | tt_metal.tools.profiler.process_ops_logs:get_device_op_data:161 - Getting device ops
2024-09-25 21:17:35.314 | INFO     | tt_metal.tools.profiler.process_ops_logs:append_device_data:183 - Appending device data
2024-09-25 21:17:35.315 | INFO     | tt_metal.tools.profiler.process_device_log:core_to_device_timeseries:360 - Importing Data For Device Number : 0
2024-09-25 21:17:35.342 | INFO     | tt_metal.tools.profiler.process_ops_logs:generate_reports:330 - OPs' perf analysis is finished! Generating reports ...
2024-09-25 21:17:35.342 | INFO     | tt_metal.tools.profiler.process_ops_logs:generate_reports:348 - Copying runtime artifacts
2024-09-25 21:17:35.669 | INFO     | tt_metal.tools.profiler.process_ops_logs:generate_reports:367 - Generating OPs CSV
2024-09-25 21:17:35.671 | INFO     | tt_metal.tools.profiler.process_ops_logs:generate_reports:545 - OPs csv generated at: /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/reports/2024_09_25_21_17_35/ops_perf_results_2024_09_25_21_17_35.csv

Definitely non deterministic behaviour. There may be some kind of timing issue.

bbradelTT commented 1 day ago

I added extra DPRINTs:

diff --git a/tt_metal/hw/inc/dataflow_api.h b/tt_metal/hw/inc/dataflow_api.h
index b46cec89b7..e8d7522fa4 100644
--- a/tt_metal/hw/inc/dataflow_api.h
+++ b/tt_metal/hw/inc/dataflow_api.h
@@ -27,6 +27,7 @@
 #include "third_party/umd/device/tt_silicon_driver_common.hpp"
 #include "debug/assert.h"
 #include "dev_msgs.h"
+#include "debug/dprint.h"

 extern uint8_t noc_index;
 extern uint32_t tt_l1_ptr *rta_l1_base;
@@ -268,7 +269,9 @@ void cb_push_back(const int32_t operand, const int32_t num_pages) {
     uint32_t num_words = num_pages * cb_interface[operand].fifo_page_size;

     volatile tt_reg_ptr uint32_t* pages_received_ptr = get_cb_tiles_received_ptr(operand);
+    DPRINT << "CB_PUSH_BACK " << operand << " " << num_pages << " " << pages_received_ptr[0] << ENDL();
     pages_received_ptr[0] += num_pages;
+    DPRINT << "CB_PUSH_BACK2 " << operand << " " << num_pages << " " << pages_received_ptr[0] << ENDL();

     cb_interface[operand].fifo_wr_ptr += num_words;

@@ -476,6 +479,7 @@ void cb_wait_front(int32_t operand, int32_t num_pages) {
     do {
         invalidate_l1_cache();
         pages_received = ((uint16_t)reg_read(pages_received_ptr)) - pages_acked;
+        DPRINT << "CB_WAIT_FRONT " << pages_acked << " " << (uint16_t)reg_read(pages_received_ptr) << " " << pages_received << ENDL();
     } while (pages_received < num_pages);
     WAYPOINT("CWFD");
 }

the problem is that reg_read(pages_received_ptr) is reading 0 instead of 1.

Good:

a 0 0 1
a1 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
cmd_write_packed
cmd_wait
 DISPATCH WAIT 1a4b0 count 1
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1 1
CB_WAIT_FRONT 0 1 0
CB_WAIT_FRONT 0 1 1
b 0 0 1
b1 0 1 1
c 0 0 1
cmd_write_paged is_dram: 1 
d 0 0 1
d1 0 1 1
d2 6854 6982
e 0 0 1
e1 1 1 0
e2 6854 6982
cmd_wait
 DISPATCH WAIT 1a4b0 count 1
cmd_write_paged is_dram: 1
write offset: 0 102944 0
cmd_write_packed
cmd_write_packed
cmd_wait
 DISPATCH BARRIER
 DISPATCH WAIT 1a4b0 count 1
cmd_write_packed_large
cmd_wait
 DISPATCH BARRIER
cmd_write_packed
cmd_wait
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1 1a  DISPATCH BARRIER
 DISPATCH WAIT 1a4b0 count 2

CB_PUSH_BACK 1 1 0
CB_PUSH_BACK2 1 1 1
0 0 1
a1 0 1 1
CB_WAIT_FRONT 0 1 1
b 0 0 1
...

Bad:

cmd_wait
 DISPATCH BARRIER
 DISPATCH WAIT 1dbd0 count 1
cmd_write_packed_large
cmd_wait
 x0
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1XX0
a 0 0 1  DISPATCH BARRIER
cmd_write_packed
 1
 x1
CB_PUSH_BACK 1 1 0
CB_PUSH_BACK2 1 1 1
Pushed itileA=0 itileB=0

XX5
XX6
XX7
XX8
XX9
AA0
AA1
XXA
16
a1 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
bbradelTT commented 1 day ago

Adding in PACK (DPRINT...) statements, instead of just DPRINT statements, in the math kernel shows that when the hang does occur, then cb_push_back is not called (in a reasonable time frame) in the math kernel. Therefore cb_wait_front seems to be doing the right thing.

This only consistently hangs when both tilize in the python test and matmul_tiles() in the math kernel ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm.cpp are called.

if