bbradelTT opened 2 weeks ago
The test uses MatmulMultiCoreProgramConfig:
pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
...
Op | DEBUG | Launching Device Operation: "Matmul"
Op | DEBUG | Program Hash: 0
Op | DEBUG | Program Cache Hit: false
Op | DEBUG | Attributes:
Op | DEBUG | program_config = std::nullopt
Op | DEBUG | bcast_batch = 1
Op | DEBUG | output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | compute_kernel_config = GrayskullComputeKernelConfig(math_fidelity=HiFi2,math_approx_mode=1)
Op | DEBUG | untilize_out = false
Op | DEBUG | user_core_coord = std::nullopt
Op | DEBUG | user_fused_activation = std::nullopt
Op | DEBUG | user_run_batched = false
Op | DEBUG | transpose_a = false
Op | DEBUG | transpose_b = false
Op | DEBUG | output_tile = std::nullopt
Op | DEBUG | Tensors Args:
Op | DEBUG | 0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))
Op | DEBUG | 1: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4))
Op | DEBUG |
Op | DEBUG | Auto generated program config: MatmulMultiCoreProgramConfig()
The config is identical to the first test in https://github.com/tenstorrent/tt-metal/issues/12604#issuecomment-2374653308, yet that one passes and this one fails.
From the comment:
2024-09-25 16:47:36.470 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:89 - matmul test passed with input shape [[1, 1, 32, 32], [1, 1, 32, 32]].
PASSED Metal | INFO | Closing device 0
Watcher output:
Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): CWFW, W, W, D, D rmsg:D1G|BNT smsg:DDGG k_ids:7|6|8
Device 0 worker core(x= 1,y= 0) phys(x= 2,y= 2): GW, W, W, W, W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
...
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[6]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_8bank_output_tiles_partitioned.cpp
k_id[7]: ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[8]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm.cpp
The test is not hanging; it is just taking 20+ minutes:
bbradel@yyzo-bh-02-special-bbradel-for-reservation-3967926:/proj_sw/user_dev/bbradel/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
2024-09-25 18:37:16.997 | DEBUG | ttnn:<module>:82 - Initial ttnn.CONFIG:
Config{cache_path=/home/bbradel/.cache/ttnn,model_cache_path=/home/bbradel/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
2024-09-25 18:37:17.237 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-09-25 18:37:17.240 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
2024-09-25 18:37:17.248 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 18:37:17.248 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.reshape be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.squeeze be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.from_torch be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.to_torch be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.to_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.from_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.deallocate be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.reallocate be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.load_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.dump_tensor be migrated to C++?
2024-09-25 18:37:17.249 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.as_tensor be migrated to C++?
2024-09-25 18:37:17.255 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 18:37:17.263 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 18:37:17.263 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 18:37:17.264 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
=================================================== test session starts ===================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/user_dev/bbradel/tt-metal/python_env/bin/python3
cachedir: .pytest_cache
rootdir: /proj_sw/user_dev/bbradel/tt-metal, configfile: pytest.ini
plugins: dash-2.15.0, xdist-3.6.1, timeout-2.2.0, anyio-4.4.0, split-0.8.2
timeout: 300.0s
timeout method: signal
timeout func_only: False
collected 1 item
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test Device | INFO | Opening user mode device driver
2024-09-25 18:37:17.489 | INFO | SiliconDriver - Detected 1 PCI device : [0]
Metal | INFO | Initializing device 0. Program cache is NOT enabled
Device | INFO | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
Metal | INFO | AI CLK for device 0 is: 800 MHz
LLRuntime | INFO | Watcher log file: /proj_sw/user_dev/bbradel/tt-metal/generated/watcher/watcher.log
LLRuntime | INFO | Watcher attached device 0
LLRuntime | INFO | Watcher server initialized, disabled features: None
LLRuntime | INFO | Watcher checking device 0
... (the same "Watcher checking device 0" line repeats for roughly 26 minutes while the test runs)
2024-09-25 19:03:20.064 | ERROR | models.utility_functions:comp_pcc:602 - One tensor is all zero
2024-09-25 19:03:20.064 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:48 - Passing=False
2024-09-25 19:03:20.064 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:49 - Output pcc=0.0
FAILED Metal | INFO | Closing device 0
LLRuntime | INFO | Watcher detached device 0
LLRuntime | INFO | Watcher thread stopped watching...
Metal | INFO | Disabling and clearing program cache on device 0
======================================================== FAILURES =========================================================
_______________________________________________ test_run_tilize_matmul_test _______________________________________________
device = <ttnn._ttnn.device.Device object at 0x7fce2e435d70>

    def test_run_tilize_matmul_test(device):
>       run_tilize_matmul_test(32, 32, 32, device)

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:54:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

M = 32, K = 32, N = 32, device = <ttnn._ttnn.device.Device object at 0x7fce2e435d70>

    def run_tilize_matmul_test(M, K, N, device):
        a_shape = [1, 1, M, K]
        b_shape = [1, 1, K, N]
        A = torch.randn(a_shape)
        B = torch.randn(b_shape) - 0.95
        a = ttnn.Tensor(
            A.flatten().tolist(),
            a_shape,
            ttnn.bfloat16,
            ttnn.ROW_MAJOR_LAYOUT,
            device,
        )
        a_t = ttnn.tilize(a)
        b_t = ttnn.Tensor(
            tilize_to_list(B),
            b_shape,
            ttnn.bfloat16,
            ttnn.TILE_LAYOUT,
            device,
        )
        t2 = ttnn.matmul(a_t, b_t)
        assert list(t2.get_legacy_shape()) == [1, 1, M, N]
        tt_host_rm = t2.cpu().to_torch()
        pyt_got_back = tt_host_rm.reshape((1, 1, M, N))
        pyt_got_back_rm = untilize(pyt_got_back)
        ref_bmm = torch.matmul(A.reshape(1, M, K), B.reshape(1, K, N))
        ref_bmm = ref_bmm.reshape(1, 1, M, N)
        passing_pcc, output_pcc = comp_pcc(ref_bmm, pyt_got_back_rm, 0.99)
        logger.debug(f"Passing={passing_pcc}")
        logger.debug(f"Output pcc={output_pcc}")
>       assert passing_pcc
E       assert False

tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:50: AssertionError
==================================================== warnings summary =====================================================
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:53: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
def test_run_tilize_matmul_test(device):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
----------- generated xml file: /proj_sw/user_dev/bbradel/tt-metal/generated/test_reports/most_recent_tests.xml -----------
==================================================== slowest durations ====================================================
1555.74s call tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
6.97s setup tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
0.00s teardown tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
================================================= short test summary info =================================================
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test - assert False
======================================== 1 failed, 1 warning in 1562.92s (0:26:02) ========================================
Device | INFO | Closing user mode device drivers
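For context, the comp_pcc check above compares the device output against a torch reference via Pearson correlation. A minimal sketch of that kind of check (illustrative only, not the actual models.utility_functions.comp_pcc implementation), showing why an all-zero device output is reported as "One tensor is all zero" with pcc=0.0:

```python
import numpy as np

def comp_pcc_sketch(expected, actual, threshold=0.99):
    """Illustrative PCC check: returns (passing, pcc).

    Mirrors the failure mode in the log above: if either tensor
    is all zeros, the correlation is undefined and reported as 0.0.
    """
    e = np.asarray(expected, dtype=np.float64).ravel()
    a = np.asarray(actual, dtype=np.float64).ravel()
    if not e.any() or not a.any():
        # corresponds to the "One tensor is all zero" error message
        return False, 0.0
    pcc = float(np.corrcoef(e, a)[0, 1])
    return pcc >= threshold, pcc

passing, pcc = comp_pcc_sketch([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(passing, pcc)  # the all-zero output fails with pcc=0.0
```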
I made some changes to try to profile the test, and this time the test passed in a few seconds:
(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ export TT_METAL_THREADCOUNT=16
(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ echo $TT_METAL_THREADCOUNT
16
(python_env) bbradel@yyzo-bh-03-special-bbradel-for-reservation-3967930:/proj_sw/user_dev/bbradel/tt-metal$ ./tt_metal/tools/profiler/profile_this.py -c "pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py"
2024-09-25 21:17:01.167 | WARNING | __main__:main:38 - profile_this.py is getting deprecated soon. Please use the tracy.py module with -r option to obtain op reports.
2024-09-25 21:17:01.167 | INFO | __main__:main:42 - profile_this.py is running pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py
2024-09-25 21:17:02.910 | INFO | __main__:main:89 - Using port 8086
2024-09-25 21:17:02.910 | INFO | tracy:run_report_setup:87 - Verifying tracy profiling tools
2024-09-25 21:17:02.967 | INFO | tracy:run_report_setup:108 - Capture command: ('/proj_sw/user_dev/bbradel/tt-metal/build/tools/profiler/bin/capture-release -o /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_profile_log_host.tracy -f -p 8086',)
2024-09-25 21:17:02.975 | INFO | __main__:main:146 - Test process started
Connecting to 127.0.0.1:8086...2024-09-25 21:17:04.511 | DEBUG | ttnn:<module>:82 - Initial ttnn.CONFIG:
Config{cache_path=/home/bbradel/.cache/ttnn,model_cache_path=/home/bbradel/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
Queue delay: 13 ns
Timer resolution: 0 ns
6.98 Mbps / 9.8% = 71.21 Mbps | Tx: 1548.91 KB | 66.94 MB | 2.21 s2024-09-25 21:17:06.519 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-09-25 21:17:06.522 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
2024-09-25 21:17:06.533 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.reshape be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.squeeze be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.from_torch be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.to_torch be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.to_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.from_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.deallocate be migrated to C++?
2024-09-25 21:17:06.534 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.reallocate be migrated to C++?
2024-09-25 21:17:06.535 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.load_tensor be migrated to C++?
2024-09-25 21:17:06.535 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.dump_tensor be migrated to C++?
2024-09-25 21:17:06.535 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.as_tensor be migrated to C++?
2024-09-25 21:17:06.541 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 21:17:06.550 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-25 21:17:06.551 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-25 21:17:06.551 | WARNING | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
5.46 Mbps / 12.3% = 44.27 Mbps | Tx: 2655.23 KB | 67.88 MB | 3.74 s=================================================== test session starts ===================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/user_dev/bbradel/tt-metal/python_env/bin/python
cachedir: .pytest_cache
rootdir: /proj_sw/user_dev/bbradel/tt-metal, configfile: pytest.ini
plugins: dash-2.15.0, xdist-3.6.1, timeout-2.2.0, anyio-4.4.0, split-0.8.2
timeout: 300.0s
timeout method: signal
timeout func_only: False
collected 1 item
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test Device | INFO | Opening user mode device driver
10.77 Mbps / 12.5% = 86.08 Mbps | Tx: 2920.85 KB | 67.9 MB | 3.84 s2024-09-25 21:17:08.177 | INFO | SiliconDriver - Detected 1 PCI device : [0]
5.22 Mbps / 14.1% = 36.88 Mbps | Tx: 3272.49 KB | 68.29 MB | 4.55 s Metal | INFO | Initializing device 0. Program cache is NOT enabled
Device | INFO | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
Metal | INFO | AI CLK for device 0 is: 800 MHz
13.30 Mbps / 10.5% = 126.76 Mbps | Tx: 13.68 MB | 140.4 MB | 16.94 s2024-09-25 21:17:21.273 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:48 - Passing=True
2024-09-25 21:17:21.273 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_tilize_matmul:run_tilize_matmul_test:49 - Output pcc=0.9999909687232241
PASSED Device | INFO | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
13.30 Mbps / 10.5% = 126.76 Mbps | Tx: 13.68 MB | 140.42 MB | 17.04 s Metal | INFO | Closing device 0
Device | INFO | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
Metal | INFO | Disabling and clearing program cache on device 0
==================================================== warnings summary =====================================================
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py:53: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
def test_run_tilize_matmul_test(device):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================= PASSES ==========================================================
----------- generated xml file: /proj_sw/user_dev/bbradel/tt-metal/generated/test_reports/most_recent_tests.xml -----------
==================================================== slowest durations ====================================================
7.10s setup tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
6.13s call tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
0.10s teardown tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
================================================= short test summary info =================================================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_tilize_matmul.py::test_run_tilize_matmul_test
============================================== 1 passed, 1 warning in 13.36s ==============================================
6.12 Mbps / 17.4% = 35.21 Mbps | Tx: 13.99 MB | 140.52 MB | 17.28 s Device | INFO | Closing user mode device drivers
6.12 Mbps / 17.4% = 35.21 Mbps | Tx: 13.99 MB | 140.62 MB | 17.39 s Always | WARNING | For device 0, core 1,2 default frequency was used and its zones will be out of sync
9.42 Mbps / 5.6% = 167.73 Mbps | Tx: 19.48 MB | 144.68 MB | 23.11 s
Frames: 2
Time span: 23.11 s
Zones: 1,280
Elapsed time: 29.83 s
Saving trace... done!
Trace size 32.34 MB (30.48% ratio)
2024-09-25 21:17:35.101 | INFO | tracy:generate_report:153 - Host side ops time report generated at /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_ops_times.csv
2024-09-25 21:17:35.307 | INFO | tracy:generate_report:164 - Host side ops data report generated at /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/.logs/tracy_ops_data.csv
2024-09-25 21:17:35.307 | INFO | tt_metal.tools.profiler.process_ops_logs:import_tracy_op_logs:81 - Importing ops logs
2024-09-25 21:17:35.314 | INFO | tt_metal.tools.profiler.process_ops_logs:get_device_op_data:161 - Getting device ops
2024-09-25 21:17:35.314 | INFO | tt_metal.tools.profiler.process_ops_logs:append_device_data:183 - Appending device data
2024-09-25 21:17:35.315 | INFO | tt_metal.tools.profiler.process_device_log:core_to_device_timeseries:360 - Importing Data For Device Number : 0
2024-09-25 21:17:35.342 | INFO | tt_metal.tools.profiler.process_ops_logs:generate_reports:330 - OPs' perf analysis is finished! Generating reports ...
2024-09-25 21:17:35.342 | INFO | tt_metal.tools.profiler.process_ops_logs:generate_reports:348 - Copying runtime artifacts
2024-09-25 21:17:35.669 | INFO | tt_metal.tools.profiler.process_ops_logs:generate_reports:367 - Generating OPs CSV
2024-09-25 21:17:35.671 | INFO | tt_metal.tools.profiler.process_ops_logs:generate_reports:545 - OPs csv generated at: /proj_sw/user_dev/bbradel/tt-metal/generated/profiler/reports/2024_09_25_21_17_35/ops_perf_results_2024_09_25_21_17_35.csv
This is definitely non-deterministic behaviour; there may be some kind of timing issue.
I added extra DPRINTs:
diff --git a/tt_metal/hw/inc/dataflow_api.h b/tt_metal/hw/inc/dataflow_api.h
index b46cec89b7..e8d7522fa4 100644
--- a/tt_metal/hw/inc/dataflow_api.h
+++ b/tt_metal/hw/inc/dataflow_api.h
@@ -27,6 +27,7 @@
#include "third_party/umd/device/tt_silicon_driver_common.hpp"
#include "debug/assert.h"
#include "dev_msgs.h"
+#include "debug/dprint.h"
extern uint8_t noc_index;
extern uint32_t tt_l1_ptr *rta_l1_base;
@@ -268,7 +269,9 @@ void cb_push_back(const int32_t operand, const int32_t num_pages) {
uint32_t num_words = num_pages * cb_interface[operand].fifo_page_size;
volatile tt_reg_ptr uint32_t* pages_received_ptr = get_cb_tiles_received_ptr(operand);
+ DPRINT << "CB_PUSH_BACK " << operand << " " << num_pages << " " << pages_received_ptr[0] << ENDL();
pages_received_ptr[0] += num_pages;
+ DPRINT << "CB_PUSH_BACK2 " << operand << " " << num_pages << " " << pages_received_ptr[0] << ENDL();
cb_interface[operand].fifo_wr_ptr += num_words;
@@ -476,6 +479,7 @@ void cb_wait_front(int32_t operand, int32_t num_pages) {
do {
invalidate_l1_cache();
pages_received = ((uint16_t)reg_read(pages_received_ptr)) - pages_acked;
+ DPRINT << "CB_WAIT_FRONT " << pages_acked << " " << (uint16_t)reg_read(pages_received_ptr) << " " << pages_received << ENDL();
} while (pages_received < num_pages);
WAYPOINT("CWFD");
}
The problem is that reg_read(pages_received_ptr) is reading 0 instead of 1.
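To make the handshake concrete: cb_push_back increments a pages_received counter on the producer side, and cb_wait_front spins until pages_received minus pages_acked reaches num_pages. A toy Python model of that protocol (an illustration only; the real code is in tt_metal/hw/inc/dataflow_api.h), showing how a reader that keeps observing a stale 0 never makes progress:

```python
class CircularBufferModel:
    """Toy model of the CB counter handshake in dataflow_api.h.

    The producer (cb_push_back) bumps pages_received; the consumer
    (cb_wait_front) spins until enough pages are visible. 'stale'
    simulates the symptom above: reg_read(pages_received_ptr)
    keeps returning 0 even after the producer incremented it.
    """

    def __init__(self, stale=False):
        self.pages_received = 0
        self.pages_acked = 0
        self.stale = stale

    def cb_push_back(self, num_pages):
        self.pages_received += num_pages

    def reg_read(self):
        # A stale read never observes the producer's increment.
        return 0 if self.stale else self.pages_received

    def cb_wait_front(self, num_pages, max_spins=1000):
        for spins in range(max_spins):
            if self.reg_read() - self.pages_acked >= num_pages:
                return spins  # progress: pages became visible
        return None  # would spin forever on the device

ok = CircularBufferModel()
ok.cb_push_back(1)
print(ok.cb_wait_front(1))    # -> 0: the page is visible immediately

hung = CircularBufferModel(stale=True)
hung.cb_push_back(1)
print(hung.cb_wait_front(1))  # -> None: the counter never appears to advance
```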
Good:
a 0 0 1
a1 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
cmd_write_packed
cmd_wait
DISPATCH WAIT 1a4b0 count 1
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1 1
CB_WAIT_FRONT 0 1 0
CB_WAIT_FRONT 0 1 1
b 0 0 1
b1 0 1 1
c 0 0 1
cmd_write_paged is_dram: 1
d 0 0 1
d1 0 1 1
d2 6854 6982
e 0 0 1
e1 1 1 0
e2 6854 6982
cmd_wait
DISPATCH WAIT 1a4b0 count 1
cmd_write_paged is_dram: 1
write offset: 0 102944 0
cmd_write_packed
cmd_write_packed
cmd_wait
DISPATCH BARRIER
DISPATCH WAIT 1a4b0 count 1
cmd_write_packed_large
cmd_wait
DISPATCH BARRIER
cmd_write_packed
cmd_wait
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1 1a DISPATCH BARRIER
DISPATCH WAIT 1a4b0 count 2
CB_PUSH_BACK 1 1 0
CB_PUSH_BACK2 1 1 1
0 0 1
a1 0 1 1
CB_WAIT_FRONT 0 1 1
b 0 0 1
...
Bad:
cmd_wait
DISPATCH BARRIER
DISPATCH WAIT 1dbd0 count 1
cmd_write_packed_large
cmd_wait
x0
CB_PUSH_BACK 0 1 0
CB_PUSH_BACK2 0 1XX0
a 0 0 1 DISPATCH BARRIER
cmd_write_packed
1
x1
CB_PUSH_BACK 1 1 0
CB_PUSH_BACK2 1 1 1
Pushed itileA=0 itileB=0
XX5
XX6
XX7
XX8
XX9
AA0
AA1
XXA
16
a1 0 0 0
CB_WAIT_FRONT 0 0 0
CB_WAIT_FRONT 0 0 0
Adding PACK(DPRINT ...) statements, instead of plain DPRINT statements, in the math kernel shows that when the hang occurs, cb_push_back is not called (in a reasonable time frame) in the math kernel. Therefore cb_wait_front seems to be doing the right thing.
The test only consistently hangs when both tilize in the Python test and matmul_tiles() in the math kernel ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm.cpp are called.
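For reference, tilize reorders a row-major tensor into 32x32 tiles, each split into four 16x16 faces, matching the Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4) metadata in the log above. A rough NumPy sketch of that layout transform, assuming top-left/top-right/bottom-left/bottom-right face ordering (an illustration, not the device implementation):

```python
import numpy as np

TILE, FACE = 32, 16

def tilize_sketch(x):
    """Flatten a (rows, cols) row-major array into tile order:
    tiles row-major over the grid; within each 32x32 tile, the four
    16x16 faces in assumed TL, TR, BL, BR order, each row-major."""
    rows, cols = x.shape
    out = []
    for tr in range(0, rows, TILE):
        for tc in range(0, cols, TILE):
            tile = x[tr:tr + TILE, tc:tc + TILE]
            for fr in (0, FACE):
                for fc in (0, FACE):
                    out.append(tile[fr:fr + FACE, fc:fc + FACE].ravel())
    return np.concatenate(out)

def untilize_sketch(flat, rows, cols):
    """Inverse of tilize_sketch: rebuild the row-major array."""
    x = np.empty((rows, cols), dtype=flat.dtype)
    i = 0
    for tr in range(0, rows, TILE):
        for tc in range(0, cols, TILE):
            for fr in (0, FACE):
                for fc in (0, FACE):
                    face = flat[i:i + FACE * FACE].reshape(FACE, FACE)
                    x[tr + fr:tr + fr + FACE, tc + fc:tc + fc + FACE] = face
                    i += FACE * FACE
    return x

# Round-trip check on a single 32x32 tile, the shape used in the test.
a = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
assert np.array_equal(untilize_sketch(tilize_sketch(a), 32, 32), a)
```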
For BH misc/test_tilize_matmul.py hanging