tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

test_optimized_conv.py::test_run_optimized_conv WH error with watcher enabled #4116

Closed: TT-billteng closed this issue 9 months ago

TT-billteng commented 10 months ago

Fast dispatch (FD) on Wormhole (WH):

TT_METAL_WATCHER=60 pytest tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py::test_run_optimized_conv

tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py::test_run_optimized_conv[64-32-2-2-1-1-1-1-0-00-16-True-True-False]                      Op | INFO     | Program Cache: enabled.
                  Metal | INFO     | Initializing device 0
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
              LLRuntime | INFO     | Watcher attached device 0
tt_tensor shape: [1, 1, 32, 64]
              LLRuntime | INFO     | Watcher checking device 0
                 Always | INFO     | Watcher stopped the device due to bad NOC unicast transaction
                 Always | INFO     | While running kernels:
                 Always | INFO     |  brisc : tt_eager/tt_dnn/op_library/conv/kernels/writer_tiled_out_mcast_sender_conv_weights_tiled_col_to_rm_blocks.cpp
                 Always | INFO     |  ncrisc: tt_eager/tt_dnn/op_library/conv/kernels/reader_conv1x1_activations_fast_for_col_major_conv_out_blocks.cpp
                 Always | INFO     |  triscs: tt_eager/tt_dnn/op_library/conv/kernels/conv_bmm_tilize_col_major_out_blocks.cpp
                 Always | INFO     | Last waypoint: NRTW,W,WNCD,WDSD,R 
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/llrt/watcher.cpp:226: tt::exception
info:
On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}
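
For triage, the final "On core ..." line of the watcher report above can be pulled apart mechanically. The following is only a sketch: the field names (`noc`, `risc`, `target`, `addr`, `length`) are my assumed reading of the line's layout and are not confirmed by the watcher source, and `parse_watcher_report` is a hypothetical helper, not part of tt-metal.

```python
import re

# Regex for the "On core ..." line exactly as it appears in the log above.
# ASSUMPTION: the fields are the reporting core, the NOC index, the RISC name,
# the (x, y) target of the transaction, an address, and a length in bytes.
REPORT_RE = re.compile(
    r"On core \(x=(?P<core_x>\d+),y=(?P<core_y>\d+)\): "
    r"noc(?P<noc>\d+):(?P<risc>\w+)"
    r"\{\((?P<dst_x>\d+),(?P<dst_y>\d+)\) "
    r"(?P<addr>0x[0-9a-fA-F]+), (?P<length>\d+)\}"
)


def parse_watcher_report(line):
    """Return the report fields as a dict, or None if the line does not match."""
    m = REPORT_RE.search(line)
    if m is None:
        return None
    return {
        "core": (int(m["core_x"]), int(m["core_y"])),
        "noc": int(m["noc"]),
        "risc": m["risc"],
        "target": (int(m["dst_x"]), int(m["dst_y"])),  # assumed meaning
        "addr": int(m["addr"], 16),                      # assumed meaning
        "length": int(m["length"]),                      # assumed meaning
    }


report = parse_watcher_report(
    "On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}"
)
print(report)
```

If that reading is right, the report says brisc on core (1,1) issued a NOC 0 transaction aimed at (0,48), which would be the first thing to check against the multicast writer kernel named in the log.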

TT-billteng commented 9 months ago

All green now on main https://github.com/tenstorrent-metal/tt-metal/actions/runs/7441497962