tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

ND WH watcher error when attempting to turn on watcher in all post-commit pipelines #6763

Open TT-billteng opened 7 months ago

TT-billteng commented 7 months ago

I'm trying to enable watcher on all non-perf pipelines so that device-side issues reported by watcher can be caught sooner.

On the branch where I enable watcher, I see this error when running the post-commit action:

https://github.com/tenstorrent-metal/tt-metal/actions/runs/8429687483/job/23084407648

2024-03-26T02:04:38.0322916Z tests/tt_eager/python_api_testing/unit_testing/misc/test_optimized_conv_v2.py::test_optimized_conv_v2[pack_l1-LoFi-activations_BFLOAT16-weights_BFLOAT8_B-8-128-128-28-28-3-3-1-1-1-1-True-True-False-False]                   Metal | INFO     | Initializing device 0
2024-03-26T02:04:38.0684146Z                   Metal | INFO     | AI CLK for device 0 is:   800 MHz
2024-03-26T02:04:38.0746331Z               LLRuntime | INFO     | Watcher log file: /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/generated/watcher/watcher.log
2024-03-26T02:04:38.0749693Z               LLRuntime | INFO     | Watcher attached device 0
2024-03-26T02:04:38.0751749Z               LLRuntime | INFO     | Watcher thread watching...
2024-03-26T02:04:38.1134412Z 2024-03-26 02:04:38.113 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_optimized_conv_v2:test_optimized_conv_v2:160 - Conv output shape - [8, 28, 28, 128]
2024-03-26T02:05:38.0751917Z               LLRuntime | INFO     | Watcher checking device 0
2024-03-26T02:05:38.0954477Z terminate called after throwing an instance of 'std::runtime_error'
2024-03-26T02:05:38.0956670Z   what():  Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
2024-03-26T02:05:38.0957520Z Fatal Python error: Aborted
2024-03-26T02:05:38.0965755Z 
2024-03-26T02:05:38.0993090Z Thread 0x00007f2d1214e740 (most recent call first):
2024-03-26T02:05:38.0995270Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tt_eager/tt_dnn/op_library/sliding_window_op_infra/tt_py_composite_conv.py", line 1104 in copy_output_from_device
2024-03-26T02:05:38.0997252Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/tt_eager/python_api_testing/unit_testing/misc/test_optimized_conv_v2.py", line 233 in test_optimized_conv_v2
2024-03-26T02:05:38.0998926Z   File "/home/ubuntu/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call

TT-billteng commented 7 months ago

@jliangTT I need some help figuring out who should own this bug, as I don't see a clear "owner" for this file.

TT-billteng commented 7 months ago

Another failed run: https://github.com/tenstorrent-metal/tt-metal/actions/runs/8461487472/job/23181436986
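
To check whether separate runs fail the same way, one option is to scan the downloaded job logs for the abort. The following is a minimal sketch, not part of tt-metal: the function name `find_watcher_aborts` and the regular expressions are assumptions based only on the log excerpt quoted in this issue, so other logs may need different patterns. It reports which test was running when the `ARC scratch` error appeared and how many seconds had passed since the watcher attached.

```python
import re
from datetime import datetime

# Assumption: each line starts with a GitHub Actions style UTC timestamp,
# e.g. 2024-03-26T02:05:38.0954477Z, as in the excerpt above.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z ?(.*)$")
# Assumption: pytest prints the test id at the start of a line.
TEST_RE = re.compile(r"^(tests/\S+::\S+)")
# Assumption: the abort message is the what() line mentioning "ARC scratch".
ERROR_RE = re.compile(r"what\(\):\s*(.*ARC scratch.*)$")


def find_watcher_aborts(lines):
    """Return one record per 'ARC scratch' abort found in the log lines."""
    aborts = []
    current_test = None
    attached_at = None
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # skip lines without a timestamp prefix
        # Whole-second resolution is enough for this comparison.
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        text = match.group(2)

        test = TEST_RE.match(text)
        if test:
            current_test = test.group(1)
        if "Watcher attached device" in text:
            attached_at = stamp

        error = ERROR_RE.search(text)
        if error:
            elapsed = None
            if attached_at is not None:
                elapsed = (stamp - attached_at).total_seconds()
            aborts.append(
                {
                    "test": current_test,
                    "error": error.group(1),
                    "seconds_after_attach": elapsed,
                }
            )
    return aborts


if __name__ == "__main__":
    import sys

    # Usage: python find_watcher_aborts.py <downloaded_job_log.txt>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for record in find_watcher_aborts(log):
            print(record)
```

Running this over the logs of both failed runs would show whether the abort hits the same test each time and whether it always coincides with a watcher check.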

jliangTT commented 7 months ago

tests/tt_eager/python_api_testing/unit_testing/misc/test_optimized_conv_v2.py

@tt-nshanker, is this test case related to the 2.0 development?