tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

N300 watcher error in stress test pipeline #14411

Open TT-billteng opened 1 month ago

TT-billteng commented 1 month ago

https://github.com/tenstorrent/tt-metal/actions/runs/11562146971/job/32183169130

2024-10-28T22:34:15.0908094Z tests/ttnn/unit_tests/operations/ccl/test_all_gather_N300_post_commit.py::test_all_gather_on_n300_post_commit[silicon_arch_name=wormhole_b0-silicon_arch_wormhole_b0=True-enable_async=True-num_iters=1-mem_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)-input_dtype=DataType.BFLOAT16-num_devices=2-num_links=1-output_shape=[1, 1, 32, 704]-dim=3-layout=Layout.TILE]                   Metal | INFO     | Initializing device 0. Program cache is NOT enabled
2024-10-28T22:34:15.0912073Z                   Metal | INFO     | AI CLK for device 0 is:   1000 MHz
2024-10-28T22:34:15.0913614Z                   Metal | INFO     | Initializing device 1. Program cache is NOT enabled
2024-10-28T22:34:15.0915132Z                   Metal | INFO     | AI CLK for device 1 is:   1000 MHz
2024-10-28T22:34:15.0997699Z               LLRuntime | INFO     | Watcher log file: /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/generated/watcher/watcher.log
2024-10-28T22:34:15.0999731Z               LLRuntime | INFO     | Watcher attached device 0
2024-10-28T22:34:15.1001598Z               LLRuntime | INFO     | Watcher server initialized, disabled features: None
2024-10-28T22:34:15.1010279Z                   Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
2024-10-28T22:34:15.1011894Z                   Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
2024-10-28T22:34:15.1656828Z               LLRuntime | INFO     | Watcher attached device 1
2024-10-28T22:34:15.9138363Z 2024-10-28 22:34:15.913 | DEBUG    | conftest:n300_mesh_device:268 - multidevice with 2 devices is created
2024-10-28T22:34:15.9168221Z 2024-10-28 22:34:15.916 | INFO     | tests.ttnn.unit_tests.operations.ccl.test_all_gather:run_all_gather_impl:142 - Using Async Mode for All Gather Op Dispatch
2024-10-28T22:34:15.9171049Z 2024-10-28 22:34:15.916 | INFO     | tests.ttnn.unit_tests.operations.ccl.test_all_gather:run_all_gather_impl:143 - Input shape: [1, 1, 32, 704]
2024-10-28T22:34:15.9173219Z 2024-10-28 22:34:15.916 | INFO     | tests.ttnn.unit_tests.operations.ccl.test_all_gather:run_all_gather_impl:144 - dim: 3
2024-10-28T22:34:15.9174985Z 2024-10-28 22:34:15.916 | INFO     | tests.ttnn.unit_tests.operations.ccl.test_all_gather:run_all_gather_impl:146 - Input shape: [1, 1, 32, 704]
2024-10-28T22:34:15.9176828Z 2024-10-28 22:34:15.916 | INFO     | tests.ttnn.unit_tests.operations.ccl.test_all_gather:run_all_gather_impl:147 - dim: 3
2024-10-28T22:35:15.1001638Z               LLRuntime | INFO     | Watcher checking device 1
2024-10-28T22:35:15.1025313Z                  Always | WARNING  | Watcher stopped the device due to tripped assert, see watcher log for more details
2024-10-28T22:35:15.1043703Z                  Always | WARNING  | Device 1 worker core(x= 3,y= 1) phys(x= 4,y= 3): brisc tripped an assert on line 94. Current kernel: ttnn/cpp/ttnn/operations/ccl/all_gather/device/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp. Note that file name reporting is not yet implemented, and the reported line number for the assert may be from a different file.
2024-10-28T22:35:15.1048257Z                  Always | INFO     | Last waypoint:    R,   W,   W,   W,   W 
2024-10-28T22:35:15.1050036Z                  Always | INFO     | While running kernels:
2024-10-28T22:35:15.1052621Z                  Always | INFO     |  brisc : ttnn/cpp/ttnn/operations/ccl/all_gather/device/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
2024-10-28T22:35:15.1055988Z                  Always | INFO     |  ncrisc: ttnn/cpp/ttnn/operations/ccl/all_gather/device/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
2024-10-28T22:35:15.1058405Z                  Always | INFO     |  triscs: blank
2024-10-28T22:35:15.1060227Z                  Always | FATAL    | Watcher detected tripped assert and stopped device.
2024-10-28T22:35:15.1067599Z libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tt_metal/impl/debug/watcher_device_reader.cpp:488: tt::exception
2024-10-28T22:35:15.1069683Z info:
2024-10-28T22:35:15.1070353Z Watcher detected tripped assert and stopped device.
2024-10-28T22:35:15.1071169Z backtrace:
2024-10-28T22:35:15.1072387Z  --- /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/_ttnn.so(+0x3cdc48) [0x7fa269513c48]
2024-10-28T22:35:15.1075164Z  --- tt::watcher::WatcherDeviceReader::DumpAssertStatus(CoreDescriptor&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, mailboxes_t const*)
2024-10-28T22:35:15.1077620Z  --- tt::watcher::WatcherDeviceReader::DumpCore(CoreDescriptor&, bool)
2024-10-28T22:35:15.1078846Z  --- tt::watcher::WatcherDeviceReader::Dump(_IO_FILE*)
2024-10-28T22:35:15.1080479Z  --- /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/lib/libtt_metal.so(+0x1d86c4) [0x7fa26909a6c4]
2024-10-28T22:35:15.1082545Z  --- /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/lib/libtt_metal.so(+0x1d7e3a) [0x7fa269099e3a]
2024-10-28T22:35:15.1084605Z  --- /home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/lib/libtt_metal.so(+0x1d9ce7) [0x7fa26909bce7]
2024-10-28T22:35:15.1086242Z  --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa2944a4609]
2024-10-28T22:35:15.1087447Z  --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa2945de353]
2024-10-28T22:35:15.1088139Z 
2024-10-28T22:35:15.1088567Z Fatal Python error: Aborted
2024-10-28T22:35:15.1089122Z 
2024-10-28T22:35:15.1091891Z Thread 0x00007fa29433d740 (most recent call first):
2024-10-28T22:35:15.1096192Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/unit_tests/operations/ccl/test_all_gather.py", line 163 in run_all_gather_impl
2024-10-28T22:35:15.1102070Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/unit_tests/operations/ccl/test_all_gather.py", line 201 in run_all_gather_on_n300_impl
2024-10-28T22:35:15.1108184Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/unit_tests/operations/ccl/test_all_gather_N300_post_commit.py", line 70 in test_all_gather_on_n300_post_commit
2024-10-28T22:35:15.1113973Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
2024-10-28T22:35:15.1119801Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
2024-10-28T22:35:15.1125818Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
2024-10-28T22:35:15.1131098Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
2024-10-28T22:35:15.1136277Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
2024-10-28T22:35:15.1141762Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
2024-10-28T22:35:15.1147205Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
2024-10-28T22:35:15.1152574Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
2024-10-28T22:35:15.1157782Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
2024-10-28T22:35:15.1163102Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
2024-10-28T22:35:15.1168334Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
2024-10-28T22:35:15.1184204Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
2024-10-28T22:35:15.1187401Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
2024-10-28T22:35:15.1189516Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
2024-10-28T22:35:15.1191717Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
2024-10-28T22:35:15.1193789Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
2024-10-28T22:35:15.1195807Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
2024-10-28T22:35:15.1198012Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
2024-10-28T22:35:15.1200057Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
2024-10-28T22:35:15.1202120Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
2024-10-28T22:35:15.1204345Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
2024-10-28T22:35:15.1206317Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
2024-10-28T22:35:15.1208236Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
2024-10-28T22:35:15.1210153Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
2024-10-28T22:35:15.1212196Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
2024-10-28T22:35:15.1214294Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
2024-10-28T22:35:15.1216306Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
2024-10-28T22:35:15.1218272Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
2024-10-28T22:35:15.1220257Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
2024-10-28T22:35:15.1222363Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
2024-10-28T22:35:15.1224161Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>
2024-10-28T22:35:15.4354856Z ./tests/scripts/run_python_api_unit_tests.sh: line 21: 817986 Aborted                 (core dumped) env pytest $TT_METAL_HOME/tests/ttnn/unit_tests -xvvv
SeanNijjar commented 1 month ago

Likely tripping on this assert in worker_edm_adapters.cpp:

    constexpr WorkerToEdmSender (
        ttnn::ccl::WorkerXY edm_worker_xy,
        std::size_t edm_buffer_base_addr,
        std::size_t num_buffers_per_channel,
        std::size_t edm_l1_sem_addr,
        std::size_t buffer_size_bytes,
        volatile uint32_t * const worker_sem_addr
    ) :
        edm_buffer_addr(get_noc_addr(edm_worker_xy.x, edm_worker_xy.y, edm_buffer_base_addr)),
        edm_semaphore_addr(get_noc_addr(edm_worker_xy.x, edm_worker_xy.y, edm_l1_sem_addr)),
        worker_sem_addr(worker_sem_addr),
        edm_buffer_base_addr(edm_buffer_base_addr),
        num_buffers_per_channel(num_buffers_per_channel),
        last_buffer_index(num_buffers_per_channel - 1),
        edm_l1_sem_addr(edm_l1_sem_addr),
        buffer_size_bytes(buffer_size_bytes),
        buffer_index(0)
    {
        ASSERT(buffer_size_bytes > 0);
    }

The buffer_size_bytes assert could be removed if we want to allow invoking workers that have no work. Technically we should allow this, although it would be suboptimal. Ideally we never produce this scenario because the host code is smart enough to avoid launching workers that have no work.
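
To illustrate the "host code is smart enough" option, here is a rough, standalone sketch of a host-side planner that never hands out a zero-sized buffer. It is not the actual tt-metal host code: `WorkerSlice`, `plan_worker_slices`, and the page-based split are hypothetical names and assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side description of one worker's share of the transfer.
struct WorkerSlice {
    std::size_t worker_index;       // position in the original worker list
    std::size_t buffer_size_bytes;  // bytes this worker would send
};

// Split total_bytes across at most max_workers workers in whole pages,
// skipping any worker that would end up with nothing to send.
// Assumption: work is divided in page_size_bytes units.
std::vector<WorkerSlice> plan_worker_slices(std::size_t total_bytes,
                                            std::size_t page_size_bytes,
                                            std::size_t max_workers) {
    std::vector<WorkerSlice> slices;
    if (page_size_bytes == 0 || max_workers == 0) {
        return slices;  // nothing sensible to plan
    }
    // Round up so a partial trailing page still gets a worker.
    const std::size_t num_pages =
        (total_bytes + page_size_bytes - 1) / page_size_bytes;
    const std::size_t base = num_pages / max_workers;
    const std::size_t extra = num_pages % max_workers;
    for (std::size_t w = 0; w < max_workers; ++w) {
        const std::size_t pages = base + (w < extra ? 1 : 0);
        if (pages == 0) {
            continue;  // no work for this worker: do not launch it
        }
        slices.push_back({w, pages * page_size_bytes});
        // Host-side mirror of the kernel's ASSERT(buffer_size_bytes > 0).
        assert(slices.back().buffer_size_bytes > 0);
    }
    return slices;
}
```

With a plan like this, the kernel-side assert can stay as a sanity check, because a worker is only created when it has at least one page to send.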