TT-billteng closed 8 months ago
Could be related to this: https://github.com/tenstorrent-metal/tt-metal/issues/3962
I'm seeing really simple tests (that always pass after a board reset) that fail after python_api slow dispatch tests are run.
I see these tests pass with
TT_METAL_WATCHER=60 pytest tests/tt_eager/python_api_testing/unit_testing/test_complex.py
on GS BM. Are the failures sporadic or reproducible?
Also, if the tests are only failing on WH, we can keep them running on GS and skip them only on WH using the decorators:
from models.utility_functions import skip_for_wormhole_b0
@skip_for_wormhole_b0()
def test_tutorials():
    ...  # test body unchanged
@muthutt please try to reproduce asap and update here today. Thanks.
Okay
reproducible on WH B0 with TT_METAL_WATCHER=60; the test does not fail outside the watcher infra e.g.
Need to debug with watcher
We narrowed this down to:
tests/tt_eager/python_api_testing/unit_testing/test_complex.py::test_level2_abs[bs0-dtype0-out_DRAM]
Metal | INFO | Initializing device 0
Device | INFO | Opening device driver
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
2023-12-01 00:41:33.732 | INFO | SiliconDriver - Detected 1 PCI device
2023-12-01 00:41:33.796 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-12-01 00:41:33.916 | INFO | SiliconDriver - Disable PCIE DMA
Metal | INFO | AI CLK for device 0 is: 1000 MHz
LLRuntime | INFO | Watcher log file: /home/muthu/tt-metal/built/watcher.log
LLRuntime | INFO | Watcher attached device 0
LLRuntime | INFO | Watcher thread watching...
LLRuntime | INFO | Watcher checking device 0
Always | INFO | Watcher stopped the device due to bad NOC unicast transaction
Always | INFO | While running kernels:
Always | INFO | brisc : tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
Always | INFO | ncrisc: tt_eager/tt_dnn/kernels/dataflow/reader_unary_stick_layout_split_rows_interleaved.cpp
Always | INFO | triscs: tt_eager/tt_dnn/kernels/compute/tilize.cpp
Always | INFO | Last waypoint: NWTW,W,W,W,W
terminate called after throwing an instance of 'std::runtime_error'
what(): TT_THROW @ tt_metal/llrt/watcher.cpp:226: tt::exception
info:
On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}
backtrace:
--- /home/muthu/tt-metal/build/lib/libtt_metal.so(+0x26d398) [0x7efe05b3d398]
--- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7efea9ef6df4]
--- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7efec7d1e609]
--- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7efec7e58133]
After a chat with @pgkeller: the watcher is throwing an error because of a bad NOC transaction. The key line is: On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}. This says that brisc on core (1,1), using noc0, tried to send data to core (0, 48), which doesn't exist. The kernels are listed above; it looks like the writer_unary_interleaved_start_id kernel (the one running on brisc) is at fault.
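To make the reading of that watcher line concrete, here is a minimal Python sketch that pulls the fields out of the `noc0:brisc{...}` text and checks the target coordinates against a grid. The helper names (`parse_noc_transaction`, `is_on_grid`) are hypothetical and not part of tt-metal. Treating the coordinates as decimal, and the last two fields as an address and a length, is an assumption based on how the line was interpreted above. The 10x12 grid is only an illustrative value; substitute the real grid of the board.

```python
import re
from typing import NamedTuple


class NocTransaction(NamedTuple):
    noc: int      # NOC index taken from "noc0"
    risc: str     # RISC name, e.g. "brisc"
    x: int        # target x coordinate (assumed decimal)
    y: int        # target y coordinate (assumed decimal)
    addr: int     # assumed to be the target address
    length: int   # assumed to be the transfer length


# Matches text shaped like: noc0:brisc{(00,48) 0x000001a0, 2048}
_PATTERN = re.compile(
    r"noc(?P<noc>\d+):(?P<risc>\w+)\{"
    r"\((?P<x>\d+),(?P<y>\d+)\)\s+"
    r"(?P<addr>0x[0-9a-fA-F]+),\s*(?P<length>\d+)\}"
)


def parse_noc_transaction(line: str) -> NocTransaction:
    """Extract the NOC transaction fields from a watcher info line."""
    match = _PATTERN.search(line)
    if match is None:
        raise ValueError(f"no NOC transaction found in: {line!r}")
    return NocTransaction(
        noc=int(match["noc"]),
        risc=match["risc"],
        x=int(match["x"]),
        y=int(match["y"]),
        addr=int(match["addr"], 16),
        length=int(match["length"]),
    )


def is_on_grid(txn: NocTransaction, grid_x: int, grid_y: int) -> bool:
    """Return True if the target coordinates fall inside a grid_x by grid_y grid."""
    return 0 <= txn.x < grid_x and 0 <= txn.y < grid_y


line = "On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}"
txn = parse_noc_transaction(line)
print(txn)
# Grid size below is illustrative only, not taken from the thread.
print("target on grid:", is_on_grid(txn, grid_x=10, grid_y=12))
```

With any grid whose y dimension is 48 or smaller, the check reports the target as off-grid, which matches the "core 0, 48 doesn't exist" reading.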
All green now on main https://github.com/tenstorrent-metal/tt-metal/actions/runs/7441497962
test_level1_is_real fails, and test_level1_is_imag also fails. There are more failures in complex which I disabled, but this shows up on FD and SD for WH.
Repro with:
TT_METAL_WATCHER=60 pytest tests/tt_eager/python_api_testing/unit_testing/test_complex.py
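If it helps to script the repro (for example to loop over individual test IDs), a small Python wrapper along these lines could work. This is only a sketch: `build_repro` is a hypothetical helper, the value 60 is copied from the command above, and it assumes pytest is installed in the interpreter's environment.

```python
import os
import subprocess
import sys

TEST_PATH = "tests/tt_eager/python_api_testing/unit_testing/test_complex.py"


def build_repro(test_id: str = TEST_PATH, watcher_value: str = "60"):
    """Return (cmd, env) equivalent to `TT_METAL_WATCHER=<value> pytest <test_id>`."""
    # Copy the current environment and add the watcher variable on top of it.
    env = dict(os.environ, TT_METAL_WATCHER=watcher_value)
    cmd = [sys.executable, "-m", "pytest", test_id]
    return cmd, env


def run_repro(test_id: str = TEST_PATH) -> int:
    """Run the repro in a fresh process and return pytest's exit code."""
    cmd, env = build_repro(test_id)
    return subprocess.run(cmd, env=env).returncode


if __name__ == "__main__":
    # Narrow to a single case by passing a full test ID instead of the file path.
    sys.exit(run_repro())
```

Running each test in its own process keeps a watcher abort in one test from taking down the rest of the loop.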