tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

test_complex.py WH B0 watcher errors #4083

Closed TT-billteng closed 8 months ago

TT-billteng commented 9 months ago

test_level1_is_real fails with:

tests/tt_eager/python_api_testing/unit_testing/test_complex.py::test_level1_is_real[dtype0-out_DRAM]                   Metal | INFO     | Initializing device 0
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
              LLRuntime | INFO     | Watcher attached device 0
              LLRuntime | INFO     | Watcher checking device 0
                 Always | INFO     | Watcher stopped the device due to bad NOC unicast transaction
                 Always | INFO     | While running kernels:
                 Always | INFO     |  brisc : tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
                 Always | INFO     |  ncrisc: tt_eager/tt_dnn/kernels/dataflow/reader_unary_stick_layout_split_rows_interleaved.cpp
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/llrt/watcher.cpp:226: tt::exception
info:
On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}

test_level1_is_imag also fails:


tests/tt_eager/python_api_testing/unit_testing/test_complex.py::test_level1_is_imag[dtype0-out_DRAM]                   Metal | INFO     | Initializing device 0
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
              LLRuntime | INFO     | Watcher attached device 0
              LLRuntime | INFO     | Watcher checking device 0
                 Always | INFO     | Watcher stopped the device due to bad NOC unicast transaction
                 Always | INFO     | While running kernels:
                 Always | INFO     |  brisc : tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
                 Always | INFO     |  ncrisc: tt_eager/tt_dnn/kernels/dataflow/reader_unary_stick_layout_split_rows_interleaved.cpp
                 Always | INFO     |  triscs: tt_eager/tt_dnn/kernels/compute/tilize.cpp
                 Always | INFO     | Last waypoint: NWTW,W,W,W,W 
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/llrt/watcher.cpp:226: tt::exception
info:
On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}

There are more failures in test_complex.py, which I disabled. This shows up on both FD and SD for WH.

Repro with:

TT_METAL_WATCHER=60 pytest tests/tt_eager/python_api_testing/unit_testing/test_complex.py

aliuTT commented 9 months ago

Could be related to this: https://github.com/tenstorrent-metal/tt-metal/issues/3962

I'm seeing really simple tests (that always pass after a board reset) that fail after python_api slow dispatch tests are run.

muthutt commented 9 months ago

I see these tests pass with

TT_METAL_WATCHER=60 pytest tests/tt_eager/python_api_testing/unit_testing/test_complex.py

on GS BM. Are the failures sporadic or reproducible?

muthutt commented 9 months ago

Also, if the tests are only failing on WH, we can keep them running on GS and skip them only on WH using the decorators:

from models.utility_functions import skip_for_wormhole_b0

@skip_for_wormhole_b0()
def test_tutorials():
    ...  # test body omitted in the original comment
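
To illustrate the idea behind an architecture-based skip decorator, here is a minimal standalone sketch. It is not the implementation in models.utility_functions: the names `is_wormhole_b0_sketch` and `skip_for_wormhole_b0_sketch` are hypothetical, and reading the architecture from an `ARCH_NAME` environment variable is an assumption made for this example.

```python
import os
import unittest


def is_wormhole_b0_sketch() -> bool:
    # Assumption: the target architecture is exposed via ARCH_NAME.
    return os.environ.get("ARCH_NAME", "") == "wormhole_b0"


def skip_for_wormhole_b0_sketch(reason: str = "not working for Wormhole B0"):
    # Hypothetical helper: returns a decorator that marks the test as
    # skipped when the architecture check is true at decoration time,
    # and leaves the test untouched otherwise.
    return unittest.skipIf(is_wormhole_b0_sketch(), reason)


@skip_for_wormhole_b0_sketch()
def test_example():
    # Placeholder test body for illustration only.
    return "ran"
```

The check runs once, when the decorator is applied, so the tests stay enabled on GS while being skipped on WH without editing each test body.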

jvasilje commented 9 months ago

@muthutt please try to reproduce asap and update here today. Thanks.

muthutt commented 9 months ago

Okay

On Thu, Nov 30, 2023 at 3:21 PM jvasilje wrote:

@muthutt please try to reproduce asap and update here today. Thanks.

muthutt commented 9 months ago

Reproducible on WH B0 with TT_METAL_WATCHER=60; the test does not fail when run without the watcher infra.

[image not preserved]

Need to debug with the watcher.

muthutt commented 9 months ago

We narrowed this down to:

tests/tt_eager/python_api_testing/unit_testing/test_complex.py::test_level2_abs[bs0-dtype0-out_DRAM]                   Metal | INFO     | Initializing device 0
                 Device | INFO     | Opening device driver
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
2023-12-01 00:41:33.732 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-12-01 00:41:33.796 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-12-01 00:41:33.916 | INFO     | SiliconDriver   - Disable PCIE DMA
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
              LLRuntime | INFO     | Watcher log file: /home/muthu/tt-metal/built/watcher.log
              LLRuntime | INFO     | Watcher attached device 0
              LLRuntime | INFO     | Watcher thread watching...
              LLRuntime | INFO     | Watcher checking device 0
                 Always | INFO     | Watcher stopped the device due to bad NOC unicast transaction
                 Always | INFO     | While running kernels:
                 Always | INFO     |  brisc : tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
                 Always | INFO     |  ncrisc: tt_eager/tt_dnn/kernels/dataflow/reader_unary_stick_layout_split_rows_interleaved.cpp
                 Always | INFO     |  triscs: tt_eager/tt_dnn/kernels/compute/tilize.cpp
                 Always | INFO     | Last waypoint: NWTW,W,W,W,W 
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/llrt/watcher.cpp:226: tt::exception
info:
On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}
backtrace:
 --- /home/muthu/tt-metal/build/lib/libtt_metal.so(+0x26d398) [0x7efe05b3d398]
 --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7efea9ef6df4]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7efec7d1e609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7efec7e58133]

After a chat with @pgkeller: the watcher is throwing an error because of a bad NOC transaction. The key line is:

On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}

This says that brisc, using noc0, tried to send data to core (0, 48), which doesn't exist. The kernels are listed above; since brisc is running writer_unary_interleaved_start_id, that looks like the kernel that erred.
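
When triaging several of these logs, it can help to pull the fields out of the info line programmatically. The sketch below parses the exact line quoted above. The thread only confirms the meaning of the coordinate pair in braces (the target core); treating the hex value as an address and the last number as a transfer length is an assumption, and `parse_watcher_noc_error` is a hypothetical helper, not part of tt-metal.

```python
import re
from typing import NamedTuple


class WatcherNocError(NamedTuple):
    core_x: int    # core that issued the transaction
    core_y: int
    noc: int       # NOC index, e.g. 0 for "noc0"
    risc: str      # issuing processor, e.g. "brisc"
    target_x: int  # target core coordinates reported in braces
    target_y: int
    addr: int      # assumption: target address
    length: int    # assumption: transfer length


_PATTERN = re.compile(
    r"On core \(x=(\d+),y=(\d+)\): "
    r"noc(\d+):(\w+)\{\((\d+),(\d+)\) 0x([0-9a-fA-F]+), (\d+)\}"
)


def parse_watcher_noc_error(line: str) -> WatcherNocError:
    # Hypothetical helper: extract the fields from one watcher info line.
    m = _PATTERN.search(line)
    if m is None:
        raise ValueError(f"not a watcher NOC error line: {line!r}")
    cx, cy, noc, risc, tx, ty, addr, length = m.groups()
    return WatcherNocError(
        int(cx), int(cy), int(noc), risc,
        int(tx), int(ty), int(addr, 16), int(length),
    )


err = parse_watcher_noc_error(
    "On core (x=1,y=1): noc0:brisc{(00,48) 0x000001a0, 2048}"
)
print(err.risc, (err.target_x, err.target_y))
```

For the line in this issue, the parsed target is (0, 48), matching the nonexistent core described above.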

TT-billteng commented 8 months ago

All green now on main https://github.com/tenstorrent-metal/tt-metal/actions/runs/7441497962