tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

DeviceParamFixture.DeviceInitializeAndTeardown ND hang on N300 #9594

Open TT-billteng opened 4 months ago

TT-billteng commented 4 months ago

This hang is non-deterministic (ND) and can be reproduced by running the DeviceParamFixture tests on repeat:

./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter='*DeviceParamFixture*' --gtest_repeat=-1 --gtest_break_on_failure

Repeating all tests (iteration 34) . . .

Note: Google Test filter = *DeviceParamFixture*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from DeviceInit/DeviceParamFixture
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
[       OK ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0 (25 ms)
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/1
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | Initializing device 1. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 1 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                  Metal | INFO     | Closing device 1
Repeating all tests (iteration 23) . . .

Note: Google Test filter = *DeviceParamFixture*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from DeviceInit/DeviceParamFixture
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
[       OK ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0 (26 ms)
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/1
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | Initializing device 1. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 1 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                  Metal | INFO     | Closing device 1
Repeating all tests (iteration 323) . . .

Note: Google Test filter = *DeviceParamFixture*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from DeviceInit/DeviceParamFixture
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
[       OK ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0 (25 ms)
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/1
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | Initializing device 1. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 1 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                  Metal | INFO     | Closing device 1
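
All three excerpts above stop in the same place: `DeviceInitializeAndTeardown/1` prints `[ RUN      ]` but never reaches `[       OK ]`, and the last line logged is `Closing device 1`. A small script can pull that out of a captured log. This is only a sketch that assumes the standard gtest `[ RUN ]` / `[ OK ]` / `[ FAILED ]` markers; `find_unfinished` is a hypothetical helper name, not part of tt-metal.

```python
import re

# gtest prints "[ RUN      ] <name>" when a test starts and
# "[       OK ] <name> (N ms)" or "[  FAILED  ] <name> (N ms)" when it ends.
RUN_RE = re.compile(r"^\[ RUN\s+\] (\S+)")
END_RE = re.compile(r"^\[\s+(?:OK|FAILED)\s+\] (\S+)")


def find_unfinished(log_text):
    """Return (tests that started but never finished, last non-empty line)."""
    started = []
    last_line = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        last_line = line
        match = RUN_RE.match(line)
        if match:
            started.append(match.group(1))
            continue
        match = END_RE.match(line)
        if match and match.group(1) in started:
            started.remove(match.group(1))
    return started, last_line


# Shortened excerpt of the log in this issue.
SAMPLE_LOG = """
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0
                  Metal | INFO     | Closing device 0
[       OK ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/0 (25 ms)
[ RUN      ] DeviceInit/DeviceParamFixture.DeviceInitializeAndTeardown/1
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Closing device 1
"""

unfinished, last_line = find_unfinished(SAMPLE_LOG)
print(unfinished)
print(last_line)
```

Run against the excerpt, this reports `DeviceInitializeAndTeardown/1` as the only unfinished test and `Closing device 1` as the last log line, which matches where every reproduction above stops.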

https://github.com/tenstorrent/tt-metal/actions/runs/9604648260/job/26490801278
https://github.com/tenstorrent/tt-metal/actions/runs/9570597099/job/26385904897
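
Because the symptom is a hang rather than a failure, `--gtest_break_on_failure` never fires and the repeat loop just sits there. One way to run the repro unattended is to launch the binary once per iteration under a timeout. The sketch below does that with Python's `subprocess`; `run_until_hang`, the 60 second timeout, and the iteration cap are all assumptions chosen for illustration, not values taken from the tt-metal test setup.

```python
import subprocess


def run_until_hang(cmd, timeout_s, max_iters):
    """Run cmd up to max_iters times.

    Returns (iteration, "hang") if a run exceeds timeout_s,
    (iteration, "failure") if a run exits non-zero,
    or None if every run finished cleanly.
    """
    for iteration in range(1, max_iters + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return iteration, "hang"
        if result.returncode != 0:
            return iteration, "failure"
    return None


# Example invocation for this issue (assumed timeout and iteration cap):
# run_until_hang(
#     ["./build/test/tt_metal/unit_tests_fast_dispatch",
#      "--gtest_filter=*DeviceParamFixture*"],
#     timeout_s=60,
#     max_iters=1000,
# )
```

The passing runs in the logs above take about 25 ms per test, so a timeout of tens of seconds should leave plenty of margin. Note that if the process is stuck in a way the kill signal cannot interrupt, the script itself may also block.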

smehtaTT commented 3 months ago

@aliuTT - can you help triage this please?

aliuTT commented 3 months ago

I wasn't able to repro this locally last week. I'm still trying to find a setup where it fails, and I may have to grab a cloud VM to debug this.