@tt-aho unpad was recently made multi-core -- is this specific config running single-core or multi-core?
@muthutt -- gdb is reporting that we're waiting on core {x = 7, y = 1}, so either this test is multi-core, or there's a bug in this test, or the kernel was launched on the wrong core.
(gdb) up
#4 0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
at /usr/include/c++/9/bits/char_traits.h:300
300 { __c1 = __c2; }
(gdb) info locals
chip_id = <optimized out>
virtual_dram_core = {<tt_xy_pair> = {x = 7, y = 1}, chip = 0}
(gdb) up
#5 0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
38 tt_cxy_pair(std::size_t ichip, tt_xy_pair xy_pair) : tt_xy_pair(xy_pair.x, xy_pair.y), chip(ichip) {}
(gdb) info locals
read_hex_vec = std::vector of length 1, capacity 1 = {0}
(gdb) up
#6 0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
429 run_mailbox_read_val = read_hex_vec_from_core(
(gdb) info locals
Python Exception <class 'gdb.error'> value has been optimized out:
run_mailbox_read_val =
cluster = @0x7ffc7b436e98: 0x64a0550
chip_id = @0x7ffc7b436e94: 0
core = @0x7ffc7b436f70: {x = 7, y = 1}
cluster = <optimized out>
chip_id = <optimized out>
core = <optimized out>
run_mailbox_read_val = <optimized out>
OK, definitely a multi-core test, on 56 cores; got this from gdb:
(gdb) up
#11 tt::tt_metal::Kernel::logical_cores (this=<optimized out>) at tt_metal/impl/kernels/kernel.cpp:33
33 for (auto core_range : this->core_range_set_.ranges()) {
(gdb) info locals
core_range = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
cores = std::set with 56 elements = {[0] = {x = 0, y = 0}, [1] = {x = 0, y = 1}, [2] = {x = 0, y = 2}, [3] = {x = 0, y = 3}, [4] = {x = 0, y = 4}, [5] = {x = 0, y = 5}, [6] = {x = 0,
y = 6}, [7] = {x = 1, y = 0}, [8] = {x = 1, y = 1}, [9] = {x = 1, y = 2}, [10] = {x = 1, y = 3}, [11] = {x = 1, y = 4}, [12] = {x = 1, y = 5}, [13] = {x = 1, y = 6}, [14] = {x = 2,
y = 0}, [15] = {x = 2, y = 1}, [16] = {x = 2, y = 2}, [17] = {x = 2, y = 3}, [18] = {x = 2, y = 4}, [19] = {x = 2, y = 5}, [20] = {x = 2, y = 6}, [21] = {x = 3, y = 0}, [22] = {x = 3,
y = 1}, [23] = {x = 3, y = 2}, [24] = {x = 3, y = 3}, [25] = {x = 3, y = 4}, [26] = {x = 3, y = 5}, [27] = {x = 3, y = 6}, [28] = {x = 4, y = 0}, [29] = {x = 4, y = 1}, [30] = {x = 4,
y = 2}, [31] = {x = 4, y = 3}, [32] = {x = 4, y = 4}, [33] = {x = 4, y = 5}, [34] = {x = 4, y = 6}, [35] = {x = 5, y = 0}, [36] = {x = 5, y = 1}, [37] = {x = 5, y = 2}, [38] = {x = 5,
y = 3}, [39] = {x = 5, y = 4}, [40] = {x = 5, y = 5}, [41] = {x = 5, y = 6}, [42] = {x = 6, y = 0}, [43] = {x = 6, y = 1}, [44] = {x = 6, y = 2}, [45] = {x = 6, y = 3}, [46] = {x = 6,
y = 4}, [47] = {x = 6, y = 5}, [48] = {x = 6, y = 6}, [49] = {x = 7, y = 0}, [50] = {x = 7, y = 1}, [51] = {x = 7, y = 2}, [52] = {x = 7, y = 3}, [53] = {x = 7, y = 4}, [54] = {x = 7,
y = 5}, [55] = {x = 7, y = 6}}
@kkwong10 and @abhullar-tt do we have any other multi-core tests in post-commit?
This is the first one I saw, and it hangs non-deterministically; it failed on the 6th iteration of post-commit.
Do we support multi-core tests now on WH? Or should I fix up and merge the single-core mode changes I was working on previously? (When enabled and ops request the grid size, it would return a 1x1 grid.) Then we could just force all the op WH tests to set this env var until multi-core is better tested, and selectively remove the flag later. It might also help increase WH test coverage, as the current WH tests only use specific shapes that trigger single-core (most are just single-tile tests, I think), but with the flag set you should be able to run all the original unit test shapes unless there is some other bug.
I think it would be great to have the ability to force single-core. We'd still want to run multi-core -- as of now there's no reason it shouldn't work -- but there could be a bug, and we need to de-couple single-core debug from multi-core debug. We should do progressive WH testing.
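Not the actual change -- just a minimal standalone sketch of what the env-var override described above could look like (the variable name TT_METAL_SINGLE_CORE_MODE and the helper below are hypothetical):
// Hypothetical sketch: when the env var is set, ops see a 1x1 grid instead of the full grid.
#include <cstdlib>
#include <cstddef>

struct GridSize { std::size_t x; std::size_t y; };

GridSize effective_grid_size(GridSize full_grid) {
    if (std::getenv("TT_METAL_SINGLE_CORE_MODE") != nullptr) {
        return {1, 1};   // force single-core for WH debug
    }
    return full_grid;    // normal multi-core path
}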
Another hang, now after the 8th iteration of post-commit. It looks to be the same configuration.
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 09:37:50.214 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-16 09:37:50.268 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 09:37:50.294 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-16 09:37:50.775 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 09:37:50.776 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
All of the c++ unit tests are single core
Ivan mentioned that physical coords + translation tables enabled should work on device but BBE hit an issue with this config. Could we be having the same problems since we aren't using the HW translated coordinates?
Perhaps translation is the issue -- or this is just a slow dispatch bug? If it were a translation issue, would this test work at all? It hangs only on the 8th iteration of post-commit.
Based on what Ivan was saying, using physical coords is fine: they go through the same HW table in front of the router, the physical range of coords maps 1:1 in the table, and the router works with physical coords anyway.
We should investigate this more before switching to translated.
Btw, GS + slow dispatch is also hanging (a different single-core test). I talked to @DrJessop, and we're going to do a straight fast dispatch port to WH, to rule out slow dispatch issues.
I was able to repro the unpad hang by running it by itself. It hung on the 122nd iteration. The two main suspects are a slow dispatch bug and harvesting. I will dig deeper to see which core is hanging -- and whether there's a pattern to it.
2023-09-16 19:15:30.494 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:31.636 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-16 19:15:31.637 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:31.637 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - unpad test passed with input shape [[5, 5, 50, 50]].
PASSED Op | INFO | Program Cache: disabled and cleared.
Metal | INFO | Closing device 0
=========================================================================================================== PASSES ===========================================================================================================
================================================================================================== short test summary info ===================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]
===================================================================================================== 1 passed in 3.12s ======================================================================================================
Metal | INFO | Closing device driver
==================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 1 item
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 19:15:34.737 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-16 19:15:34.799 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 19:15:34.825 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-16 19:15:35.260 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 19:15:35.261 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:36.425 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-16 19:15:36.425 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:36.425 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - unpad test passed with input shape [[5, 5, 50, 50]].
PASSED Op | INFO | Program Cache: disabled and cleared.
Metal | INFO | Closing device 0
=========================================================================================================== PASSES ===========================================================================================================
================================================================================================== short test summary info ===================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]
===================================================================================================== 1 passed in 3.12s ======================================================================================================
Metal | INFO | Closing device driver
==================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 1 item
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 19:15:39.517 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-16 19:15:39.574 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 19:15:39.599 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-16 19:15:40.024 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 19:15:40.025 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
One interesting thing I caught in top: the slow dispatch test that is stuck polling the device is reported as using 100.3% of a CPU core. I wonder if the polling thread is being put to sleep and also moved around cores, or if this is just math noise/error from the OS/top.
@pgkeller, just wondering if you've seen something like this.
Or is the 100.3% because we have 79 threads, and some of them occasionally wake up and use up the CPU?
There's probably a high chance the polling thread is being moved across cores; I wonder if that's interfering with polling of the device.
@TT-billteng I think it's worth a try running slow dispatch with 1 thread.
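One quick (hypothetical) way to test the migration theory, separate from the 1-thread run: pin the polling thread to a single CPU before entering the poll loop. A minimal Linux-only sketch, not part of the current code:
// Linux-only sketch: pin the calling (polling) thread to one CPU core to rule out
// scheduler migration as a factor; the CPU index is arbitrary.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return false;
    }
    return true;
}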
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f1e5390c740 (LWP 1155321) "pytest" 0x00007f1e35d7e4ba in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
2 Thread 0x7f1e32d1e700 (LWP 1155323) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354582e0 <thread_status+96>) at ../sysdeps/nptl/futex-internal.h:183
3 Thread 0x7f1e3251d700 (LWP 1155324) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458360 <thread_status+224>) at ../sysdeps/nptl/futex-internal.h:183
4 Thread 0x7f1e31d1c700 (LWP 1155325) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354583e0 <thread_status+352>) at ../sysdeps/nptl/futex-internal.h:183
5 Thread 0x7f1e3151b700 (LWP 1155326) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458460 <thread_status+480>) at ../sysdeps/nptl/futex-internal.h:183
6 Thread 0x7f1e30d1a700 (LWP 1155327) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354584e0 <thread_status+608>) at ../sysdeps/nptl/futex-internal.h:183
7 Thread 0x7f1e28519700 (LWP 1155328) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458560 <thread_status+736>) at ../sysdeps/nptl/futex-internal.h:183
8 Thread 0x7f1e27d18700 (LWP 1155329) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354585e0 <thread_status+864>) at ../sysdeps/nptl/futex-internal.h:183
9 Thread 0x7f1e27517700 (LWP 1155330) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458660 <thread_status+992>) at ../sysdeps/nptl/futex-internal.h:183
10 Thread 0x7f1e24d16700 (LWP 1155331) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354586e0 <thread_status+1120>) at ../sysdeps/nptl/futex-internal.h:183
11 Thread 0x7f1e24515700 (LWP 1155332) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458760 <thread_status+1248>) at ../sysdeps/nptl/futex-internal.h:183
12 Thread 0x7f1e1fd14700 (LWP 1155333) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354587e0 <thread_status+1376>) at ../sysdeps/nptl/futex-internal.h:183
13 Thread 0x7f1e19513700 (LWP 1155334) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458860 <thread_status+1504>) at ../sysdeps/nptl/futex-internal.h:183
14 Thread 0x7f1e16d12700 (LWP 1155335) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354588e0 <thread_status+1632>) at ../sysdeps/nptl/futex-internal.h:183
15 Thread 0x7f1e14511700 (LWP 1155336) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458960 <thread_status+1760>) at ../sysdeps/nptl/futex-internal.h:183
16 Thread 0x7f1e13d10700 (LWP 1155337) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354589e0 <thread_status+1888>) at ../sysdeps/nptl/futex-internal.h:183
17 Thread 0x7f1e1350f700 (LWP 1155338) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458a60 <thread_status+2016>) at ../sysdeps/nptl/futex-internal.h:183
18 Thread 0x7f1e10d0e700 (LWP 1155339) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ae0 <thread_status+2144>) at ../sysdeps/nptl/futex-internal.h:183
19 Thread 0x7f1e1050d700 (LWP 1155340) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458b60 <thread_status+2272>) at ../sysdeps/nptl/futex-internal.h:183
20 Thread 0x7f1e0fd0c700 (LWP 1155341) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458be0 <thread_status+2400>) at ../sysdeps/nptl/futex-internal.h:183
21 Thread 0x7f1e0f50b700 (LWP 1155342) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458c60 <thread_status+2528>) at ../sysdeps/nptl/futex-internal.h:183
22 Thread 0x7f1e0ed0a700 (LWP 1155343) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ce0 <thread_status+2656>) at ../sysdeps/nptl/futex-internal.h:183
23 Thread 0x7f1e0e509700 (LWP 1155344) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458d60 <thread_status+2784>) at ../sysdeps/nptl/futex-internal.h:183
24 Thread 0x7f1e0dd08700 (LWP 1155345) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458de0 <thread_status+2912>) at ../sysdeps/nptl/futex-internal.h:183
25 Thread 0x7f1e0d507700 (LWP 1155346) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458e60 <thread_status+3040>) at ../sysdeps/nptl/futex-internal.h:183
26 Thread 0x7f1e0cd06700 (LWP 1155347) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ee0 <thread_status+3168>) at ../sysdeps/nptl/futex-internal.h:183
27 Thread 0x7f1e0c505700 (LWP 1155348) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458f60 <thread_status+3296>) at ../sysdeps/nptl/futex-internal.h:183
28 Thread 0x7f1e0bd04700 (LWP 1155349) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458fe0 <thread_status+3424>) at ../sysdeps/nptl/futex-internal.h:183
29 Thread 0x7f1e0b503700 (LWP 1155350) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35459060 <thread_status+3552>) at ../sysdeps/nptl/futex-internal.h:183
30 Thread 0x7f1e0ad02700 (LWP 1155351) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354590e0 <thread_status+3680>) at ../sysdeps/nptl/futex-internal.h:183
31 Thread 0x7f1e0a501700 (LWP 1155352) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35459160 <thread_status+3808>) at ../sysdeps/nptl/futex-internal.h:183
32 Thread 0x7f1e09d00700 (LWP 1155353) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354591e0 <thread_status+3936>) at ../sysdeps/nptl/futex-internal.h:183
33 Thread 0x7f1dbbfb4700 (LWP 1155548) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
34 Thread 0x7f1dbc7b5700 (LWP 1155549) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
35 Thread 0x7f1dbcfb6700 (LWP 1155550) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
36 Thread 0x7f1d3bfff700 (LWP 1155551) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
37 Thread 0x7f1dbb7b3700 (LWP 1155552) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
38 Thread 0x7f1dbafb2700 (LWP 1155553) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
39 Thread 0x7f1dba7b1700 (LWP 1155554) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
40 Thread 0x7f1db9fb0700 (LWP 1155555) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
41 Thread 0x7f1db97af700 (LWP 1155556) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
42 Thread 0x7f1db8fae700 (LWP 1155557) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
43 Thread 0x7f1db3fff700 (LWP 1155558) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
44 Thread 0x7f1db37fe700 (LWP 1155559) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
45 Thread 0x7f1db2ffd700 (LWP 1155560) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
46 Thread 0x7f1db27fc700 (LWP 1155561) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
47 Thread 0x7f1db1ffb700 (LWP 1155562) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
48 Thread 0x7f1db17fa700 (LWP 1155760) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb1a8) at ../sysdeps/nptl/futex-internal.h:183
49 Thread 0x7f1db0ff9700 (LWP 1155761) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb218) at ../sysdeps/nptl/futex-internal.h:183
50 Thread 0x7f1d08ff9700 (LWP 1155762) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb288) at ../sysdeps/nptl/futex-internal.h:183
51 Thread 0x7f1d0bfff700 (LWP 1155763) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb2f8) at ../sysdeps/nptl/futex-internal.h:183
52 Thread 0x7f1d3b7fe700 (LWP 1155764) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb368) at ../sysdeps/nptl/futex-internal.h:183
53 Thread 0x7f1d3affd700 (LWP 1155765) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb3d8) at ../sysdeps/nptl/futex-internal.h:183
54 Thread 0x7f1d3a7fc700 (LWP 1155766) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb448) at ../sysdeps/nptl/futex-internal.h:183
55 Thread 0x7f1d39ffb700 (LWP 1155767) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb4b8) at ../sysdeps/nptl/futex-internal.h:183
56 Thread 0x7f1d397fa700 (LWP 1155768) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb528) at ../sysdeps/nptl/futex-internal.h:183
57 Thread 0x7f1d38ff9700 (LWP 1155769) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb598) at ../sysdeps/nptl/futex-internal.h:183
58 Thread 0x7f1d13fff700 (LWP 1155770) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb608) at ../sysdeps/nptl/futex-internal.h:183
59 Thread 0x7f1d137fe700 (LWP 1155771) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb678) at ../sysdeps/nptl/futex-internal.h:183
60 Thread 0x7f1d12ffd700 (LWP 1155772) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb6e8) at ../sysdeps/nptl/futex-internal.h:183
61 Thread 0x7f1d127fc700 (LWP 1155773) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb758) at ../sysdeps/nptl/futex-internal.h:183
62 Thread 0x7f1d11ffb700 (LWP 1155774) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb7c8) at ../sysdeps/nptl/futex-internal.h:183
63 Thread 0x7f1d117fa700 (LWP 1155775) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb838) at ../sysdeps/nptl/futex-internal.h:183
64 Thread 0x7f1d10ff9700 (LWP 1155776) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb8a8) at ../sysdeps/nptl/futex-internal.h:183
65 Thread 0x7f1d0b7fe700 (LWP 1155777) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb918) at ../sysdeps/nptl/futex-internal.h:183
66 Thread 0x7f1d0affd700 (LWP 1155778) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb988) at ../sysdeps/nptl/futex-internal.h:183
67 Thread 0x7f1d0a7fc700 (LWP 1155779) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb9f8) at ../sysdeps/nptl/futex-internal.h:183
68 Thread 0x7f1d09ffb700 (LWP 1155780) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfba68) at ../sysdeps/nptl/futex-internal.h:183
69 Thread 0x7f1d097fa700 (LWP 1155781) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbad8) at ../sysdeps/nptl/futex-internal.h:183
--Type <RET> for more, q to quit, c to continue without paging-- c
70 Thread 0x7f1d03fff700 (LWP 1155782) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbb48) at ../sysdeps/nptl/futex-internal.h:183
71 Thread 0x7f1d037fe700 (LWP 1155783) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbbb8) at ../sysdeps/nptl/futex-internal.h:183
72 Thread 0x7f1d02ffd700 (LWP 1155784) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbc28) at ../sysdeps/nptl/futex-internal.h:183
73 Thread 0x7f1d027fc700 (LWP 1155785) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbc98) at ../sysdeps/nptl/futex-internal.h:183
74 Thread 0x7f1d01ffb700 (LWP 1155786) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbd08) at ../sysdeps/nptl/futex-internal.h:183
75 Thread 0x7f1d017fa700 (LWP 1155787) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbd78) at ../sysdeps/nptl/futex-internal.h:183
76 Thread 0x7f1d00ff9700 (LWP 1155788) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbde8) at ../sysdeps/nptl/futex-internal.h:183
77 Thread 0x7f1cebfff700 (LWP 1155789) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbe5c) at ../sysdeps/nptl/futex-internal.h:183
78 Thread 0x7f1ceb7fe700 (LWP 1155790) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbecc) at ../sysdeps/nptl/futex-internal.h:183
79 Thread 0x7f1ceaffd700 (LWP 1155791) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbf38) at ../sysdeps/nptl/futex-internal.h:183
(gdb)
Using the new polling scheme I got this data: the entire row y=2 (8 cores) hangs on a 56-core unpad, so this is very likely a harvesting bug.
It's a non-deterministic hang, as expected from a harvesting bug; this time it hangs on the 144th iteration of the test.
This PR has the new polling scheme: https://github.com/tenstorrent-metal/tt-metal/pull/2696 We need to review our harvesting.
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
The cores are polled in the order shown below, as they happen to sit in the std::set / RB tree. This actually helped, because we effectively poll in a "random" order, which rules out polling order as a suspect.
Out of this random order, row y=2 emerges as the hanging row.
Not done cores: (x=7,y=4) (x=5,y=5) (x=3,y=6) (x=6,y=4) (x=4,y=5) (x=2,y=6) (x=7,y=5) (x=5,y=4) (x=1,y=6) (x=7,y=6) (x=3,y=4) (x=1,y=5) (x=5,y=6) (x=3,y=5) (x=1,y=4) (x=7,y=0) (x=5,y=1) (x=3,y=2) (x=1,y=3) (x=7,y=1) (x=5,y=0) (x=3,y=3) (x=1,y=2) (x=7,y=2) (x=5,y=3) (x=3,y=0) (x=1,y=1) (x=7,y=3) (x=5,y=2) (x=3,y=1) (x=1,y=0) (x=6,y=5) (x=4,y=4) (x=0,y=6) (x=6,y=6) (x=2,y=4) (x=0,y=5) (x=4,y=6) (x=2,y=5) (x=0,y=4) (x=6,y=0) (x=4,y=1) (x=2,y=2) (x=0,y=3) (x=6,y=1) (x=4,y=0) (x=2,y=3) (x=0,y=2) (x=6,y=2) (x=4,y=3) (x=2,y=0) (x=0,y=1) (x=6,y=3) (x=4,y=2) (x=2,y=1) (x=0,y=0)
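For reference, roughly what the polling boils down to -- a paraphrased sketch, not the actual code from PR #2696; read_mailbox stands in for read_hex_vec_from_core, and done_value for whatever sentinel marks completion:
#include <cstdint>
#include <cstdio>
#include <functional>
#include <set>
#include <utility>

using Core = std::pair<std::size_t, std::size_t>;   // (x, y), kept in a std::set / RB tree

void poll_until_done(std::set<Core> not_done,
                     const std::function<uint32_t(const Core&)>& read_mailbox,
                     uint32_t done_value) {
    while (!not_done.empty()) {
        std::printf("Not done cores: ");
        for (auto it = not_done.begin(); it != not_done.end();) {
            if (read_mailbox(*it) == done_value) {
                it = not_done.erase(it);              // this core has finished
            } else {
                std::printf("(x=%zu,y=%zu) ", it->first, it->second);
                ++it;
            }
        }
        std::printf("\n");
    }
}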
was able to repro the same hang again, on the 71st iteration:
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
great find!
Same hang signature on 2 different N300 machines: t3005 and t3002.
Was able to repro the exact same hang on t3002; it hung on the 832nd iteration of the test. It's interesting that the exact same "logical row" hangs. Will dig into the physical coordinates next.
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
On t3005:
(gdb) print core
$8 = (const tt_xy_pair &) @0x5dc2788: {x = 5, y = 2}
(gdb) call device->worker_core_from_logical_core(core)
$9 = {x = 7, y = 3}
On t3002 (same mapping):
hanging_core, logical = (x=5,y=2), routing = (x=7,y=3)
As per @abhullar-tt's suggestion, I added prints to the row extraction function.
Based on the prints and this comment from Aditya:
// For Wormhole, we always remove the last few rows in the SOC descriptor in case of harvesting
on N300 (2-row harvesting) UMD will just remove the last 2 rows (10, 11) from the SOC descriptor, which means the list of worker cores in the SOC descriptor from UMD's perspective is "virtual", i.e. it won't take out the actually harvested rows. The actually harvested rows do seem to be removed on GS; on WH the removal is virtual.
See the examples on t3002 and t3005: UMD just removes rows 10/11, but different rows are actually harvested.
t3002:
Rows to remove: (extracted = 9, what_UMD_removes = 11) (extracted = 11, what_UMD_removes = 10)
t3005:
worker_grid_rows = 11
Rows to remove: (extracted = 8, what_UMD_removes = 11) (extracted = 11, what_UMD_removes = 10)
std::vector<int> tt_SiliconDevice::extract_rows_to_remove(const tt::ARCH &arch, const int worker_grid_rows, const int harvested_rows) {
    // Check if harvesting config is legal for GS and WH
    tt_device_logger::log_assert(!((harvested_rows & 1) || (harvested_rows & 64) || (harvested_rows & 0xFFFFF000)), "For grayskull and wormhole, only rows 1-5 and 7-11 can be harvested");
    std::vector<int> row_coordinates_to_remove;
    int row_coordinate = 0;
    int tmp = harvested_rows;
    while (tmp) {
        if (tmp & 1)
            row_coordinates_to_remove.push_back(row_coordinate);
        tmp = tmp >> 1;
        row_coordinate++;
    }
    std::cerr << "Rows to remove: ";
    if (arch == tt::ARCH::WORMHOLE || arch == tt::ARCH::WORMHOLE_B0) {
        // For Wormhole, we always remove the last few rows in the SOC descriptor in case of harvesting
        for (int i = 0; i < row_coordinates_to_remove.size(); i++) {
            std::cout << "(extracted = " << row_coordinates_to_remove[i] << ", ";
            row_coordinates_to_remove[i] = worker_grid_rows - i;
            std::cout << "what_UMD_removes = " << row_coordinates_to_remove[i] << ") ";
        }
    }
    std::cerr << std::endl;
    return row_coordinates_to_remove;
}
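To see the Wormhole branch in isolation, here is a trimmed-down standalone repro of just that loop. The harvesting mask below is a guess that reproduces the t3005 print (the log only shows the extracted rows 8 and 11):
// Standalone repro of the WH behavior: whatever rows are extracted from the mask,
// the loop overwrites them with the last rows of the worker grid (11, 10, ...).
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int worker_grid_rows = 11;
    const int harvested_rows = (1 << 8) | (1 << 11);    // assumed mask: physical rows 8 and 11 harvested

    std::vector<int> rows;
    for (int row = 0, tmp = harvested_rows; tmp; tmp >>= 1, ++row)
        if (tmp & 1) rows.push_back(row);                // extracted: 8, 11

    for (std::size_t i = 0; i < rows.size(); i++) {
        std::printf("(extracted = %d, ", rows[i]);
        rows[i] = worker_grid_rows - static_cast<int>(i);    // WH override: 11, then 10
        std::printf("what_UMD_removes = %d) ", rows[i]);
    }
    std::printf("\n");   // prints: (extracted = 8, what_UMD_removes = 11) (extracted = 11, what_UMD_removes = 10)
    return 0;
}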
From wormhole_b0_8x10.yaml (the same is in wormhole_b0_80.yaml): UMD will always remove the last two rows.
functional_workers:
[
1-1, 2-1, 3-1, 4-1, 6-1, 7-1, 8-1, 9-1,
1-2, 2-2, 3-2, 4-2, 6-2, 7-2, 8-2, 9-2,
1-3, 2-3, 3-3, 4-3, 6-3, 7-3, 8-3, 9-3,
1-4, 2-4, 3-4, 4-4, 6-4, 7-4, 8-4, 9-4,
1-5, 2-5, 3-5, 4-5, 6-5, 7-5, 8-5, 9-5,
1-7, 2-7, 3-7, 4-7, 6-7, 7-7, 8-7, 9-7,
1-8, 2-8, 3-8, 4-8, 6-8, 7-8, 8-8, 9-8,
1-9, 2-9, 3-9, 4-9, 6-9, 7-9, 8-9, 9-9,
1-10, 2-10, 3-10, 4-10, 6-10, 7-10, 8-10, 9-10,
1-11, 2-11, 3-11, 4-11, 6-11, 7-11, 8-11, 9-11,
]
Cores that we're polling on t3005:
Not done cores (logical): (x=7,y=4) (x=5,y=5) (x=3,y=6) (x=6,y=4) (x=4,y=5) (x=2,y=6) (x=7,y=5) (x=5,y=4) (x=1,y=6) (x=7,y=6) (x=3,y=4) (x=1,y=5) (x=5,y=6) (x=3,y=5) (x=1,y=4) (x=7,y=0) (x=5,y=1) (x=3,y=2) (x=1,y=3) (x=7,y=1) (x=5,y=0) (x=3,y=3) (x=1,y=2) (x=7,y=2) (x=5,y=3) (x=3,y=0) (x=1,y=1) (x=7,y=3) (x=5,y=2) (x=3,y=1) (x=1,y=0) (x=6,y=5) (x=4,y=4) (x=0,y=6) (x=6,y=6) (x=2,y=4) (x=0,y=5) (x=4,y=6) (x=2,y=5) (x=0,y=4) (x=6,y=0) (x=4,y=1) (x=2,y=2) (x=0,y=3) (x=6,y=1) (x=4,y=0) (x=2,y=3) (x=0,y=2) (x=6,y=2) (x=4,y=3) (x=2,y=0) (x=0,y=1) (x=6,y=3) (x=4,y=2) (x=2,y=1) (x=0,y=0)
Not done cores (physical routing): (x=9,y=5) (x=7,y=7) (x=4,y=9) (x=8,y=5) (x=6,y=7) (x=3,y=9) (x=9,y=7) (x=7,y=5) (x=2,y=9) (x=9,y=9) (x=4,y=5) (x=2,y=7) (x=7,y=9) (x=4,y=7) (x=2,y=5) (x=9,y=1) (x=7,y=2) (x=4,y=3) (x=2,y=4) (x=9,y=2) (x=7,y=1) (x=4,y=4) (x=2,y=3) (x=9,y=3) (x=7,y=4) (x=4,y=1) (x=2,y=2) (x=9,y=4) (x=7,y=3) (x=4,y=2) (x=2,y=1) (x=8,y=7) (x=6,y=5) (x=1,y=9) (x=8,y=9) (x=3,y=5) (x=1,y=7) (x=6,y=9) (x=3,y=7) (x=1,y=5) (x=8,y=1) (x=6,y=2) (x=3,y=3) (x=1,y=4) (x=8,y=2) (x=6,y=1) (x=3,y=4) (x=1,y=3) (x=8,y=3) (x=6,y=4) (x=3,y=1) (x=1,y=2) (x=8,y=4) (x=6,y=3) (x=3,y=2) (x=1,y=1)
Not done cores (UMD virtual): (chip=0,x=9,y=5) (chip=0,x=7,y=7) (chip=0,x=4,y=8) (chip=0,x=8,y=5) (chip=0,x=6,y=7) (chip=0,x=3,y=8) (chip=0,x=9,y=7) (chip=0,x=7,y=5) (chip=0,x=2,y=8) (chip=0,x=9,y=8) (chip=0,x=4,y=5) (chip=0,x=2,y=7) (chip=0,x=7,y=8) (chip=0,x=4,y=7) (chip=0,x=2,y=5) (chip=0,x=9,y=1) (chip=0,x=7,y=2) (chip=0,x=4,y=3) (chip=0,x=2,y=4) (chip=0,x=9,y=2) (chip=0,x=7,y=1) (chip=0,x=4,y=4) (chip=0,x=2,y=3) (chip=0,x=9,y=3) (chip=0,x=7,y=4) (chip=0,x=4,y=1) (chip=0,x=2,y=2) (chip=0,x=9,y=4) (chip=0,x=7,y=3) (chip=0,x=4,y=2) (chip=0,x=2,y=1) (chip=0,x=8,y=7) (chip=0,x=6,y=5) (chip=0,x=1,y=8) (chip=0,x=8,y=8) (chip=0,x=3,y=5) (chip=0,x=1,y=7) (chip=0,x=6,y=8) (chip=0,x=3,y=7) (chip=0,x=1,y=5) (chip=0,x=8,y=1) (chip=0,x=6,y=2) (chip=0,x=3,y=3) (chip=0,x=1,y=4) (chip=0,x=8,y=2) (chip=0,x=6,y=1) (chip=0,x=3,y=4) (chip=0,x=1,y=3) (chip=0,x=8,y=3) (chip=0,x=6,y=4) (chip=0,x=3,y=1) (chip=0,x=1,y=2) (chip=0,x=8,y=4) (chip=0,x=6,y=3) (chip=0,x=3,y=2) (chip=0,x=1,y=1)
Rows 8, 10, 11 are removed from the physical routing list (worker_core_from_logical_core).
8, 11 presumably based on the extracted harvesting info, and then 10 is for storage/dispatch.
Virtual list has 9, 10, 11 removed. 10/11 for harvesting and then 9 for storage/dispatch.
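As a cross-check of those numbers, the logical -> routing mapping can be reproduced by hand with a back-of-the-envelope sketch (assuming the mapping is simply "index into the remaining worker rows/columns"; the row/column lists come from wormhole_b0_8x10.yaml above, and the removed-row set {8, 10, 11} is the t3005 case):
// Reproduce logical -> physical routing for t3005: worker rows from the SOC
// descriptor minus the removed rows {8, 10, 11}; worker columns are unchanged.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> worker_cols = {1, 2, 3, 4, 6, 7, 8, 9};
    std::vector<int> worker_rows = {1, 2, 3, 4, 5, 7, 8, 9, 10, 11};
    for (int removed : {8, 10, 11})   // harvested (8, 11) plus the storage/dispatch row (10)
        worker_rows.erase(std::remove(worker_rows.begin(), worker_rows.end(), removed), worker_rows.end());

    const int lx = 5, ly = 2;         // the hanging core, logical coords
    std::printf("logical (x=%d,y=%d) -> routing (x=%d,y=%d)\n",
                lx, ly, worker_cols[lx], worker_rows[ly]);   // prints routing (x=7,y=3)
    return 0;
}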
So far in these printouts I don't see a harvesting bug.
But it is very sus that we always hang on that same row, on both t3002 / t3005.
I was able to repro it ~5 times so far.
Also printed the cores for which we're setting up static TLBs. This seems to be done via UMD virtual coords: rows 1-5 and 7-9 for workers, with rows 0 and 6 added for ETH. I assume this is what UMD expects -- TLBs programmed via virtual coords, since TLBs are looked up via virtual coords.
setting up TLB for cores:
(x=1,y=1) (x=2,y=1) (x=3,y=1) (x=4,y=1) (x=6,y=1) (x=7,y=1) (x=8,y=1) (x=9,y=1) (x=1,y=2) (x=2,y=2) (x=3,y=2) (x=4,y=2) (x=6,y=2) (x=7,y=2) (x=8,y=2) (x=9,y=2) (x=1,y=3) (x=2,y=3) (x=3,y=3) (x=4,y=3) (x=6,y=3) (x=7,y=3) (x=8,y=3) (x=9,y=3) (x=1,y=4) (x=2,y=4) (x=3,y=4) (x=4,y=4) (x=6,y=4) (x=7,y=4) (x=8,y=4) (x=9,y=4) (x=1,y=5) (x=2,y=5) (x=3,y=5) (x=4,y=5) (x=6,y=5) (x=7,y=5) (x=8,y=5) (x=9,y=5) (x=1,y=7) (x=2,y=7) (x=3,y=7) (x=4,y=7) (x=6,y=7) (x=7,y=7) (x=8,y=7) (x=9,y=7) (x=1,y=8) (x=2,y=8) (x=3,y=8) (x=4,y=8) (x=6,y=8) (x=7,y=8) (x=8,y=8) (x=9,y=8) (x=1,y=9) (x=2,y=9) (x=3,y=9) (x=4,y=9) (x=6,y=9) (x=7,y=9) (x=8,y=9) (x=9,y=9) (x=9,y=0) (x=1,y=0) (x=8,y=0) (x=2,y=0) (x=7,y=0) (x=3,y=0) (x=6,y=0) (x=4,y=0) (x=9,y=6) (x=1,y=6) (x=8,y=6) (x=2,y=6) (x=7,y=6) (x=3,y=6) (x=6,y=6) (x=4,y=6)
@abhullar-tt @kkwong10 we're doing std::int32_t get_static_tlb_index(CoreCoord target) in tt_cluster.cpp, and all of the TLB maps use virtual UMD coords.
I wonder if this has something to do with which TLB we use to poll, and which one should be used.
The comment below suggests that there is a 2MB polling TLB, and that the TLB setup is aligned with the KMD's WC/UC split.
// MEM_*_TLB are for dynamic read/writes to memory, either 16MB (large read/writes) or 2MB (polling). REG_TLB for dynamic writes
// to registers. They are aligned with the kernel driver's WC/UC split. But kernel driver uses different TLB's for these.
static constexpr unsigned int REG_TLB = TLB_BASE_INDEX_16M + 18;
static constexpr unsigned int MEM_LARGE_WRITE_TLB = TLB_BASE_INDEX_16M + 17;
static constexpr unsigned int MEM_LARGE_READ_TLB = TLB_BASE_INDEX_16M + 0;
static constexpr unsigned int MEM_SMALL_READ_WRITE_TLB = TLB_BASE_INDEX_2M + 1;
static constexpr uint32_t DYNAMIC_TLB_BASE_INDEX = MEM_LARGE_READ_TLB + 1;
static constexpr uint32_t INTERNAL_TLB_INDEX = DYNAMIC_TLB_BASE_INDEX + DYNAMIC_TLB_COUNT; // pcie_write_xy and similar
static constexpr uint32_t DRAM_CHANNEL_0_X = 0;
static constexpr uint32_t DRAM_CHANNEL_0_Y = 0;
static constexpr uint32_t DRAM_CHANNEL_0_PEER2PEER_REGION_START = 0x30000000; // This is the last 256MB of DRAM
@abhullar-tt @kkwong10 @pgkeller @DrJessop -- lots of interesting data. Although I wouldn't rule out harvesting, what's really sus is that the exact same row is hanging on 2 N300s, they have different rows harvested, and the row that's hanging is not harvested. So what is regular and set up the same across devices? One thing I can think of is TLBs: we have static TLBs and 4 dynamic types (LARGE_READ_TLB, LARGE_WRITE_TLB, SMALL_READ_WRITE_TLB, REG_TLB), and different TLBs have different WC/UC mappings. My suspicion is that any resets (reg writes), "go" signals, and any "polling" should be done via UC writes/reads. This doesn't seem to be the case today.
1) Writing the reset via "LARGE_WRITE_TLB" is what we use currently, and it's the only one I was able to try; I get a crash for every other option, so I couldn't really see what the effect of the other TLB types would be.
//device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "REG_TLB"); // crash
//device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "SMALL_READ_WRITE_TLB"); // crash
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
2) Resetting the full device results in mis-matches. Are we still downloading blanks to un-used cores / storage cores? I think we should get this to work, for several reasons:
//cluster->deassert_risc_reset(device_id, false);
// which calls
// device->deassert_risc_reset(target_device_id);
3) Writing to the reset reg 2 times in a row to the same core results in very frequent hangs -- and the number of cores that hang varies; it can be 20-50. Writing to this reg with the de-assert mask starts BRISC but keeps NCRISC/TRISC in reset, so my guess is that the first write kicks off BRISC, then BRISC itself kicks off NCRISC/TRISC, but then the 2nd write comes and asserts reset on NCRISC/TRISC, so they never finish. The variability in timing is interesting, because sometimes the writes arrive close enough together that there is no hang.
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_sfence();
// device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_sfence();
4) Polling the reset register to check that the value was written hangs (unless there's a bug in this code). This is unexpected; it would be great to be able to do this, to confirm that BRISC on a core did come out of reset. I hope this can be made to work.
tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_mfence();
std::vector<uint32_t> vec_read(1);
vec_read[0] = 0xfacafaca;
while (vec_read[0] != vec[0]) {
device->read_from_device(vec_read.data(), virtual_core, 0xFFB121B0, vec.size(), "LARGE_READ_TLB"); // hangs
}
5) Based on experiment 3 (back-to-back writes to the reset reg), I made the 2nd write just send a 0. This should mean "don't reset any RISCV", so my expectation was a no-op. The test passed once, and then the 2nd time I got a 0 in the run mailbox when checking done. That is very sus; before de-asserting resets we write the mailbox and check its value.
Read unexpected run_mailbox value: 0 (expected 2a or 1)
- Writing the reset via "LARGE_WRITE_TLB" is what we use currently, and it's the only one I was able to try; I get a crash for every other option, so I couldn't really see what the effect of the other TLB types would be.
//device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "REG_TLB"); // crash
//device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "SMALL_READ_WRITE_TLB"); // crash
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
The UMD constructor (see cluster->open_device) accepts a dynamic TLB config as a param (string -> int map). By default the constructor adds:
// It is mandatory for all devices to have these TLBs set aside, as the driver needs them to issue remote reads and writes.
dynamic_tlb_config["LARGE_READ_TLB"] = DEVICE_DATA.MEM_LARGE_READ_TLB;
dynamic_tlb_config["LARGE_WRITE_TLB"] = DEVICE_DATA.MEM_LARGE_WRITE_TLB;
In this experiment, was the dynamic TLB config uplifted? If we want to use REG_TLB and MEM_SMALL_READ_WRITE_TLB, then in open_device we should uplift:
uint32_t num_host_mem_ch_per_mmio_device = 1;
std::unordered_map<std::string, std::int32_t> dynamic_tlb_config = {
{"REG_TLB", DEVICE_DATA.REG_TLB},
{"SMALL_READ_WRITE_TLB", DEVICE_DATA.MEM_SMALL_READ_WRITE_TLB}
};
// This will remove harvested rows from the soc descriptor
const bool perform_harvesting = true;
device = std::make_unique<tt_SiliconDevice>(sdesc_path, ndesc_path, target_device_ids, num_host_mem_ch_per_mmio_device, dynamic_tlb_config, skip_driver_allocs, perform_harvesting);
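If that uplift goes in, one follow-up experiment (a sketch only, not tested) would be to route the done-bit polling through the 2MB SMALL_READ_WRITE_TLB instead of LARGE_READ_TLB, in line with the "2MB (polling)" comment above; run_mailbox_addr below is a stand-in for the actual mailbox address:
// Sketch: poll via the 2MB polling TLB once "SMALL_READ_WRITE_TLB" is registered in
// dynamic_tlb_config. With the default config this name is not registered, which may
// explain the crash seen in experiment 1 above.
std::vector<uint32_t> run_mailbox(1, 0);
device->read_from_device(run_mailbox.data(), virtual_core, run_mailbox_addr,
                         run_mailbox.size() * sizeof(uint32_t), "SMALL_READ_WRITE_TLB");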
During initial port to open-UMD we had decided to use the default.
Resetting the full device results in mis-matches. Are we still downloading blanks to un-used cores / storage cores? I think we should get this to work, for several reasons:
- in UMD it's actually implemented using REG_TLB and a full chip broadcast. I think REG_TLB matters (UC?), broadcast may not matter
- this is what BBE uses
- this is what metal used and then we changed to LARGE_WRITE_TLB
- so we went from "REG_TLB + broadcast" to "LARGE_WRITE_TLB + unicast"
//cluster->deassert_risc_reset(device_id, false);
// which calls
// device->deassert_risc_reset(target_device_id);
We are writing blanks to all un-used cores, but not to storage-only cores. Was this modified in launch kernels? Maybe the hang here is because we aren't taking all of the cores out of reset anymore (only the ones used in the program)?
- Writing to the reset reg 2 times in a row to the same core results in very frequent hangs -- and the number of cores that hang varies; it can be 20-50. Writing to this reg with the de-assert mask starts BRISC but keeps NCRISC/TRISC in reset, so my guess is that the first write kicks off BRISC, then BRISC itself kicks off NCRISC/TRISC, but then the 2nd write comes and asserts reset on NCRISC/TRISC, so they never finish. The variability in timing is interesting, because sometimes the writes arrive close enough together that there is no hang.
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_sfence();
// device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_sfence();
- Polling the reset register to check that the value was written hangs (unless there's a bug in this code). This is unexpected; it would be great to be able to do this, to confirm that BRISC on a core did come out of reset. I hope this can be made to work.
tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_mfence();
std::vector<uint32_t> vec_read(1);
vec_read[0] = 0xfacafaca;
while (vec_read[0] != vec[0]) {
    device->read_from_device(vec_read.data(), virtual_core, 0xFFB121B0, vec.size(), "LARGE_READ_TLB"); // hangs
}
The size to pass into read_from_device should be vec.size() * sizeof(uint32_t).
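For reference (untested), experiment 4's loop with that size fix applied -- same calls as above, just the byte count corrected:
tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
_mm_mfence();
std::vector<uint32_t> vec_read(1);
vec_read[0] = 0xfacafaca;
while (vec_read[0] != vec[0]) {
    // size is now in bytes, as suggested above
    device->read_from_device(vec_read.data(), virtual_core, 0xFFB121B0, vec_read.size() * sizeof(uint32_t), "LARGE_READ_TLB");
}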
- Based on experiment 3 (back-to-back writes to the reset reg), I made the 2nd write just send a 0. This should mean "don't reset any RISCV", so my expectation was a no-op. The test passed once, and then the 2nd time I got a 0 in the run mailbox when checking done. That is very sus; before de-asserting resets we write the mailbox and check its value.
Read unexpected run_mailbox value: 0 (expected 2a or 1)
The same test hangs on the latest main after the fw-at-init changes, hash = 2229e00055af18b999a85df895bc94af098cc3d1. I haven't ported the polling simplification + debug hanging core change yet -- so I don't have the hang pattern data yet.
It hung on the 4th iteration of the post-commit stress run; it does seem easier to make it hang as part of post-commit. It crashed on the 414th iteration of a stand-alone test run -- not sure if related; filed a separate case: https://github.com/tenstorrent-metal/tt-metal/issues/2700
Posting a longer test trace to see what ran just before, in case the system state turns out to be relevant. I'm guessing it matters, since it was the 4th iteration under post-commit vs. the 414th standalone.
2023-09-18 06:08:06.853 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:06.335 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-18 06:08:06.393 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:06.394 | WARNING | SiliconDriver - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 156649 tid: 140524614661952)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:06.418 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-18 06:08:06.853 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:06.853 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.912 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.913 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.913 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-wh test passed with input shape [[1, 1, 32, 32]].
PASSED Op | INFO | Program Cache: disabled and cleared.
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0] 2023-09-18 06:08:07.918 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:07.918 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.920 | DEBUG | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-cn test passed with input shape [[1, 1, 32, 32]].
PASSED Op | INFO | Program Cache: disabled and cleared.
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes1] SKIPPED (not working for Wormhole B0) Metal | INFO | Closing device 0
=============================================================================================================== PASSES ================================================================================================================
======================================================================================================= short test summary info =======================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0]
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0]
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:49: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:98: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:121: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:144: not working for Wormhole B0
==================================================================================================== 2 passed, 8 skipped in 3.02s =====================================================================================================
Metal | INFO | Closing device driver
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:10.968 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-18 06:08:11.025 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:11.025 | WARNING | SiliconDriver - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 157310 tid: 140332769195840)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:11.050 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-18 06:08:11.476 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-18 06:08:11.476 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
I see the debug strategy; thanks @davorchap.
Trying to repro on my IRD machine. I am not able to repro the hang in 200 runs -- machine configuration:
I don't run warm-reset in between. -- Trying longer, since it seems we crash on iteration 414 in standalone.
Do you know what the harvesting config on this machine is?
We are soft-harvesting for this machine:
harvesting: [
0: {noc_translation: True, harvest_mask: 1},
]
This means the first tensix row (NOC physical y=1) is harvested.
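For completeness, a small sketch of how a harvest_mask like this can be decoded into row indices, assuming (as stated above) that bit 0 corresponds to the first tensix row; the function name and the bit-to-row ordering are illustrative, not the actual soc-descriptor code.
#include <cstdint>
#include <vector>
// Illustrative only: expand a harvest_mask into the tensix row indices it marks as harvested,
// assuming bit i of the mask corresponds to tensix row i (so mask 1 -> row 0, i.e. NOC y=1 here).
std::vector<uint32_t> harvested_tensix_rows(uint32_t harvest_mask) {
    std::vector<uint32_t> rows;
    for (uint32_t row = 0; harvest_mask != 0; ++row, harvest_mask >>= 1) {
        if (harvest_mask & 1u) {
            rows.push_back(row);
        }
    }
    return rows;
}
// harvested_tensix_rows(1) == {0}: only the first tensix row is harvested for this machine.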
An update: I managed to recreate the hang last night, on row 3, at iteration 500 -- which is odd, since it does not match the "harvested" row, which is tensix row 0 (NOC physical y=1) in this case. Also, this machine is a soft-harvested configuration, so targeting the harvested row should not result in any erroneous behavior anyway.
Always | INFO | Not Done Cores: {(x=1,y=3), (x=2,y=3), (x=3,y=3), (x=4,y=3), (x=6,y=3), (x=7,y=3), (x=8,y=3), (x=9,y=3)}
@davorchap
Interesting data point: I added some telemetry + a small sleep (0.2 s) before checking the core status again, so I'm not blasting the log. As a result, I wasn't able to recreate the post-commit hang; I'm currently at iteration 8 and still going strong. I think it is quite hard to get to the hang scenario when I perturb the timing. I'm rerunning without the sleep to see if I can still hit the hang standalone.
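A sketch of that perturbation, for reference: the done-check itself is whatever the harness already uses (passed in as a predicate here), and only the 0.2 s back-off between polls is the point; the names are illustrative.
#include <chrono>
#include <functional>
#include <thread>
// Illustrative only: poll the existing "all cores done" predicate, but back off ~0.2 s
// between polls instead of spinning, which is the timing perturbation described above.
void wait_until_done_with_backoff(const std::function<bool()>& all_cores_done) {
    while (!all_cores_done()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}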
I think there are a few things I want to try.
Just posting here for visibility. Observed another hang on BM on a usually good machine:
Logs, still running but should time out soon: https://github.com/tenstorrent-metal/tt-metal/actions/runs/6256572531/job/16987607490
Stack trace
#0 0x00007fecbb961dc5 in std::_Rb_tree_iterator<std::pair<int const, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > > >::operator*() const ()
from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#1 0x00007fecbb92402c in std::map<int, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > > > >::at(int const&) () from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#2 0x00007fecbb90d35a in tt_SiliconDevice::address_in_tlb_space(unsigned int, unsigned int, int, unsigned int, unsigned int) () from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#3 0x00007fecbb9157c1 in tt_SiliconDevice::read_device_memory(unsigned int*, tt_cxy_pair, unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#4 0x00007fecbb91f908 in tt_SiliconDevice::read_from_device(unsigned int*, tt_cxy_pair, unsigned long, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#5 0x00007fecbc244e6a in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x55e1a408c850, dram_core=..., addr=16, size_in_bytes=4, small_access=<optimized out>) at /usr/include/c++/9/bits/char_traits.h:300
#6 0x00007fecbc257820 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>) at /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
#7 0x00007fecbc2586e8 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=16, __closure=0x55e1a4090830) at tt_metal/llrt/llrt.cpp:384
#8 std::_Function_handler<bool(long unsigned int), tt::llrt::internal_::check_if_riscs_on_specified_core_done(tt_cluster*, int, const CoreCoord&)::<lambda(uint64_t)> >::_M_invoke(const std::_Any_data &, unsigned long &&) (__functor=..., __args#0=<optimized out>)
at /usr/include/c++/9/bits/std_function.h:285
#9 0x00007fecbc258547 in std::function<bool (unsigned long)>::operator()(unsigned long) const (__args#0=<optimized out>, this=0x7fff80aba430) at /usr/include/c++/9/bits/std_function.h:683
#10 tt::llrt::internal_::check_if_riscs_on_specified_core_done (cluster=<optimized out>, chip_id=<optimized out>, core=...) at tt_metal/llrt/llrt.cpp:396
#11 0x00007fecbc1c5334 in tt::tt_metal::LaunchProgram (device=device@entry=0x55e1a3f13c40, program=..., stagger_start=stagger_start@entry=false) at tt_metal/tt_metal.cpp:519
#12 0x000055e1a2eba6a6 in tt::test::buffer::detail::SimpleTiledL1WriteCBRead (device=0x55e1a3f13c40, core=..., input_local_address=794624, intermed_local_address=<optimized out>, output_local_address=802816, byte_size=<optimized out>)
at tests/tt_metal/tt_metal/unit_tests/buffer/test_simple_l1_buffer.cpp:102
#13 0x000055e1a2ebc398 in SingleDeviceFixture_TestSimpleL1ReadWritex2y2TileLo_Test::TestBody (this=0x55e1a3cb7eb0) at ./tt_metal/third_party/umd/device/tt_xy_pair.h:13
#14 0x00007fecbbec925e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /usr/local/lib/libgtest.so.1.13.0
#15 0x00007fecbbec12bd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /usr/local/lib/libgtest.so.1.13.0
Ran this on Davor's t3005 and set it to run for 10K iterations, but forgot to log the iteration number and it hung. Launched again, and will report the number.
It does seem like fast dispatch improves stability, but it just delays the hang.
Reran on Davor's t3005. Made it to 683 iterations before reaching the hanging state.
FYI, this hang-signature looks similar to the hang related to UMD seen on BBE as well: https://yyz-gitlab.local.tenstorrent.com/tenstorrent/budabackend/-/issues/1931
@abhullar-tt is this relevant?
I believe @kkwong10 was root causing this issue and it appears to be similar to the hang identified in BBE, I don't think this has been solved in UMD yet.
@davorchap should we repeat this stress test?
@abhullar-tt any update on this?
@abhullar-tt any update on this?
No updates. I can kick off a stress test to see if this still shows up
This issue is quite stale and has not been actioned on in the past months. Can we adjust the priority down?
IMHO, the problem isn't that this is too high a priority; it's that we aren't assigning a resource. Undiagnosed hangs should be critical.
I kicked off a stress test on WH a month ago but didn't see the same hang signature. We probably should spend more time confirming whether we still see it or not.
The WH stress test has been successful for quite some time; is this test still part of the post-commit suite?
I ran 50 iterations on WH on main (bf25587) and saw no hangs. I don't believe this is an issue anymore.
It hangs on the 6th iteration of the post-commit stress test (machine: t3005), waiting for the program to finish: