WH: Multi-core unpad hangs non-deterministically in post-commit #2691

Closed by davorchap 9 months ago

davorchap commented 1 year ago

It hangs on the 6th iteration of the post-commit stress test (machine: t3005).

===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 08:11:28.865 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-16 08:11:28.927 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 08:11:28.953 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-16 08:11:29.401 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 08:11:29.401 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}

Hanging, waiting for the program to finish:

(gdb) bt
#0  0x00007fa78414be60 in memcpy_from_device(void*, void const*, unsigned long) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#1  0x00007fa78414c04a in read_block(TTDevice*, unsigned int, unsigned int, unsigned long, unsigned int) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#2  0x00007fa784155845 in tt_SiliconDevice::read_device_memory(unsigned int*, tt_cxy_pair, unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#3  0x00007fa78415f908 in tt_SiliconDevice::read_from_device(unsigned int*, tt_cxy_pair, unsigned long, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#4  0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
    at /usr/include/c++/9/bits/char_traits.h:300
#5  0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
    at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
#6  0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
#7  std::_Function_handler<bool(long unsigned int), tt::llrt::internal_::check_if_riscs_on_specified_core_done(tt_cluster*, int, tt::llrt::TensixRiscsOptions, const CoreCoord&)::<lambda(uint64_t)> >::_M_invoke(const std::_Any_data &, unsigned long &&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/9/bits/std_function.h:285
#8  0x00007fa7843fc437 in std::function<bool (unsigned long)>::operator()(unsigned long) const (__args#0=<optimized out>, this=0x7ffc7b436eb0) at /usr/include/c++/9/bits/std_function.h:683
#9  tt::llrt::internal_::check_if_riscs_on_specified_core_done (cluster=<optimized out>, chip_id=<optimized out>, riscs_options=<optimized out>, core=...) at tt_metal/llrt/llrt.cpp:441
#10 0x00007fa78436b4cf in tt::tt_metal::LaunchProgram (device=device@entry=0x6495380, program=..., stagger_start=stagger_start@entry=false) at tt_metal/tt_metal.cpp:508
#11 0x00007fa7847c9ff8 in tt::tt_metal::operation::detail::run_without_program_cache (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
    optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:125
#12 0x00007fa7847ce631 in tt::tt_metal::operation::run (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...}, optional_input_tensors=std::vector of length 0, capacity 0)
    at tt_eager/tt_dnn/op_library/run_operation.cpp:204
#13 0x00007fa7847cfcd1 in tt::tt_metal::operation::run_without_autoformat (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
    optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:237
#14 0x00007fa7846c9f9c in tt::tt_metal::operation::run_without_autoformat<tt::tt_metal::Unpad> (optional_input_tensors=std::vector of length 0, capacity 0,
    input_tensors=std::vector of length 1, capacity 1 = {...}, concrete_op=...) at /usr/include/c++/9/new:174
#15 tt::tt_metal::unpad (input_tensor_a=..., output_tensor_start=..., output_tensor_end=..., mem_config=...) at tt_eager/tt_dnn/op_library/unpad/unpad_op.cpp:130
#16 0x00007fa784597777 in tt::tt_metal::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>::operator() (__closure=<optimized out>, output_mem_config=..., output_tensor_end=..., output_tensor_start=..., input_tensor=...) at tt_eager/tt_lib/csrc/tt_lib_bindings.cpp:2272
#17 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call_impl<tt::tt_metal::Tensor, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&, 0, 1, 2, 3, pybind11::detail::void_type> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1443
#18 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call<tt::tt_metal::Tensor, pybind11::detail::void_type, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1412
#19 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator()(pybind11::detail::function_call &) (call=..., this=<optimized out>)
    at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:248
#20 0x00007fa7846173a6 in pybind11::cpp_function::dispatcher (self=0x6510840, args_in=0x7fa78496bcc0, kwargs_in=0x1) at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:939
davorchap commented 1 year ago

@tt-aho unpad was recently made multi-core -- is this specific config running single-core or multi-core?

@muthutt -- gdb is reporting that we're waiting on core {x = 7, y = 1}, so either this test is multi-core, or there's a bug in this test, or the kernel was launched on the wrong core.

(gdb) up
#4  0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
    at /usr/include/c++/9/bits/char_traits.h:300
300         { __c1 = __c2; }
(gdb) info locals
chip_id = <optimized out>
virtual_dram_core = {<tt_xy_pair> = {x = 7, y = 1}, chip = 0}
(gdb) up
#5  0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
    at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
38      tt_cxy_pair(std::size_t ichip, tt_xy_pair xy_pair) : tt_xy_pair(xy_pair.x, xy_pair.y), chip(ichip) {}
(gdb) info locals
read_hex_vec = std::vector of length 1, capacity 1 = {0}
(gdb) up
#6  0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
429           run_mailbox_read_val = read_hex_vec_from_core(
(gdb) info locals
Python Exception <class 'gdb.error'> value has been optimized out:
run_mailbox_read_val =
cluster = @0x7ffc7b436e98: 0x64a0550
chip_id = @0x7ffc7b436e94: 0
core = @0x7ffc7b436f70: {x = 7, y = 1}
cluster = <optimized out>
chip_id = <optimized out>
core = <optimized out>
run_mailbox_read_val = <optimized out>

OK, definitely a multi-core test, on 56 cores; got this from gdb:

(gdb) up
#11 tt::tt_metal::Kernel::logical_cores (this=<optimized out>) at tt_metal/impl/kernels/kernel.cpp:33
33      for (auto core_range : this->core_range_set_.ranges()) {
(gdb) info locals
core_range = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
cores = std::set with 56 elements = {[0] = {x = 0, y = 0}, [1] = {x = 0, y = 1}, [2] = {x = 0, y = 2}, [3] = {x = 0, y = 3}, [4] = {x = 0, y = 4}, [5] = {x = 0, y = 5}, [6] = {x = 0,
    y = 6}, [7] = {x = 1, y = 0}, [8] = {x = 1, y = 1}, [9] = {x = 1, y = 2}, [10] = {x = 1, y = 3}, [11] = {x = 1, y = 4}, [12] = {x = 1, y = 5}, [13] = {x = 1, y = 6}, [14] = {x = 2,
    y = 0}, [15] = {x = 2, y = 1}, [16] = {x = 2, y = 2}, [17] = {x = 2, y = 3}, [18] = {x = 2, y = 4}, [19] = {x = 2, y = 5}, [20] = {x = 2, y = 6}, [21] = {x = 3, y = 0}, [22] = {x = 3,
    y = 1}, [23] = {x = 3, y = 2}, [24] = {x = 3, y = 3}, [25] = {x = 3, y = 4}, [26] = {x = 3, y = 5}, [27] = {x = 3, y = 6}, [28] = {x = 4, y = 0}, [29] = {x = 4, y = 1}, [30] = {x = 4,
    y = 2}, [31] = {x = 4, y = 3}, [32] = {x = 4, y = 4}, [33] = {x = 4, y = 5}, [34] = {x = 4, y = 6}, [35] = {x = 5, y = 0}, [36] = {x = 5, y = 1}, [37] = {x = 5, y = 2}, [38] = {x = 5,
    y = 3}, [39] = {x = 5, y = 4}, [40] = {x = 5, y = 5}, [41] = {x = 5, y = 6}, [42] = {x = 6, y = 0}, [43] = {x = 6, y = 1}, [44] = {x = 6, y = 2}, [45] = {x = 6, y = 3}, [46] = {x = 6,
    y = 4}, [47] = {x = 6, y = 5}, [48] = {x = 6, y = 6}, [49] = {x = 7, y = 0}, [50] = {x = 7, y = 1}, [51] = {x = 7, y = 2}, [52] = {x = 7, y = 3}, [53] = {x = 7, y = 4}, [54] = {x = 7,
    y = 5}, [55] = {x = 7, y = 6}}
davorchap commented 1 year ago

@kkwong10 and @abhullar-tt do we have any other multi-core tests in post-commit?

This is the first one I saw; it hangs non-deterministically, failing on the 6th iteration of post-commit.

tt-aho commented 1 year ago

Do we support multi-core tests on WH now? Or should I fix up and merge the single-core mode changes I was working on previously? (When enabled, ops that request the grid size would get back a 1x1 grid.) Then we could force all the op WH tests to set this env var until multi-core is better tested, and selectively remove the flag later. It might also help increase WH test coverage: currently the WH tests only use specific shapes that trigger the single-core path (most are just single-tile tests, I think), but with the flag set you should be able to run all the original unit test shapes unless there is some other bug.
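
For illustration, a minimal sketch of what that gate could look like (the env var name and helper below are hypothetical, not the actual tt-metal API):

#include <cstdlib>
#include <cstdint>

struct CoreCoord { std::uint32_t x; std::uint32_t y; };

// Hypothetical helper: when the (assumed) TT_METAL_SINGLE_CORE_MODE env var is set,
// report a 1x1 compute grid so every op falls back to its single-core path;
// otherwise report the device's full worker grid.
CoreCoord get_compute_grid_size(CoreCoord full_grid) {
    if (std::getenv("TT_METAL_SINGLE_CORE_MODE") != nullptr) {
        return CoreCoord{1, 1};
    }
    return full_grid;
}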

davorchap commented 1 year ago

> Do we support multi-core tests on WH now? Or should I fix up and merge the single-core mode changes I was working on previously? (When enabled, ops that request the grid size would get back a 1x1 grid.) Then we could force all the op WH tests to set this env var until multi-core is better tested, and selectively remove the flag later. It might also help increase WH test coverage: currently the WH tests only use specific shapes that trigger the single-core path (most are just single-tile tests, I think), but with the flag set you should be able to run all the original unit test shapes unless there is some other bug.

I think it would be great to have the ability to force single-core. We'd still want to run multi-core; as of now there's no reason it shouldn't work, but there could be a bug, and we need to decouple single-core debug from multi-core debug.

We should do progressive WH testing:

davorchap commented 1 year ago

Another hang, now after the 8th iteration of post-commit. It looks to be the same configuration.

===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 09:37:50.214 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-16 09:37:50.268 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 09:37:50.294 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-16 09:37:50.775 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 09:37:50.776 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
abhullar-tt commented 1 year ago

> @kkwong10 and @abhullar-tt do we have any other multi-core tests in post-commit?
>
> This is the first one I saw; it hangs non-deterministically, failing on the 6th iteration of post-commit.

All of the C++ unit tests are single-core.

abhullar-tt commented 1 year ago

> Another hang, now after the 8th iteration of post-commit. It looks to be the same configuration.


Ivan mentioned that physical coords + translation tables enabled should work on device, but BBE hit an issue with this config. Could we be having the same problem since we aren't using the HW-translated coordinates?

davorchap commented 1 year ago

> Another hang, now after the 8th iteration of post-commit. It looks to be the same configuration.


> Ivan mentioned that physical coords + translation tables enabled should work on device, but BBE hit an issue with this config. Could we be having the same problem since we aren't using the HW-translated coordinates?

Perhaps translation is the issue, or this is just a slow dispatch bug. If it were a translation issue, would this test work at all? It hangs only on the 8th iteration of post-commit.

Based on what Ivan was saying, using physical coords is fine: they go through the same HW table in front of the router, the physical range of coords maps 1:1 in that table, and the router works with physical coords anyway.

We should investigate this more before switching to translated coordinates.

Btw, GS + slow dispatch is also hanging (a different single-core test). I talked to @DrJessop, and we're going to do a straight fast dispatch port to WH to rule out slow dispatch issues.

davorchap commented 1 year ago

I was able to repro the unpad hang by running it by itself. It hung on the 122nd iteration. The two main suspects are a slow dispatch bug and harvesting. I will dig deeper to see which core is hanging, and whether there's a pattern to it.

2023-09-16 19:15:30.494 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:31.636 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-16 19:15:31.637 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:31.637 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - unpad test passed with input shape [[5, 5, 50, 50]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.
                  Metal | INFO     | Closing device 0

=========================================================================================================== PASSES ===========================================================================================================
================================================================================================== short test summary info ===================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]
===================================================================================================== 1 passed in 3.12s ======================================================================================================
                  Metal | INFO     | Closing device driver
==================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 1 item

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 19:15:34.737 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-16 19:15:34.799 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 19:15:34.825 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-16 19:15:35.260 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 19:15:35.261 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:36.425 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-16 19:15:36.425 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
2023-09-16 19:15:36.425 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - unpad test passed with input shape [[5, 5, 50, 50]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.
                  Metal | INFO     | Closing device 0

=========================================================================================================== PASSES ===========================================================================================================
================================================================================================== short test summary info ===================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]
===================================================================================================== 1 passed in 3.12s ======================================================================================================
                  Metal | INFO     | Closing device driver
==================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 1 item

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 19:15:39.517 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-16 19:15:39.574 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 19:15:39.599 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-16 19:15:40.024 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 19:15:40.025 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
davorchap commented 1 year ago

One interesting thing I caught in top: the slow dispatch test that is stuck polling the device is reported as using 100.3% of a CPU core. I wonder if the polling thread is being put to sleep and also moved around cores, or if this is just math noise/error from the OS/top.

@pgkeller, just wondering if you've seen something like this.

[image: top output]
davorchap commented 1 year ago

Or is the 100.3% because we have 79 threads, and some of them occasionally wake up and use the CPU?

There's probably a high chance that the polling thread is being moved across cores; I wonder if that's interfering with polling of the device.

@TT-billteng I think it's worth trying slow dispatch with 1 thread.
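
If thread migration does turn out to matter, a minimal sketch of pinning the polling thread to one CPU with the standard Linux affinity API (just an experiment idea, not something the code does today as far as this thread shows):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread (e.g. the thread that polls the device) to a single CPU core.
// Requires _GNU_SOURCE (defined by default when building C++ with g++).
static int pin_current_thread_to_cpu(int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}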

(gdb) info threads
  Id   Target Id                                    Frame
* 1    Thread 0x7f1e5390c740 (LWP 1155321) "pytest" 0x00007f1e35d7e4ba in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
  2    Thread 0x7f1e32d1e700 (LWP 1155323) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354582e0 <thread_status+96>) at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7f1e3251d700 (LWP 1155324) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458360 <thread_status+224>) at ../sysdeps/nptl/futex-internal.h:183
  4    Thread 0x7f1e31d1c700 (LWP 1155325) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354583e0 <thread_status+352>) at ../sysdeps/nptl/futex-internal.h:183
  5    Thread 0x7f1e3151b700 (LWP 1155326) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458460 <thread_status+480>) at ../sysdeps/nptl/futex-internal.h:183
  6    Thread 0x7f1e30d1a700 (LWP 1155327) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354584e0 <thread_status+608>) at ../sysdeps/nptl/futex-internal.h:183
  7    Thread 0x7f1e28519700 (LWP 1155328) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458560 <thread_status+736>) at ../sysdeps/nptl/futex-internal.h:183
  8    Thread 0x7f1e27d18700 (LWP 1155329) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354585e0 <thread_status+864>) at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7f1e27517700 (LWP 1155330) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458660 <thread_status+992>) at ../sysdeps/nptl/futex-internal.h:183
  10   Thread 0x7f1e24d16700 (LWP 1155331) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354586e0 <thread_status+1120>) at ../sysdeps/nptl/futex-internal.h:183
  11   Thread 0x7f1e24515700 (LWP 1155332) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458760 <thread_status+1248>) at ../sysdeps/nptl/futex-internal.h:183
  12   Thread 0x7f1e1fd14700 (LWP 1155333) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354587e0 <thread_status+1376>) at ../sysdeps/nptl/futex-internal.h:183
  13   Thread 0x7f1e19513700 (LWP 1155334) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458860 <thread_status+1504>) at ../sysdeps/nptl/futex-internal.h:183
  14   Thread 0x7f1e16d12700 (LWP 1155335) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354588e0 <thread_status+1632>) at ../sysdeps/nptl/futex-internal.h:183
  15   Thread 0x7f1e14511700 (LWP 1155336) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458960 <thread_status+1760>) at ../sysdeps/nptl/futex-internal.h:183
  16   Thread 0x7f1e13d10700 (LWP 1155337) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354589e0 <thread_status+1888>) at ../sysdeps/nptl/futex-internal.h:183
  17   Thread 0x7f1e1350f700 (LWP 1155338) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458a60 <thread_status+2016>) at ../sysdeps/nptl/futex-internal.h:183
  18   Thread 0x7f1e10d0e700 (LWP 1155339) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ae0 <thread_status+2144>) at ../sysdeps/nptl/futex-internal.h:183
  19   Thread 0x7f1e1050d700 (LWP 1155340) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458b60 <thread_status+2272>) at ../sysdeps/nptl/futex-internal.h:183
  20   Thread 0x7f1e0fd0c700 (LWP 1155341) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458be0 <thread_status+2400>) at ../sysdeps/nptl/futex-internal.h:183
  21   Thread 0x7f1e0f50b700 (LWP 1155342) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458c60 <thread_status+2528>) at ../sysdeps/nptl/futex-internal.h:183
  22   Thread 0x7f1e0ed0a700 (LWP 1155343) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ce0 <thread_status+2656>) at ../sysdeps/nptl/futex-internal.h:183
  23   Thread 0x7f1e0e509700 (LWP 1155344) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458d60 <thread_status+2784>) at ../sysdeps/nptl/futex-internal.h:183
  24   Thread 0x7f1e0dd08700 (LWP 1155345) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458de0 <thread_status+2912>) at ../sysdeps/nptl/futex-internal.h:183
  25   Thread 0x7f1e0d507700 (LWP 1155346) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458e60 <thread_status+3040>) at ../sysdeps/nptl/futex-internal.h:183
  26   Thread 0x7f1e0cd06700 (LWP 1155347) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458ee0 <thread_status+3168>) at ../sysdeps/nptl/futex-internal.h:183
  27   Thread 0x7f1e0c505700 (LWP 1155348) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458f60 <thread_status+3296>) at ../sysdeps/nptl/futex-internal.h:183
  28   Thread 0x7f1e0bd04700 (LWP 1155349) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35458fe0 <thread_status+3424>) at ../sysdeps/nptl/futex-internal.h:183
  29   Thread 0x7f1e0b503700 (LWP 1155350) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35459060 <thread_status+3552>) at ../sysdeps/nptl/futex-internal.h:183
  30   Thread 0x7f1e0ad02700 (LWP 1155351) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354590e0 <thread_status+3680>) at ../sysdeps/nptl/futex-internal.h:183
  31   Thread 0x7f1e0a501700 (LWP 1155352) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e35459160 <thread_status+3808>) at ../sysdeps/nptl/futex-internal.h:183
  32   Thread 0x7f1e09d00700 (LWP 1155353) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f1e354591e0 <thread_status+3936>) at ../sysdeps/nptl/futex-internal.h:183
  33   Thread 0x7f1dbbfb4700 (LWP 1155548) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  34   Thread 0x7f1dbc7b5700 (LWP 1155549) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  35   Thread 0x7f1dbcfb6700 (LWP 1155550) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  36   Thread 0x7f1d3bfff700 (LWP 1155551) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  37   Thread 0x7f1dbb7b3700 (LWP 1155552) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  38   Thread 0x7f1dbafb2700 (LWP 1155553) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  39   Thread 0x7f1dba7b1700 (LWP 1155554) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  40   Thread 0x7f1db9fb0700 (LWP 1155555) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  41   Thread 0x7f1db97af700 (LWP 1155556) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  42   Thread 0x7f1db8fae700 (LWP 1155557) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  43   Thread 0x7f1db3fff700 (LWP 1155558) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  44   Thread 0x7f1db37fe700 (LWP 1155559) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  45   Thread 0x7f1db2ffd700 (LWP 1155560) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  46   Thread 0x7f1db27fc700 (LWP 1155561) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  47   Thread 0x7f1db1ffb700 (LWP 1155562) "pytest" 0x00007f1e5141fb31 in ?? () from /home/dcapalija/tt-metal/build/python_env/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  48   Thread 0x7f1db17fa700 (LWP 1155760) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb1a8) at ../sysdeps/nptl/futex-internal.h:183
  49   Thread 0x7f1db0ff9700 (LWP 1155761) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb218) at ../sysdeps/nptl/futex-internal.h:183
  50   Thread 0x7f1d08ff9700 (LWP 1155762) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb288) at ../sysdeps/nptl/futex-internal.h:183
  51   Thread 0x7f1d0bfff700 (LWP 1155763) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb2f8) at ../sysdeps/nptl/futex-internal.h:183
  52   Thread 0x7f1d3b7fe700 (LWP 1155764) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb368) at ../sysdeps/nptl/futex-internal.h:183
  53   Thread 0x7f1d3affd700 (LWP 1155765) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb3d8) at ../sysdeps/nptl/futex-internal.h:183
  54   Thread 0x7f1d3a7fc700 (LWP 1155766) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb448) at ../sysdeps/nptl/futex-internal.h:183
  55   Thread 0x7f1d39ffb700 (LWP 1155767) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb4b8) at ../sysdeps/nptl/futex-internal.h:183
  56   Thread 0x7f1d397fa700 (LWP 1155768) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb528) at ../sysdeps/nptl/futex-internal.h:183
  57   Thread 0x7f1d38ff9700 (LWP 1155769) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb598) at ../sysdeps/nptl/futex-internal.h:183
  58   Thread 0x7f1d13fff700 (LWP 1155770) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb608) at ../sysdeps/nptl/futex-internal.h:183
  59   Thread 0x7f1d137fe700 (LWP 1155771) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb678) at ../sysdeps/nptl/futex-internal.h:183
  60   Thread 0x7f1d12ffd700 (LWP 1155772) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb6e8) at ../sysdeps/nptl/futex-internal.h:183
  61   Thread 0x7f1d127fc700 (LWP 1155773) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb758) at ../sysdeps/nptl/futex-internal.h:183
  62   Thread 0x7f1d11ffb700 (LWP 1155774) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb7c8) at ../sysdeps/nptl/futex-internal.h:183
  63   Thread 0x7f1d117fa700 (LWP 1155775) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb838) at ../sysdeps/nptl/futex-internal.h:183
  64   Thread 0x7f1d10ff9700 (LWP 1155776) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb8a8) at ../sysdeps/nptl/futex-internal.h:183
  65   Thread 0x7f1d0b7fe700 (LWP 1155777) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb918) at ../sysdeps/nptl/futex-internal.h:183
  66   Thread 0x7f1d0affd700 (LWP 1155778) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb988) at ../sysdeps/nptl/futex-internal.h:183
  67   Thread 0x7f1d0a7fc700 (LWP 1155779) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfb9f8) at ../sysdeps/nptl/futex-internal.h:183
  68   Thread 0x7f1d09ffb700 (LWP 1155780) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfba68) at ../sysdeps/nptl/futex-internal.h:183
  69   Thread 0x7f1d097fa700 (LWP 1155781) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbad8) at ../sysdeps/nptl/futex-internal.h:183
--Type <RET> for more, q to quit, c to continue without paging-- c
  70   Thread 0x7f1d03fff700 (LWP 1155782) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbb48) at ../sysdeps/nptl/futex-internal.h:183
  71   Thread 0x7f1d037fe700 (LWP 1155783) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbbb8) at ../sysdeps/nptl/futex-internal.h:183
  72   Thread 0x7f1d02ffd700 (LWP 1155784) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbc28) at ../sysdeps/nptl/futex-internal.h:183
  73   Thread 0x7f1d027fc700 (LWP 1155785) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbc98) at ../sysdeps/nptl/futex-internal.h:183
  74   Thread 0x7f1d01ffb700 (LWP 1155786) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbd08) at ../sysdeps/nptl/futex-internal.h:183
  75   Thread 0x7f1d017fa700 (LWP 1155787) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbd78) at ../sysdeps/nptl/futex-internal.h:183
  76   Thread 0x7f1d00ff9700 (LWP 1155788) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbde8) at ../sysdeps/nptl/futex-internal.h:183
  77   Thread 0x7f1cebfff700 (LWP 1155789) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbe5c) at ../sysdeps/nptl/futex-internal.h:183
  78   Thread 0x7f1ceb7fe700 (LWP 1155790) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbecc) at ../sysdeps/nptl/futex-internal.h:183
  79   Thread 0x7f1ceaffd700 (LWP 1155791) "pytest" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x4cfbf38) at ../sysdeps/nptl/futex-internal.h:183
(gdb)
davorchap commented 1 year ago

Using the new polling scheme I got this data: the entire row y=2 hangs (8 cores) on a 56-core unpad, so this is very likely a harvesting bug.

It's a non-deterministic hang, as expected from a harvesting bug; this time it hung on the 144th iteration of the test.

This PR has the new polling scheme: https://github.com/tenstorrent-metal/tt-metal/pull/2696. We need to review our harvesting.
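
For context, a rough sketch of the kind of poll-and-report loop involved (illustrative only: read_hex_vec_from_core follows the backtrace above, while RUN_MAILBOX_ADDR and RUN_MSG_DONE are placeholder names for the real mailbox constants). Its output is the "Not done cores" lines below:

// Poll each launched core's run mailbox and report the ones still running.
std::vector<CoreCoord> not_done(cores.begin(), cores.end());
while (!not_done.empty()) {
    std::vector<CoreCoord> still_running;
    for (const auto &core : not_done) {
        auto val = tt::llrt::read_hex_vec_from_core(cluster, chip_id, core, RUN_MAILBOX_ADDR, sizeof(uint32_t));
        if (val.at(0) != RUN_MSG_DONE) {
            still_running.push_back(core);
        }
    }
    if (!still_running.empty()) {
        std::cout << "Not done cores: ";
        for (const auto &core : still_running) {
            std::cout << "(x=" << core.x << ",y=" << core.y << ") ";
        }
        std::cout << std::endl;
    }
    not_done = std::move(still_running);
}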

Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
davorchap commented 1 year ago

The cores are polled in the order shown below, which is simply how they happen to be stored in the set / RB tree. This actually helped: we effectively poll in a "random" order, which rules out polling order as a potential suspect.

Out of this random order, row y=2 emerges as the hanging row.

Not done cores: (x=7,y=4) (x=5,y=5) (x=3,y=6) (x=6,y=4) (x=4,y=5) (x=2,y=6) (x=7,y=5) (x=5,y=4) (x=1,y=6) (x=7,y=6) (x=3,y=4) (x=1,y=5) (x=5,y=6) (x=3,y=5) (x=1,y=4) (x=7,y=0) (x=5,y=1) (x=3,y=2) (x=1,y=3) (x=7,y=1) (x=5,y=0) (x=3,y=3) (x=1,y=2) (x=7,y=2) (x=5,y=3) (x=3,y=0) (x=1,y=1) (x=7,y=3) (x=5,y=2) (x=3,y=1) (x=1,y=0) (x=6,y=5) (x=4,y=4) (x=0,y=6) (x=6,y=6) (x=2,y=4) (x=0,y=5) (x=4,y=6) (x=2,y=5) (x=0,y=4) (x=6,y=0) (x=4,y=1) (x=2,y=2) (x=0,y=3) (x=6,y=1) (x=4,y=0) (x=2,y=3) (x=0,y=2) (x=6,y=2) (x=4,y=3) (x=2,y=0) (x=0,y=1) (x=6,y=3) (x=4,y=2) (x=2,y=1) (x=0,y=0)
davorchap commented 1 year ago

Was able to repro the same hang again, on the 71st iteration:

Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
TT-billteng commented 1 year ago

great find!

davorchap commented 1 year ago

Same hang signature on 2 different N300 machines: t3005 and t3002.

I was able to repro the exact same hang on t3002; it hung on the 832nd iteration of the test. It's interesting that the exact same "logical row" hangs. I will dig into the physical coordinates next.

Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
Not done cores: (x=3,y=2) (x=1,y=2) (x=7,y=2) (x=5,y=2) (x=2,y=2) (x=0,y=2) (x=6,y=2) (x=4,y=2)
davorchap commented 1 year ago

on t3005:

(gdb) print core
$8 = (const tt_xy_pair &) @0x5dc2788: {x = 5, y = 2}
(gdb) call device->worker_core_from_logical_core(core)
$9 = {x = 7, y = 3}

on t3002 (same mapping):

hanging_core, logical = (x=5,y=2), routing = (x=7,y=3)
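
For reference, the logical-to-routing mapping above is consistent with a simple lookup into the worker column list and the post-harvest worker row list (a sketch with values hard-coded from this machine's printouts, not the actual worker_core_from_logical_core implementation):

#include <array>
#include <cstdint>

struct CoreCoord { std::uint32_t x; std::uint32_t y; };

// Worker NOC columns from the WH SOC descriptor, and the worker rows left on t3005
// after the harvested and storage/dispatch rows are taken out (illustrative values).
static constexpr std::array<std::uint32_t, 8> kWorkerCols = {1, 2, 3, 4, 6, 7, 8, 9};
static constexpr std::array<std::uint32_t, 7> kWorkerRows = {1, 2, 3, 4, 5, 7, 9};

// Logical (5, 2) -> routing (7, 3), matching the gdb output above.
CoreCoord worker_core_from_logical(CoreCoord logical) {
    return CoreCoord{kWorkerCols.at(logical.x), kWorkerRows.at(logical.y)};
}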
davorchap commented 1 year ago

As per @abhullar-tt's suggestion, I added prints to the row extraction function. Based on the prints and Aditya's comment (// For Wormhole, we always remove the last few rows in the SOC descriptor in case of harvesting), on N300 (2-row harvesting) UMD will just remove the last two rows (10, 11) from the SOC descriptor. That means the list of worker cores in the SOC descriptor is, from UMD's perspective, "virtual", i.e. it does not take out the actually harvested rows. The actually harvested rows do seem to be removed on GS; on WH the removal is virtual.

See the examples on t3002 and t3005: UMD just removes 10/11, but different rows are harvested.

std::vector<int> tt_SiliconDevice::extract_rows_to_remove(const tt::ARCH &arch, const int worker_grid_rows, const int harvested_rows) {
    // Check if harvesting config is legal for GS and WH
    tt_device_logger::log_assert(!((harvested_rows & 1) || (harvested_rows & 64) || (harvested_rows & 0xFFFFF000)), "For grayskull and wormhole, only rows 1-5 and 7-11 can be harvested");
    std::vector<int> row_coordinates_to_remove;
    int row_coordinate = 0;
    int tmp = harvested_rows;
    while (tmp) {
        if (tmp & 1)
            row_coordinates_to_remove.push_back(row_coordinate);

        tmp = tmp >> 1;
        row_coordinate++;
    }
    std::cerr << "Rows to remove: ";
    if (arch == tt::ARCH::WORMHOLE || arch == tt::ARCH::WORMHOLE_B0) {
        // For Wormhole, we always remove the last few rows in the SOC descriptor in case of harvesting
        for (int i = 0; i < row_coordinates_to_remove.size(); i++) {
            std::cout << "(extracted = " << row_coordinates_to_remove[i] << ", ";
            row_coordinates_to_remove[i] = worker_grid_rows - i;
            std::cout << "what_UMD_removes = " << row_coordinates_to_remove[i] << ")  ";
        }
    }
    std::cerr << std::endl;
    return row_coordinates_to_remove;
}
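
A worked example of that WH branch (assuming worker_grid_rows is 11 here, so the result matches the "UMD removes 10/11" observation above):

// Suppose rows 3 and 8 are physically harvested:
//   harvested_rows       = (1 << 3) | (1 << 8)
//   first loop collects  -> {3, 8}
//   WH branch overwrites -> {worker_grid_rows - 0, worker_grid_rows - 1} = {11, 10}
// i.e. the rows removed from the SOC descriptor are always the last ones,
// regardless of which rows were actually harvested.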

From wormhole_b0_8x10.yaml (the same list appears in wormhole_b0_80.yaml): UMD will always remove the last two rows.

functional_workers:
  [
   1-1,   2-1,   3-1,   4-1,   6-1,   7-1,   8-1,   9-1, 
   1-2,   2-2,   3-2,   4-2,   6-2,   7-2,   8-2,   9-2, 
   1-3,   2-3,   3-3,   4-3,   6-3,   7-3,   8-3,   9-3, 
   1-4,   2-4,   3-4,   4-4,   6-4,   7-4,   8-4,   9-4, 
   1-5,   2-5,   3-5,   4-5,   6-5,   7-5,   8-5,   9-5, 
   1-7,   2-7,   3-7,   4-7,   6-7,   7-7,   8-7,   9-7, 
   1-8,   2-8,   3-8,   4-8,   6-8,   7-8,   8-8,   9-8, 
   1-9,   2-9,   3-9,   4-9,   6-9,   7-9,   8-9,   9-9, 
   1-10,  2-10,  3-10,  4-10,  6-10,  7-10,  8-10,  9-10, 
   1-11,  2-11,  3-11,  4-11,  6-11,  7-11,  8-11,  9-11, 
  ]
davorchap commented 1 year ago

Cores that we're polling on t3005:

Not done cores (logical): (x=7,y=4) (x=5,y=5) (x=3,y=6) (x=6,y=4) (x=4,y=5) (x=2,y=6) (x=7,y=5) (x=5,y=4) (x=1,y=6) (x=7,y=6) (x=3,y=4) (x=1,y=5) (x=5,y=6) (x=3,y=5) (x=1,y=4) (x=7,y=0) (x=5,y=1) (x=3,y=2) (x=1,y=3) (x=7,y=1) (x=5,y=0) (x=3,y=3) (x=1,y=2) (x=7,y=2) (x=5,y=3) (x=3,y=0) (x=1,y=1) (x=7,y=3) (x=5,y=2) (x=3,y=1) (x=1,y=0) (x=6,y=5) (x=4,y=4) (x=0,y=6) (x=6,y=6) (x=2,y=4) (x=0,y=5) (x=4,y=6) (x=2,y=5) (x=0,y=4) (x=6,y=0) (x=4,y=1) (x=2,y=2) (x=0,y=3) (x=6,y=1) (x=4,y=0) (x=2,y=3) (x=0,y=2) (x=6,y=2) (x=4,y=3) (x=2,y=0) (x=0,y=1) (x=6,y=3) (x=4,y=2) (x=2,y=1) (x=0,y=0)

Not done cores (physical routing): (x=9,y=5) (x=7,y=7) (x=4,y=9) (x=8,y=5) (x=6,y=7) (x=3,y=9) (x=9,y=7) (x=7,y=5) (x=2,y=9) (x=9,y=9) (x=4,y=5) (x=2,y=7) (x=7,y=9) (x=4,y=7) (x=2,y=5) (x=9,y=1) (x=7,y=2) (x=4,y=3) (x=2,y=4) (x=9,y=2) (x=7,y=1) (x=4,y=4) (x=2,y=3) (x=9,y=3) (x=7,y=4) (x=4,y=1) (x=2,y=2) (x=9,y=4) (x=7,y=3) (x=4,y=2) (x=2,y=1) (x=8,y=7) (x=6,y=5) (x=1,y=9) (x=8,y=9) (x=3,y=5) (x=1,y=7) (x=6,y=9) (x=3,y=7) (x=1,y=5) (x=8,y=1) (x=6,y=2) (x=3,y=3) (x=1,y=4) (x=8,y=2) (x=6,y=1) (x=3,y=4) (x=1,y=3) (x=8,y=3) (x=6,y=4) (x=3,y=1) (x=1,y=2) (x=8,y=4) (x=6,y=3) (x=3,y=2) (x=1,y=1)

Not done cores (UMD virtual): (chip=0,x=9,y=5) (chip=0,x=7,y=7) (chip=0,x=4,y=8) (chip=0,x=8,y=5) (chip=0,x=6,y=7) (chip=0,x=3,y=8) (chip=0,x=9,y=7) (chip=0,x=7,y=5) (chip=0,x=2,y=8) (chip=0,x=9,y=8) (chip=0,x=4,y=5) (chip=0,x=2,y=7) (chip=0,x=7,y=8) (chip=0,x=4,y=7) (chip=0,x=2,y=5) (chip=0,x=9,y=1) (chip=0,x=7,y=2) (chip=0,x=4,y=3) (chip=0,x=2,y=4) (chip=0,x=9,y=2) (chip=0,x=7,y=1) (chip=0,x=4,y=4) (chip=0,x=2,y=3) (chip=0,x=9,y=3) (chip=0,x=7,y=4) (chip=0,x=4,y=1) (chip=0,x=2,y=2) (chip=0,x=9,y=4) (chip=0,x=7,y=3) (chip=0,x=4,y=2) (chip=0,x=2,y=1) (chip=0,x=8,y=7) (chip=0,x=6,y=5) (chip=0,x=1,y=8) (chip=0,x=8,y=8) (chip=0,x=3,y=5) (chip=0,x=1,y=7) (chip=0,x=6,y=8) (chip=0,x=3,y=7) (chip=0,x=1,y=5) (chip=0,x=8,y=1) (chip=0,x=6,y=2) (chip=0,x=3,y=3) (chip=0,x=1,y=4) (chip=0,x=8,y=2) (chip=0,x=6,y=1) (chip=0,x=3,y=4) (chip=0,x=1,y=3) (chip=0,x=8,y=3) (chip=0,x=6,y=4) (chip=0,x=3,y=1) (chip=0,x=1,y=2) (chip=0,x=8,y=4) (chip=0,x=6,y=3) (chip=0,x=3,y=2) (chip=0,x=1,y=1)

Rows 8, 10, and 11 are removed from the physical routing list (worker_core_from_logical_core): 8 and 11 presumably based on the extracted harvesting info, and 10 for storage/dispatch.

The virtual list has 9, 10, and 11 removed: 10/11 for harvesting, and 9 for storage/dispatch.

So far in these printouts I don't see a harvesting bug. But it is very suspicious that we always hang on that same row, on both t3002 and t3005. I have been able to repro it ~5 times so far.

Also printed the cores for which we're setting up static TLBs. This seems to be done via UMD virtual coords: rows 1-5 and 7-9 for workers, with rows 0 and 6 added for ETH. I assume this is what UMD expects: TLBs are programmed via virtual coords, since TLBs are looked up via virtual coords.

setting up TLB for cores:
(x=1,y=1) (x=2,y=1) (x=3,y=1) (x=4,y=1) (x=6,y=1) (x=7,y=1) (x=8,y=1) (x=9,y=1) (x=1,y=2) (x=2,y=2) (x=3,y=2) (x=4,y=2) (x=6,y=2) (x=7,y=2) (x=8,y=2) (x=9,y=2) (x=1,y=3) (x=2,y=3) (x=3,y=3) (x=4,y=3) (x=6,y=3) (x=7,y=3) (x=8,y=3) (x=9,y=3) (x=1,y=4) (x=2,y=4) (x=3,y=4) (x=4,y=4) (x=6,y=4) (x=7,y=4) (x=8,y=4) (x=9,y=4) (x=1,y=5) (x=2,y=5) (x=3,y=5) (x=4,y=5) (x=6,y=5) (x=7,y=5) (x=8,y=5) (x=9,y=5) (x=1,y=7) (x=2,y=7) (x=3,y=7) (x=4,y=7) (x=6,y=7) (x=7,y=7) (x=8,y=7) (x=9,y=7) (x=1,y=8) (x=2,y=8) (x=3,y=8) (x=4,y=8) (x=6,y=8) (x=7,y=8) (x=8,y=8) (x=9,y=8) (x=1,y=9) (x=2,y=9) (x=3,y=9) (x=4,y=9) (x=6,y=9) (x=7,y=9) (x=8,y=9) (x=9,y=9) (x=9,y=0) (x=1,y=0) (x=8,y=0) (x=2,y=0) (x=7,y=0) (x=3,y=0) (x=6,y=0) (x=4,y=0) (x=9,y=6) (x=1,y=6) (x=8,y=6) (x=2,y=6) (x=7,y=6) (x=3,y=6) (x=6,y=6) (x=4,y=6)

@abhullar-tt @kkwong10 we're doing std::int32_t get_static_tlb_index(CoreCoord target) in tt_cluster.cpp, and all of the TLB maps use virtual UMD coords. I wonder if the issue has something to do with which TLB we use to poll, and which one we should be using. The snippet below suggests that there is a 2 MB polling TLB, and that the TLB setup is aligned with the KMD's WC/UC split (a sketch of polling through it follows the constants).

 // MEM_*_TLB are for dynamic read/writes to memory, either 16MB (large read/writes) or 2MB (polling). REG_TLB for dynamic writes
    // to registers.   They are aligned with the kernel driver's WC/UC split.  But kernel driver uses different TLB's for these.
    static constexpr unsigned int REG_TLB                   = TLB_BASE_INDEX_16M + 18;
    static constexpr unsigned int MEM_LARGE_WRITE_TLB       = TLB_BASE_INDEX_16M + 17;
    static constexpr unsigned int MEM_LARGE_READ_TLB        = TLB_BASE_INDEX_16M + 0;
    static constexpr unsigned int MEM_SMALL_READ_WRITE_TLB  = TLB_BASE_INDEX_2M + 1;
    static constexpr uint32_t DYNAMIC_TLB_BASE_INDEX = MEM_LARGE_READ_TLB + 1;
    static constexpr uint32_t INTERNAL_TLB_INDEX = DYNAMIC_TLB_BASE_INDEX + DYNAMIC_TLB_COUNT; // pcie_write_xy and similar
    static constexpr uint32_t DRAM_CHANNEL_0_X = 0;
    static constexpr uint32_t DRAM_CHANNEL_0_Y = 0;
    static constexpr uint32_t DRAM_CHANNEL_0_PEER2PEER_REGION_START = 0x30000000; // This is the last 256MB of DRAM
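
For illustration only (not validated): if "SMALL_READ_WRITE_TLB" -- the 2 MB, presumably UC-mapped index -- were registered with the driver, the done-poll could go through it instead of "LARGE_READ_TLB". A minimal sketch, assuming the mailbox address and done value used elsewhere in this thread; run_mailbox_addr and RUN_MSG_DONE are placeholders, not real identifiers:

    // Hypothetical: poll the run mailbox through the 2 MB TLB rather than LARGE_READ_TLB.
    // Assumes "SMALL_READ_WRITE_TLB" was added to dynamic_tlb_config at device-open time
    // (see the uplift discussed further down in this thread).
    std::vector<uint32_t> run_mailbox(1, 0);
    do {
        device->read_from_device(run_mailbox.data(), virtual_core, run_mailbox_addr,
                                 run_mailbox.size() * sizeof(uint32_t), "SMALL_READ_WRITE_TLB");
    } while (run_mailbox[0] != RUN_MSG_DONE);  // RUN_MSG_DONE: placeholder for the expected done value
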
davorchap commented 1 year ago

@abhullar-tt @kkwong10 @pgkeller @DrJessop -- lots of interesting data. Although I wouldn't rule out harvesting, what's really sus is that the exact same row is hanging on 2 N300s, the two machines have different rows harvested, and the row that's hanging is not harvested on either. So what is configured the same across devices? One thing I can think of is TLBs: we have static TLBs and 4 dynamic types (LARGE_READ_TLB, LARGE_WRITE_TLB, SMALL_READ_WRITE_TLB, REG_TLB), and different TLBs have different WC/UC mappings. My suspicion is that any resets (reg writes), "go" signals, and any "polling" should be done via UC writes/reads. This doesn't seem to be the case today.

1) Writing the reset via "LARGE_WRITE_TLB" is what we use currently, and it is the only option I was able to try; I get a crash for every other option, so I couldn't really see what the effect of the other TLB types would be.

  //device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "REG_TLB"); // crash
        //device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "SMALL_READ_WRITE_TLB"); // crash
        device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 

2) Resetting the full device results in mismatches. Are we still downloading blanks to unused cores / storage cores? I think we should get this to work, for several reasons:

3) Writing to the reset reg 2 times in a row to the same core results in very frequent hangs -- and the number of cores that hang varies; it can be 20-50. Writing to this reg with the de-assert mask starts BRISC but keeps NCRISC/TRISC in reset, so my guess is that the first write kicks off BRISC, BRISC itself then kicks off NCRISC/TRISC, but the 2nd write comes and asserts reset on NCRISC/TRISC so they never finish. The variability in timing is interesting, because sometimes the writes arrive close enough together that there is no hang.

device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_sfence();

       // device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_sfence();

4) Polling the reset register to check that the value was written hangs (unless there's a bug in this code). This is unexpected; it would be great to be able to do this, to confirm that BRISC on a core did come out of reset. I hope this can be made to work.

        tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
        device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_mfence();

        std::vector<uint32_t> vec_read(1);
        vec_read[0] = 0xfacafaca;
        while (vec_read[0] != vec[0]) {
            device->read_from_device(vec_read.data(), virtual_core,  0xFFB121B0, vec.size(), "LARGE_READ_TLB"); // hangs
        }

5) Based on experiment 3 (back-to-back writes to reset), I made the 2nd write just send a 0. This should mean "don't reset any RISCV", so my expectation is a no-op. The test passed once, and then the 2nd time I got a 0 in the run mailbox when checking done. That is very sus; before de-asserting resets we write the mailbox and check its value.

Read unexpected run_mailbox value: 0 (expected 2a or 1)
abhullar-tt commented 1 year ago

@abhullar-tt @kkwong10 @pgkeller @DrJessop -- lots of interesting data. Although I wouldn't rule out harvesting, what's really sus is that the exact same row is hanging on 2 N300s, the two machines have different rows harvested, and the row that's hanging is not harvested on either. So what is configured the same across devices? One thing I can think of is TLBs: we have static TLBs and 4 dynamic types (LARGE_READ_TLB, LARGE_WRITE_TLB, SMALL_READ_WRITE_TLB, REG_TLB), and different TLBs have different WC/UC mappings. My suspicion is that any resets (reg writes), "go" signals, and any "polling" should be done via UC writes/reads. This doesn't seem to be the case today.

  1. Writing the reset via "LARGE_WRITE_TLB" is what we use currently, and it is the only option I was able to try; I get a crash for every other option, so I couldn't really see what the effect of the other TLB types would be.
  //device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "REG_TLB"); // crash
        //device->write_to_device(&vec[0], vec.size(), virtual_core, 0xFFB121B0, "SMALL_READ_WRITE_TLB"); // crash
        device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 

UMD constructor (see cluster->open_device) accepts dynamic TLB config as a param (string -> int map). By default the constructor adds:

    // It is mandatory for all devices to have these TLBs set aside, as the driver needs them to issue remote reads and writes.
    dynamic_tlb_config["LARGE_READ_TLB"] =  DEVICE_DATA.MEM_LARGE_READ_TLB;
    dynamic_tlb_config["LARGE_WRITE_TLB"] = DEVICE_DATA.MEM_LARGE_WRITE_TLB;

In this experiment, was the dynamic TLB config uplifted? If we want to use REG_TLB and MEM_SMALL_READ_WRITE_TLB, then in open_device we should uplift to:

        uint32_t num_host_mem_ch_per_mmio_device = 1;
        std::unordered_map<std::string, std::int32_t> dynamic_tlb_config = {
            {"REG_TLB", DEVICE_DATA.REG_TLB},
            {"SMALL_READ_WRITE_TLB", DEVICE_DATA.MEM_SMALL_READ_WRITE_TLB}
        };
        // This will remove harvested rows from the soc descriptor
        const bool perform_harvesting = true;

        device = std::make_unique<tt_SiliconDevice>(sdesc_path, ndesc_path, target_device_ids, num_host_mem_ch_per_mmio_device, dynamic_tlb_config, skip_driver_allocs, perform_harvesting);

During initial port to open-UMD we had decided to use the default.

  2. Resetting the full device results in mismatches. Are we still downloading blanks to unused cores / storage cores? I think we should get this to work, for several reasons:

    • in UMD it's actually implemented using REG_TLB and a full chip broadcast. I think REG_TLB matters (UC?), broadcast may not matter
    • this is what BBE uses
    • this is what metal used and then we changed to LARGE_WRITE_TLB
    • so we went from "REG_TLB + broadcast" to "LARGE_WRITE_TLB + unicast"
   //cluster->deassert_risc_reset(device_id, false);
// which calls
//         device->deassert_risc_reset(target_device_id);

We are writing blanks to all unused cores but not to storage-only cores. Was this modified in launch kernels? Maybe the hang here is because we aren't taking all of the cores out of reset anymore (only the ones used in the program)?

  3. Writing to the reset reg 2 times in a row to the same core results in very frequent hangs -- and the number of cores that hang varies; it can be 20-50. Writing to this reg with the de-assert mask starts BRISC but keeps NCRISC/TRISC in reset, so my guess is that the first write kicks off BRISC, BRISC itself then kicks off NCRISC/TRISC, but the 2nd write comes and asserts reset on NCRISC/TRISC so they never finish. The variability in timing is interesting, because sometimes the writes arrive close enough together that there is no hang.
device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_sfence();

       // device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_sfence();
  4. Polling the reset register to check that the value was written hangs (unless there's a bug in this code). This is unexpected; it would be great to be able to do this, to confirm that BRISC on a core did come out of reset. I hope this can be made to work.
        tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
        device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB"); 
        _mm_mfence();

        std::vector<uint32_t> vec_read(1);
        vec_read[0] = 0xfacafaca;
        while (vec_read[0] != vec[0]) {
            device->read_from_device(vec_read.data(), virtual_core,  0xFFB121B0, vec.size(), "LARGE_READ_TLB"); // hangs
        }

The size to pass into read_from_device should be vec.size() * sizeof(uint32_t).
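
For reference, a sketch of the polling experiment from point 4 with that fix applied; the write call and fence are unchanged from the snippet above, and only the read size is scaled to bytes:

    tt_cxy_pair virtual_core = this->convert_physical_cxy_to_virtual(core);
    device->write_to_device(vec.data(), vec.size(), virtual_core, 0xFFB121B0, "LARGE_WRITE_TLB");
    _mm_mfence();

    std::vector<uint32_t> vec_read(1);
    vec_read[0] = 0xfacafaca;
    while (vec_read[0] != vec[0]) {
        // read size is in bytes, not elements
        device->read_from_device(vec_read.data(), virtual_core, 0xFFB121B0,
                                 vec_read.size() * sizeof(uint32_t), "LARGE_READ_TLB");
    }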

  5. Based on experiment 3 (back-to-back writes to reset), I made the 2nd write just send a 0. This should mean "don't reset any RISCV", so my expectation is a no-op. The test passed once, and then the 2nd time I got a 0 in the run mailbox when checking done. That is very sus; before de-asserting resets we write the mailbox and check its value.
Read unexpected run_mailbox value: 0 (expected 2a or 1)
davorchap commented 1 year ago

The same test hangs on the latest main after the fw-at-init changes, hash = 2229e00055af18b999a85df895bc94af098cc3d1. I haven't ported the polling simplification + debug-hanging-core change yet, so I don't have the hang pattern data yet.

It hung on the 4th iteration of the post-commit stress test; it does seem it's easier to make it hang as part of post-commit. It crashed on the 414th iteration of a stand-alone test run -- not sure if related, so I filed a separate case: https://github.com/tenstorrent-metal/tt-metal/issues/2700

Posting a longer test trace to see what ran just before, in case the system state turns out to be relevant. My guess is it matters, given the 4th iteration of post-commit vs. the 414th standalone.

2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:06.335 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-18 06:08:06.393 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:06.394 | WARNING  | SiliconDriver   - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 156649 tid: 140524614661952)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:06.418 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.912 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.913 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.913 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-wh test passed with input shape [[1, 1, 32, 32]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0] 2023-09-18 06:08:07.918 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:07.918 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.920 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-cn test passed with input shape [[1, 1, 32, 32]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes1] SKIPPED (not working for Wormhole B0)                  Metal | INFO     | Closing device 0

=============================================================================================================== PASSES ================================================================================================================
======================================================================================================= short test summary info =======================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0]
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0]
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:49: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:98: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:121: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:144: not working for Wormhole B0
==================================================================================================== 2 passed, 8 skipped in 3.02s =====================================================================================================
                  Metal | INFO     | Closing device driver
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:10.968 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-18 06:08:11.025 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:11.025 | WARNING  | SiliconDriver   - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 157310 tid: 140332769195840)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:11.050 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-18 06:08:11.476 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-18 06:08:11.476 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
muthutt commented 1 year ago

I see the debug strategy; thanks, Davor Capalija.

On Sat, Sep 16, 2023 at 1:48 AM Davor Capalija wrote:

@tt-aho https://github.com/tt-aho unpad was recently made multi-core -- is this specific config running single-core or multi-core?

@muthutt https://github.com/muthutt -- gdb is reporting that we're waiting on core {x = 7, y = 1}, so this test is either multi-core, or there's some bug in this test, or the kernel launched on the wrong core.

(gdb) up
#4  0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
    at /usr/include/c++/9/bits/char_traits.h:300
300           { c1 = c2; }
(gdb) info locals
chip_id = <optimized out>
virtual_dram_core = {<tt_xy_pair> = {x = 7, y = 1}, chip = 0}
(gdb) up
#5  0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
    at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
38      tt_cxy_pair(std::size_t ichip, tt_xy_pair xy_pair) : tt_xy_pair(xy_pair.x, xy_pair.y), chip(ichip) {}
(gdb) info locals
read_hex_vec = std::vector of length 1, capacity 1 = {0}
(gdb) up
#6  0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
429             run_mailbox_read_val = read_hex_vec_from_core(
(gdb) info locals
Python Exception <class 'gdb.error'> value has been optimized out:
run_mailbox_read_val = <optimized out>
cluster = @0x7ffc7b436e98: 0x64a0550
chip_id = @0x7ffc7b436e94: 0
core = @0x7ffc7b436f70: {x = 7, y = 1}

OK, definitely a multi-core test, on 56 cores -- got this from gdb:

(gdb) up
#11 tt::tt_metal::Kernel::logical_cores (this=<optimized out>) at tt_metal/impl/kernels/kernel.cpp:33
33          for (auto core_range : this->core_range_set_.ranges()) {
(gdb) info locals
core_range = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
cores = std::set with 56 elements = {[0] = {x = 0, y = 0}, [1] = {x = 0, y = 1}, [2] = {x = 0, y = 2}, [3] = {x = 0, y = 3}, [4] = {x = 0, y = 4}, [5] = {x = 0, y = 5}, [6] = {x = 0, y = 6}, [7] = {x = 1, y = 0}, [8] = {x = 1, y = 1}, [9] = {x = 1, y = 2}, [10] = {x = 1, y = 3}, [11] = {x = 1, y = 4}, [12] = {x = 1, y = 5}, [13] = {x = 1, y = 6}, [14] = {x = 2, y = 0}, [15] = {x = 2, y = 1}, [16] = {x = 2, y = 2}, [17] = {x = 2, y = 3}, [18] = {x = 2, y = 4}, [19] = {x = 2, y = 5}, [20] = {x = 2, y = 6}, [21] = {x = 3, y = 0}, [22] = {x = 3, y = 1}, [23] = {x = 3, y = 2}, [24] = {x = 3, y = 3}, [25] = {x = 3, y = 4}, [26] = {x = 3, y = 5}, [27] = {x = 3, y = 6}, [28] = {x = 4, y = 0}, [29] = {x = 4, y = 1}, [30] = {x = 4, y = 2}, [31] = {x = 4, y = 3}, [32] = {x = 4, y = 4}, [33] = {x = 4, y = 5}, [34] = {x = 4, y = 6}, [35] = {x = 5, y = 0}, [36] = {x = 5, y = 1}, [37] = {x = 5, y = 2}, [38] = {x = 5, y = 3}, [39] = {x = 5, y = 4}, [40] = {x = 5, y = 5}, [41] = {x = 5, y = 6}, [42] = {x = 6, y = 0}, [43] = {x = 6, y = 1}, [44] = {x = 6, y = 2}, [45] = {x = 6, y = 3}, [46] = {x = 6, y = 4}, [47] = {x = 6, y = 5}, [48] = {x = 6, y = 6}, [49] = {x = 7, y = 0}, [50] = {x = 7, y = 1}, [51] = {x = 7, y = 2}, [52] = {x = 7, y = 3}, [53] = {x = 7, y = 4}, [54] = {x = 7, y = 5}, [55] = {x = 7, y = 6}}


kkwong10 commented 1 year ago

Trying to repro on my ird machine. I am not able to repro the hang in 200 runs. Machine configuration:

(screenshot of machine configuration)

I don't run warm-reset in between. -- Trying longer since it seems we crash on iter# 414 in standalone

abhullar-tt commented 1 year ago

Trying to repro on my ird machine. I am not able to repro the hang in 200 runs. Machine configuration (screenshot).

I don't run warm-reset in between. -- Trying longer since it seems we crash on iter# 414 in standalone

Do you know what the harvesting config on this machine is?

kkwong10 commented 1 year ago

We are soft-harvesting for this machine:

harvesting: [
  0: {noc_translation: True, harvest_mask: 1},
]

This means the first tensix row (physical NOC y=1) is harvested.
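
For context, the mask is a per-row bitfield. Assuming bit i of harvest_mask marks tensix row i as harvested (consistent with harvest_mask = 1 meaning the first tensix row here), a quick decode looks like this minimal sketch:

    // Minimal sketch: turn a soft-harvesting mask into harvested tensix row indices.
    std::vector<int> harvested_tensix_rows(uint32_t harvest_mask) {
        std::vector<int> rows;
        for (int row = 0; harvest_mask != 0; ++row, harvest_mask >>= 1) {
            if (harvest_mask & 1) rows.push_back(row);
        }
        return rows;
    }
    // harvested_tensix_rows(1) -> {0}: tensix row 0, i.e. physical NOC y=1 on this machine.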

kkwong10 commented 1 year ago

An update: I managed to recreate the hang last night, on row 3 at iteration 500 -- which is odd since it does not match the "harvested" row, which in this case is tensix row 0, i.e. physical NOC y=1. Also, this machine has a soft-harvested configuration, so targeting the harvested row should not result in any erroneous behavior anyway.

                 Always | INFO     | Not Done Cores: {(x=1,y=3), (x=2,y=3), (x=3,y=3), (x=4,y=3), (x=6,y=3), (x=7,y=3), (x=8,y=3), (x=9,y=3)}
kkwong10 commented 1 year ago

@davorchap

Interesting data point: I added some telemetry + a small sleep (0.2 s) before checking the core status again, so I'm not blasting the log. As a result, I wasn't able to recreate the post-commit hang -- currently at iteration 8 and still going strong. I think it is quite hard to get to the hang scenario when I perturb the timing. I'm rerunning without the sleep to see if I can still hit the hang standalone.
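
For concreteness, the perturbation amounts to roughly this in the host-side wait loop. This is a sketch only: check_if_riscs_on_specified_core_done is the helper visible in the stack traces above, but the surrounding loop structure is an assumption, not the exact tt_metal code.

    #include <chrono>
    #include <set>
    #include <thread>

    // Sweep the not-done set, then back off 0.2 s so the host isn't hammering the
    // PCIe TLB while cores finish.
    void wait_until_cores_done(tt_cluster *cluster, int chip_id, std::set<CoreCoord> not_done_cores) {
        while (!not_done_cores.empty()) {
            for (auto it = not_done_cores.begin(); it != not_done_cores.end();) {
                if (tt::llrt::internal_::check_if_riscs_on_specified_core_done(cluster, chip_id, *it)) {
                    it = not_done_cores.erase(it);
                } else {
                    ++it;
                }
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(200));  // the 0.2 s perturbation
        }
    }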

I think there are a few things I want to try.

  1. Create small unit tests which check our harvesting by reading the NODE ID registers and verifying they match the expected physical coords (see the sketch after this list). I want to rule out harvesting, since I don't actually see any harvesting issues when I dump coordinates, etc. Also, it is weird that the issue is non-deterministic, but the fact that a whole row is "hanging" is very suspicious.
  2. Try running this on the machine where you hit it, and see if I can hit it faster.
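
A rough shape of the unit test in (1), heavily hedged: the NODE_ID register address and the 6-bit x/y field widths below are assumptions (placeholders that would need to come from noc_parameters.h for Wormhole), and soc_desc.workers / convert_physical_cxy_to_virtual usage follows the snippets earlier in this thread rather than any verified API.

    // Hypothetical harvesting sanity check: each worker core reports its own NOC
    // coordinates in a node-ID register; verify they match the physical coords we
    // think we translated to.
    constexpr uint64_t NOC_NODE_ID_ADDR = 0xFFB20044;  // placeholder -- not verified
    constexpr uint32_t NOC_COORD_BITS   = 6;           // assumed field width
    constexpr uint32_t NOC_COORD_MASK   = (1u << NOC_COORD_BITS) - 1;

    for (const auto &phys_core : soc_desc.workers) {
        tt_cxy_pair virtual_core =
            cluster->convert_physical_cxy_to_virtual(tt_cxy_pair(chip_id, phys_core.x, phys_core.y));
        std::vector<uint32_t> node_id(1);
        device->read_from_device(node_id.data(), virtual_core, NOC_NODE_ID_ADDR,
                                 node_id.size() * sizeof(uint32_t), "LARGE_READ_TLB");
        EXPECT_EQ(node_id[0] & NOC_COORD_MASK, phys_core.x);                      // gtest-style checks
        EXPECT_EQ((node_id[0] >> NOC_COORD_BITS) & NOC_COORD_MASK, phys_core.y);
    }
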
tt-rkim commented 1 year ago

Just posting here for visibility. Observed another hang on BM on a usually good machine:

Logs, still running but should time out soon: https://github.com/tenstorrent-metal/tt-metal/actions/runs/6256572531/job/16987607490

Stack trace

#0  0x00007fecbb961dc5 in std::_Rb_tree_iterator<std::pair<int const, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > > >::operator*() const ()
   from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#1  0x00007fecbb92402c in std::map<int, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > > > >::at(int const&) () from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#2  0x00007fecbb90d35a in tt_SiliconDevice::address_in_tlb_space(unsigned int, unsigned int, int, unsigned int, unsigned int) () from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#3  0x00007fecbb9157c1 in tt_SiliconDevice::read_device_memory(unsigned int*, tt_cxy_pair, unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#4  0x00007fecbb91f908 in tt_SiliconDevice::read_from_device(unsigned int*, tt_cxy_pair, unsigned long, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/build/lib/libdevice.so
#5  0x00007fecbc244e6a in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x55e1a408c850, dram_core=..., addr=16, size_in_bytes=4, small_access=<optimized out>) at /usr/include/c++/9/bits/char_traits.h:300
#6  0x00007fecbc257820 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>) at /home/tt-admin/actions-runner/_work/tt-metal/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
#7  0x00007fecbc2586e8 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=16, __closure=0x55e1a4090830) at tt_metal/llrt/llrt.cpp:384
#8  std::_Function_handler<bool(long unsigned int), tt::llrt::internal_::check_if_riscs_on_specified_core_done(tt_cluster*, int, const CoreCoord&)::<lambda(uint64_t)> >::_M_invoke(const std::_Any_data &, unsigned long &&) (__functor=..., __args#0=<optimized out>)
    at /usr/include/c++/9/bits/std_function.h:285
#9  0x00007fecbc258547 in std::function<bool (unsigned long)>::operator()(unsigned long) const (__args#0=<optimized out>, this=0x7fff80aba430) at /usr/include/c++/9/bits/std_function.h:683
#10 tt::llrt::internal_::check_if_riscs_on_specified_core_done (cluster=<optimized out>, chip_id=<optimized out>, core=...) at tt_metal/llrt/llrt.cpp:396
#11 0x00007fecbc1c5334 in tt::tt_metal::LaunchProgram (device=device@entry=0x55e1a3f13c40, program=..., stagger_start=stagger_start@entry=false) at tt_metal/tt_metal.cpp:519
#12 0x000055e1a2eba6a6 in tt::test::buffer::detail::SimpleTiledL1WriteCBRead (device=0x55e1a3f13c40, core=..., input_local_address=794624, intermed_local_address=<optimized out>, output_local_address=802816, byte_size=<optimized out>)
    at tests/tt_metal/tt_metal/unit_tests/buffer/test_simple_l1_buffer.cpp:102
#13 0x000055e1a2ebc398 in SingleDeviceFixture_TestSimpleL1ReadWritex2y2TileLo_Test::TestBody (this=0x55e1a3cb7eb0) at ./tt_metal/third_party/umd/device/tt_xy_pair.h:13
#14 0x00007fecbbec925e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /usr/local/lib/libgtest.so.1.13.0
#15 0x00007fecbbec12bd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) () from /usr/local/lib/libgtest.so.1.13.0
DrJessop commented 1 year ago

Ran this on Davor's t3005 and set it to run for 10K iterations, but forgot to log the iteration number and it hung. Launched again, and will report the number.

It does seem like fast dispatch improves stability, but it just delays the hang.

DrJessop commented 1 year ago

Reran on Davor's t3005. Made it to 683 iterations before reaching the hanging state.

kkwong10 commented 1 year ago

FYI, this hang-signature looks similar to the hang related to UMD seen on BBE as well: https://yyz-gitlab.local.tenstorrent.com/tenstorrent/budabackend/-/issues/1931

jvasilje commented 1 year ago

@abhullar-tt is this relevant?

abhullar-tt commented 1 year ago

I believe @kkwong10 was root-causing this issue; it appears to be similar to the hang identified in BBE, and I don't think this has been solved in UMD yet.

jvasilje commented 1 year ago

@davorchap should we repeat this stress test?

jvasilje commented 12 months ago

@abhullar-tt any update on this?

abhullar-tt commented 12 months ago

@abhullar-tt any update on this?

No updates. I can kick off a stress test to see if this still shows up

jliangTT commented 10 months ago

This issue is quite stale and has not been actioned in recent months. Can we adjust the priority down?

pgkeller commented 10 months ago

imho, the problem isn't that this is too high a priority, it is that we aren't assigning a resource. Undiagnosed hangs should be critical.

abhullar-tt commented 10 months ago

I kicked off a stress test on WH a month ago but didn't see the same hang signature. We probably should spend more time confirming whether we still see it or not.

TT-billteng commented 10 months ago

The WH stress test has been successful for quite some time; is this test still part of the post-commit suite?

vtangTT commented 9 months ago

I ran 50 iterations on WH on main (bf25587) and saw no hangs. I don't believe this is an issue anymore.