tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
303 stars 25 forks source link

[Llama] first run with generating positional rotation matrix caches segfaults and OOMs #9837

Open tstescoTT opened 5 days ago

tstescoTT commented 5 days ago

Describe the bug

With a fresh tt-metal weights cache for llama2 and llama3 on first run the rotation matrices (rot mats) are cached for later use. For example:

2024-05-02 18:06:09.559 | DEBUG    | ttnn.operations.core:from_torch_and_dump:676 - Generating cache for /home/tt-admin/.cache/tt-metal-llama2-70b/llama2_cache/tt-metal-weights-cache/rot_mat_decode_69_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1, 128, 128]), dtype BFLOAT16, layout TILE

During this first run caching segfaults typically occur as the token position is increased.

Workaround: a soft reset (tt-smi -r 0,1,2,3) can be used to reset the device and run again to generate caches for a higher token position until the entire max seq len is reached.

Without doing first-run generation for the entire max seq len, the segfaults or hangs may occur during applications if the current seq len does not have cached rot mats. Ideally this would be part of the inital model set up for applications to avoid unpredictable caching during application runtime.

To Reproduce Steps to reproduce the behavior:

  1. build tt-metal on commit a053bc8c9cc380804db730ed7ed084d104abb6a0
  2. setup llama3 experimental demo (see https://github.com/tenstorrent/tt-metal/tree/main/models/experimental/llama2_70b#how-to-run)
  3. run demo_first_run_4k.py script (https://gist.github.com/tstescoTT/86e31370590666e0edb920bd6bf615aa#file-demo_first_run_4k-py) forcing 4k token generation. pytest -svv demo_first_run_4k.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]

Expected behavior The rot mat cache generation should not cause segfaults or OOMs. Ideally there should be a way to optionally pre-compute all the rot mats ahead of application runtime to avoid unexpected caching and resulting issues, e.g. with read-only file systems.

Example traces Example segfault:

2024-06-27 16:34:04.410 | INFO     | demo:run_decode:199 - Loop 86 user 31: 

Fatal Python error: Segmentation fault

Thread 0x00007f0ac5ffb700 (most recent call first):
  File "/usr/lib/python3.8/threading.py", line 306 in wait
  File "/usr/lib/python3.8/threading.py", line 558 in wait
  File "/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f0e2bd01740 (most recent call first):
  File "/usr/lib/python3.8/pathlib.py", line 704 in _format_parsed_parts
  File "/usr/lib/python3.8/pathlib.py", line 722 in __str__
  File "/tt-metal/ttnn/ttnn/operations/core.py", line 619 in load_tensor
  File "/tt-metal/ttnn/ttnn/decorators.py", line 782 in wrapper
  File "/tt-metal/ttnn/ttnn/operations/core.py", line 765 in as_tensor
  File "/tt-metal/ttnn/ttnn/decorators.py", line 782 in wrapper
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_model_optimized.py", line 247 in prepare_inputs
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_generation.py", line 79 in decode_forward
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/tt/llama_generation.py", line 66 in forward
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 176 in run_decode
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 56 in main
  File "/home/user/tt-metal-llama3-70b/src/tt_metal_impl/demo/demo.py", line 450 in test_LlamaModel_demo
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/tt-metal/python_env/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

At higher token positions DRAM OOM occured:

2024-06-27 19:02:51.022 | INFO     | __main__:run_decode:199 - Loop 1001

2024-06-27 19:02:51.127 | INFO     | __main__:run_decode:199 - Loop 1002

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7fe17f4ad958]
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
 --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7fe17f0b1928]
 --- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7fe17f053299]
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7fe17f05288f]
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7fe17f051f7d]
 --- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7fe17f050994]
 --- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7fe17f0b0709]
 --- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7fe17f37be39]
 --- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7fe17f37cb6f]
 --- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7fe17ec9b9eb]
 --- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7fe17ec9bc5b]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fe1eb72a609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fe1eb864353]

Aborted (core dumped)

Please complete the following environment information:

cglagovichTT commented 2 days ago

I was able to repro this segfault on sjc-snva-t3002.

Config:

image
2024-07-02 16:45:32.806 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_25_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1,│······
 32, 32]), dtype BFLOAT16, layout TILE                                                                                                                                                                                                │······
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │······
                                                                                                                                                                                                                                      │······
Thread 223 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │······
[Switching to Thread 0x7ffe82fd7700 (LWP 773289)]                                                                                                                                                                                     │······
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │······
(gdb) bt                                                                                                                                                                                                                              │······
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │······
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │······
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │······
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│······
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │······
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│······
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │······
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │······
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│······
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │······
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │······
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95                                                                                                                                                    │······
(gdb) list                                                                                                                                                                                                                            │······
1       <built-in>: No such file or directory.                                                                                                                                                                                        │······
(gdb)
cglagovichTT commented 2 days ago

When I run with async disabled, I see a variety of errors.

Case 1:

image
2024-07-02 17:59:25.628 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_109_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·, 32, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·Thread 218 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe857dc700 (LWP 820677)]                                                                                                                                                                                     │·
0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeList:│·
:Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                   │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeL│·
ist::Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                               │·#1  0x00007fff8883432f in tt::tt_metal::allocator::FreeList::allocate_slice_of_free_block(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, unsigned long, unsigned long) ()                                         │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#2  0x00007fff88834754 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·
#3  0x00007fff88837075 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·#4  0x00007fff88838f17 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff8883905f in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#6  0x00007fff8887398e in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·#7  0x00007fff888760c4 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·#8  0x00007fff88873a85 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#9  0x00007fff88822c63 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·#10 0x00007fff888215d8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·#11 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#12 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·
#13 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#14 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·#15 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()
cglagovichTT commented 2 days ago

case 2: deallocate

image
2024-07-02 18:42:19.973 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_rot_mat_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 32, │·
128, 128]), dtype BFLOAT16, layout TILE                                                                                                                                                                                               │·
2024-07-02 18:42:19.977 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 838272)]                                                                                                                                                                                     │·
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                         │·
(gdb) list                                                                                                                                                                                                                            │·
1       <built-in>: No such file or directory.                                                                                                                                                                                        │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                     │·
#1  0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#2  0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so       │·
#3  0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│·
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                         │·
#4  0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│·
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                               │·
#5  0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                   │·
#6  0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│·
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                          │·
#7  0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                                                                  │·
#8  0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
cglagovichTT commented 2 days ago

Case 3: search_first

image
2024-07-02 18:57:17.853 | DEBUG    | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_149_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE                                                                                                                                                                                              │·
--Type <RET> for more, q to quit, c to continue without paging--                                                                                                                                                                      │·
                                                                                                                                                                                                                                      │·
Thread 219 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                      │·
[Switching to Thread 0x7ffe86fdf700 (LWP 857393)]                                                                                                                                                                                     │·
\0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                │·
(gdb) bt                                                                                                                                                                                                                              │·
#0  0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                             │·
#1  0x00007fff888347a9 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                  │·#2  0x00007fff888378d5 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so      │·
#3  0x00007fff88839777 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) ()                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·
#4  0x00007fff888398bf in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) ()                                  │·
   from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                            │·#5  0x00007fff888741ee in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                       │·
#6  0x00007fff88876924 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                              │·
#7  0x00007fff888742e5 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                              │·
#8  0x00007fff88822c73 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                 │·
#9  0x00007fff888215e8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·
) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so                                                                                                                                                                          │·
#10 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) ()                                                      │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#11 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·
__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                       │·#12 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) ()                                   │·   from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                                                                                                                                            │·#13 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so                                                          │·
#14 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()
cglagovichTT commented 1 day ago

I was not able to repro this segfault with async queues disabled.

In one of the deallocate segfaults in a worker thread, I see that the main thread is involved in sending a tensor to device.

image

This made me wonder if this code pattern is the culprit:

            rot_mats = ttnn.as_tensor(
                rot_mat,
                dtype=ttnn.bfloat16,
                layout=ttnn.TILE_LAYOUT,
                device=self.device_mesh,
                cache_file_name=cache_name(f"rot_mat_decode_{start_pos}"),
                memory_config=self.model_config["DRAM_MEMCFG"],
                mesh_mapper=ReplicateTensorToMesh(self.device_mesh),
            )
            rot_mats = ttnn.to_device(rot_mats, self.device_mesh)

The to_device should be unnecessary but not incorrect. I ran the test again with this call removed, but the segfaults did not go away.

cglagovichTT commented 1 day ago

Repro instructions:

gdb --args python -m pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]

Expected output:

--Type <RET> for more, q to quit, c to continue without paging--

Thread 224 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe827d6700 (LWP 369024)]
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so
(gdb) q
tstescoTT commented 23 hours ago

On a new T3000 machine to get the first run to 2816 tokens generated in a single sequence I got 6 crashes:

2024-07-03-Kuaishou-ttsmi

I did a soft reset tt-smi -r 0,1,2,3 after each crash and reran the first run script.

crash 1:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:26:09.820 | INFO     | __main__:run_decode:199 - Loop 5

free(): invalid pointer
Aborted (core dumped)

2:

(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:33:32.714 | INFO     | __main__:run_decode:199 - Loop 88

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7f063a443958]
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::op
tional<tt::tt_metal::ShardSpecBuffer> const&)
 --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
 --- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__
16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7f063a047928]
 --- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7f0639fe9299]
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Te
nsor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt
::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std
::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
 --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::ope
ration::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::v
ector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::_
_1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7f0639fe888f]
 --- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7f0639fe7f7d]
 --- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::T
ensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>,
 std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocato
r<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tens
or>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
 --- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7f0639fe6994]
 --- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7f063a046709]
 --- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7f063a311e39]
 --- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7f063a312b6f]
 --- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7f0639c319eb]
 --- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7f0639c31c5b]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f06a66c0609]
 --- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f06a67fa353]

Aborted (core dumped)

crash 3 (same stack trace as above):

2024-07-03 11:40:37.964 | INFO     | __main__:run_decode:199 - Loop 127

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 4 (same stack trace as above):

2024-07-03 11:49:00.270 | INFO     | __main__:run_decode:199 - Loop 489

                 Always | FATAL    | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception

crash 5:

2024-07-03 11:57:17.461 | INFO     | __main__:run_decode:199 - Loop 847

2024-07-03 11:57:17.573 | INFO     | __main__:run_decode:199 - Loop 848

Segmentation fault (core dumped)

crash 6 (hang)

2024-07-03 12:13:12.789 | INFO     | __main__:run_decode:199 - Loop 913

2024-07-03 12:13:12.903 | INFO     | __main__:run_decode:199 - Loop 914

^C^C^C^C^CTerminated

Rerunning after this crash got to 2816 tokens and gets to the known issue https://github.com/tenstorrent/tt-metal/issues/9839. This completes the first run and generation for 2k context is relatively reliable.

tt-asaigal commented 23 hours ago

Hey @tstescoTT would you mind running with this commit cherry-picked: https://github.com/tenstorrent/tt-metal/commit/4558673547d6ba9fda448f99ef8a7e3c3e5ffb45. It resolved the segfault for me locally