Open tstescoTT opened 5 days ago
I was able to repro this segfault on sjc-snva-t3002.
Config:
2024-07-02 16:45:32.806 | DEBUG | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_25_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1,│······
32, 32]), dtype BFLOAT16, layout TILE │······
--Type <RET> for more, q to quit, c to continue without paging-- │······
│······
Thread 223 "python" received signal SIGSEGV, Segmentation fault. │······
[Switching to Thread 0x7ffe82fd7700 (LWP 773289)] │······
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
(gdb) bt │······
#0 0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
#1 0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
#2 0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
#3 0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│······
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │······
#4 0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│······
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │······
#5 0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
#6 0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│······
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │······
#7 0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477 │······
#8 0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 │······
(gdb) list │······
1 <built-in>: No such file or directory. │······
(gdb)
When I run with async disabled, I see a variety of errors.
Case 1:
2024-07-02 17:59:25.628 | DEBUG | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_109_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·, 32, 128]), dtype BFLOAT16, layout TILE │·
--Type <RET> for more, q to quit, c to continue without paging-- │·
│·Thread 218 "python" received signal SIGSEGV, Segmentation fault. │·
[Switching to Thread 0x7ffe857dc700 (LWP 820677)] │·
0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeList:│·
:Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
(gdb) bt │·
#0 0x00007fff88833ff5 in tt::tt_metal::allocator::FreeList::update_left_aligned_allocated_block_connections(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, boost::local_shared_ptr<tt::tt_metal::allocator::FreeL│·
ist::Block>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#1 0x00007fff8883432f in tt::tt_metal::allocator::FreeList::allocate_slice_of_free_block(boost::local_shared_ptr<tt::tt_metal::allocator::FreeList::Block>, unsigned long, unsigned long) () │· from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#2 0x00007fff88834754 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#3 0x00007fff88837075 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#4 0x00007fff88838f17 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) () │· from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#5 0x00007fff8883905f in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) () │·
from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#6 0x00007fff8887398e in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#7 0x00007fff888760c4 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#8 0x00007fff88873a85 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#9 0x00007fff88822c63 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#10 0x00007fff888215d8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#11 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) () │· from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#12 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·
#13 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) () │· from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#14 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#15 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()
case 2: deallocate
2024-07-02 18:42:19.973 | DEBUG | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_rot_mat_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 32, │·
128, 128]), dtype BFLOAT16, layout TILE │·
2024-07-02 18:42:19.977 | DEBUG | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_129_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE │·
--Type <RET> for more, q to quit, c to continue without paging-- │·
│·
Thread 219 "python" received signal SIGSEGV, Segmentation fault. │·
[Switching to Thread 0x7ffe86fdf700 (LWP 838272)] │·
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
(gdb) list │·
1 <built-in>: No such file or directory. │·
(gdb) bt │·
#0 0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#1 0x00007fff88875f17 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#2 0x00007fff88873b82 in tt::tt_metal::EnqueueDeallocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Allocator&, unsigned int, tt::tt_metal::BufferType, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#3 0x00007fff89071d70 in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda(tt::tt_metal::Device*)#1}, std::__1│·
::allocator<{lambda(tt::tt_metal::Device*)#1}>, void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*&&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·
#4 0x00007fff8907200f in std::__1::__function::__func<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceStorage>(tt::tt_metal::MultiDeviceStorage&) const::{lambda()#1}, std::__1::allocator<{lambda()│·
#1}>, void ()>::operator()() () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·
#5 0x00007fff8881429b in tt::WorkExecutor::run_worker() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#6 0x00007fff8881450b in void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (tt::WorkExecutor::*)(), tt::WorkEx│·
ecutor*> >(void*) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#7 0x00007ffff7db5609 in start_thread (arg=<optimized out>) at pthread_create.c:477 │·
#8 0x00007ffff7eef353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Case 3: search_first
2024-07-02 18:57:17.853 | DEBUG | ttnn.operations.core:from_torch_and_dump:739 - Generating cache for /proj_sw/llama3-data-cache/llama3_attn_masks_decode_149_multi_device_dtype_BFLOAT16_layout_TILE.bin of shape ttnn.Shape([1, 1│·
, 32, 160]), dtype BFLOAT16, layout TILE │·
--Type <RET> for more, q to quit, c to continue without paging-- │·
│·
Thread 219 "python" received signal SIGSEGV, Segmentation fault. │·
[Switching to Thread 0x7ffe86fdf700 (LWP 857393)] │·
\0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
(gdb) bt │·
#0 0x00007fff88833d13 in tt::tt_metal::allocator::FreeList::search_first(unsigned long, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#1 0x00007fff888347a9 in tt::tt_metal::allocator::FreeList::allocate(unsigned long, bool, unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#2 0x00007fff888378d5 in tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#3 0x00007fff88839777 in tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>) () │· from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#4 0x00007fff888398bf in tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>) () │·
from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·#5 0x00007fff888741ee in tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#6 0x00007fff88876924 in tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#7 0x00007fff888742e5 in tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#8 0x00007fff88822c73 in tt::tt_metal::Buffer::allocate() () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#9 0x00007fff888215e8 in tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool│·
) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so │·
#10 0x00007fff8905c6a8 in std::__1::shared_ptr<tt::tt_metal::Buffer> std::__1::allocate_shared[abi:ue170006]<tt::tt_metal::Buffer, std::__1::allocator<tt::tt_metal::Buffer>, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt│·::tt_metal::BufferType const&, void>(std::__1::allocator<tt::tt_metal::Buffer> const&, tt::tt_metal::Device*&, unsigned int&, unsigned int&, tt::tt_metal::BufferType const&) () │· from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#11 0x00007fff88fa6e4a in tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::│·
__1::optional<tt::tt_metal::ShardSpecBuffer> const&) () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#12 0x00007fff8906e84a in tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) () │· from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·#13 0x00007fff88c37418 in tt::tt_metal::operation::generic_create_output_tensors<tt::operations::primary::Matmul> () from /home/cglagovich/tt-metal/build/lib/libtt_eager.so │·
#14 0x00007fff88c2f1ba in tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&) const ()
I was not able to repro this segfault with async queues disabled.
In one of the deallocate
segfaults in a worker thread, I see that the main thread is involved in sending a tensor to device.
This made me wonder if this code pattern is the culprit:
rot_mats = ttnn.as_tensor(
rot_mat,
dtype=ttnn.bfloat16,
layout=ttnn.TILE_LAYOUT,
device=self.device_mesh,
cache_file_name=cache_name(f"rot_mat_decode_{start_pos}"),
memory_config=self.model_config["DRAM_MEMCFG"],
mesh_mapper=ReplicateTensorToMesh(self.device_mesh),
)
rot_mats = ttnn.to_device(rot_mats, self.device_mesh)
The to_device
should be unnecessary but not incorrect. I ran the test again with this call removed, but the segfaults did not go away.
Repro instructions:
ondemand
(seems to help with repro)gdb --args python -m pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]
Expected output:
--Type <RET> for more, q to quit, c to continue without paging--
Thread 224 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe827d6700 (LWP 369024)]
0x00007fff88834d03 in tt::tt_metal::allocator::FreeList::deallocate(unsigned long) () from /home/cglagovich/tt-metal/build/lib/libtt_metal.so
(gdb) q
On a new T3000 machine to get the first run to 2816 tokens generated in a single sequence I got 6 crashes:
I did a soft reset tt-smi -r 0,1,2,3
after each crash and reran the first run script.
crash 1:
(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:26:09.820 | INFO | __main__:run_decode:199 - Loop 5
free(): invalid pointer
Aborted (core dumped)
2:
(python_env) user@66a27c372dce:~/tt-metal-llama3-70b/src$ python tt_metal_impl/demo/demo_llama3_first_run_4k.py
...
2024-07-03 11:33:32.714 | INFO | __main__:run_decode:199 - Loop 88
Always | FATAL | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
info:
Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
backtrace:
--- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
--- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
--- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
--- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
--- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
--- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
--- tt::tt_metal::Buffer::allocate()
--- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, bool)
--- /tt-metal/build/lib/libtt_eager.so(+0x648958) [0x7f063a443958]
--- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::op
tional<tt::tt_metal::ShardSpecBuffer> const&)
--- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
--- /tt-metal/build/lib/libtt_eager.so(_ZN2tt8tt_metal9operation29generic_create_output_tensorsINS_10operations7primary6MatmulEEENS1_21program_output_helperIT_Xsr18has_create_programIS7_EE5valueEE4typeERKS7_RKNSt3__
16vectorINS0_6TensorENSC_9allocatorISE_EEEENSC_8optionalINS0_8DataTypeEEENS0_6LayoutERKNSK_INS0_12MemoryConfigEEE+0x178) [0x7f063a047928]
--- tt::operations::primary::Matmul::create_output_tensors(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&) const
--- /tt-metal/build/lib/libtt_eager.so(+0x1ee299) [0x7f0639fe9299]
--- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::detail::run_device_operation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Te
nsor>>>(std::__1::reference_wrapper<tt::tt_metal::CommandQueue>, tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt
::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std
::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)
--- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::ope
ration::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::v
ector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::_
_1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
--- /tt-metal/build/lib/libtt_eager.so(+0x1ed88f) [0x7f0639fe888f]
--- /tt-metal/build/lib/libtt_eager.so(+0x1ecf7d) [0x7f0639fe7f7d]
--- tt::tt_metal::operation::launch_op(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::T
ensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>,
std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocato
r<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tens
or>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
--- /tt-metal/build/lib/libtt_eager.so(+0x1eb994) [0x7f0639fe6994]
--- /tt-metal/build/lib/libtt_eager.so(+0x24b709) [0x7f063a046709]
--- /tt-metal/build/lib/libtt_eager.so(+0x516e39) [0x7f063a311e39]
--- /tt-metal/build/lib/libtt_eager.so(+0x517b6f) [0x7f063a312b6f]
--- /tt-metal/build/lib/libtt_metal.so(+0x1579eb) [0x7f0639c319eb]
--- /tt-metal/build/lib/libtt_metal.so(+0x157c5b) [0x7f0639c31c5b]
--- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f06a66c0609]
--- /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f06a67fa353]
Aborted (core dumped)
crash 3 (same stack trace as above):
2024-07-03 11:40:37.964 | INFO | __main__:run_decode:199 - Loop 127
Always | FATAL | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
crash 4 (same stack trace as above):
2024-07-03 11:49:00.270 | INFO | __main__:run_decode:199 - Loop 489
Always | FATAL | Out of Memory: Not enough space to allocate 1048576 B DRAM buffer across 12 banks, where each bank needs to store 88064 B
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
crash 5:
2024-07-03 11:57:17.461 | INFO | __main__:run_decode:199 - Loop 847
2024-07-03 11:57:17.573 | INFO | __main__:run_decode:199 - Loop 848
Segmentation fault (core dumped)
crash 6 (hang)
2024-07-03 12:13:12.789 | INFO | __main__:run_decode:199 - Loop 913
2024-07-03 12:13:12.903 | INFO | __main__:run_decode:199 - Loop 914
^C^C^C^C^CTerminated
Rerunning after this crash got to 2816 tokens and gets to the known issue https://github.com/tenstorrent/tt-metal/issues/9839. This completes the first run and generation for 2k context is relatively reliable.
Hey @tstescoTT would you mind running with this commit cherry-picked: https://github.com/tenstorrent/tt-metal/commit/4558673547d6ba9fda448f99ef8a7e3c3e5ffb45. It resolved the segfault for me locally
Describe the bug
With a fresh tt-metal weights cache for llama2 and llama3 on first run the rotation matrices (rot mats) are cached for later use. For example:
During this first run caching segfaults typically occur as the token position is increased.
Workaround: a soft reset (tt-smi -r 0,1,2,3) can be used to reset the device and run again to generate caches for a higher token position until the entire max seq len is reached.
Without doing first-run generation for the entire max seq len, the segfaults or hangs may occur during applications if the current seq len does not have cached rot mats. Ideally this would be part of the inital model set up for applications to avoid unpredictable caching during application runtime.
To Reproduce Steps to reproduce the behavior:
demo_first_run_4k.py
script (https://gist.github.com/tstescoTT/86e31370590666e0edb920bd6bf615aa#file-demo_first_run_4k-py) forcing 4k token generation.pytest -svv demo_first_run_4k.py::test_LlamaModel_demo[wormhole_b0-True-check_disabled-greedy-tt-70b-T3000-80L-decode_only-text_completion-llama3]
Expected behavior The rot mat cache generation should not cause segfaults or OOMs. Ideally there should be a way to optionally pre-compute all the rot mats ahead of application runtime to avoid unexpected caching and resulting issues, e.g. with read-only file systems.
Example traces Example segfault:
At higher token positions DRAM OOM occured:
Please complete the following environment information: