WHB0 testcase aborts with stack trace

muthutt commented 9 months ago

I'm investigating a failure in the Wormhole B0 test, pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_digamma.py, and came across the following stack trace in the Metal code:

Thread 1 "pytest" received signal SIGSEGV, Segmentation fault.
0x00007ff3ad4cba1d in SystemMemoryManager::get_issue_queue_limit (cq_id=0 '\000', this=0x0) at ./tt_metal/impl/dispatch/command_queue_interface.hpp:187
187         return this->cq_interfaces[cq_id].issue_fifo_limit << 4;
(gdb) up
#1  tt::tt_metal::CommandQueue::enqueue_write_buffer (this=0x56d0370, buffer=..., src=0x7c0ff10, blocking=false)
    at tt_metal/impl/dispatch/command_queue.cpp:929
929     const uint32_t command_issue_limit = this->manager.get_issue_queue_limit(this->id);
(gdb) 
#2  0x00007ff3ad4cc2ce in tt::tt_metal::EnqueueWriteBuffer (cq=..., buffer=..., src=src@entry=0x7c0ff10, blocking=blocking@entry=false)
    at tt_metal/impl/dispatch/command_queue.cpp:1139
1139        cq.enqueue_write_buffer(buffer, src, blocking);
(gdb) down 1
#1  tt::tt_metal::CommandQueue::enqueue_write_buffer (this=0x56d0370, buffer=..., src=0x7c0ff10, blocking=false)
    at tt_metal/impl/dispatch/command_queue.cpp:929
929     const uint32_t command_issue_limit = this->manager.get_issue_queue_limit(this->id);
(gdb) down 1
#0  0x00007ff3ad4cba1d in SystemMemoryManager::get_issue_queue_limit (cq_id=0 '\000', this=0x0) at ./tt_metal/impl/dispatch/command_queue_interface.hpp:187
187         return this->cq_interfaces[cq_id].issue_fifo_limit << 4;
(gdb) down 1
#0  0x00007ff3ad4cba1d in SystemMemoryManager::get_issue_queue_limit (cq_id=0 '\000', this=0x0) at ./tt_metal/impl/dispatch/command_queue_interface.hpp:187
187         return this->cq_interfaces[cq_id].issue_fifo_limit << 4;
(gdb) down 
Bottom (innermost) frame selected; you cannot go down.
(gdb) bt
#0  0x00007ff3ad4cba1d in SystemMemoryManager::get_issue_queue_limit (cq_id=0 '\000', this=0x0) at ./tt_metal/impl/dispatch/command_queue_interface.hpp:187
#1  tt::tt_metal::CommandQueue::enqueue_write_buffer (this=0x56d0370, buffer=..., src=0x7c0ff10, blocking=false)
    at tt_metal/impl/dispatch/command_queue.cpp:929
#2  0x00007ff3ad4cc2ce in tt::tt_metal::EnqueueWriteBuffer (cq=..., buffer=..., src=src@entry=0x7c0ff10, blocking=blocking@entry=false)
    at tt_metal/impl/dispatch/command_queue.cpp:1139
#3  0x00007ff3adb73e34 in tt::tt_metal::tensor_impl::write_data_to_device_buffer<bfloat16, tt::tt_metal::owned_buffer::Buffer> (shape=..., 
    data_type=<optimized out>, layout=tt::tt_metal::Layout::TILE, memory_config=..., buffer=..., data_to_write=...)
    at ./tt_metal/impl/buffers/buffer.hpp:129

muthutt commented 8 months ago

Still seeing the same failure.

muthutt commented 8 months ago

This is not specific to digamma: the tensor is too large for Wormhole B0 Nebula X2. Perhaps tensor allocation should flag the problem earlier, when the device cannot accommodate the requested dimensions, instead of failing later during the buffer write.
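To illustrate the kind of early check suggested above, here is a minimal sketch in Python. It is not tt-metal code: the names check_fits_on_device, tensor_size_bytes, TensorTooLargeError, the DTYPE_SIZE_BYTES table, and the free_bytes parameter are all hypothetical, and the size calculation ignores tile padding and any allocator overhead. The idea is only to compare the requested size against what the device reports as available and raise a readable error before any write is enqueued.

```python
from math import prod

# Hypothetical element sizes in bytes; this table is an assumption for the sketch,
# not something read from tt-metal.
DTYPE_SIZE_BYTES = {"bfloat16": 2, "float32": 4}


class TensorTooLargeError(ValueError):
    """Raised when a tensor cannot fit in the available device memory."""


def tensor_size_bytes(shape, dtype):
    """Return the raw payload size of a dense tensor, ignoring tile padding."""
    return prod(shape) * DTYPE_SIZE_BYTES[dtype]


def check_fits_on_device(shape, dtype, free_bytes):
    """Fail early with a readable message instead of crashing during the write.

    free_bytes is assumed to come from whatever the device or allocator reports;
    how to obtain it is outside the scope of this sketch.
    """
    needed = tensor_size_bytes(shape, dtype)
    if needed > free_bytes:
        raise TensorTooLargeError(
            f"tensor of shape {tuple(shape)} ({dtype}) needs {needed} bytes, "
            f"but only {free_bytes} bytes are available on the device"
        )
    return needed
```

A test could call check_fits_on_device(shape, "bfloat16", free_bytes) before moving the tensor to the device, so an oversized shape would surface as a TensorTooLargeError with the shape and byte counts in the message rather than as a segmentation fault.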

tenstorrent / tt-metal

WHB0 testcase aborts with stack trace #4818