I was able to repro a failure locally on commit fe1129c4e0f8d70173c9abe1aac59e5429ef8682.
Repro:
1B weights
FAKE_DEVICE=N300 pytest --count=100 -svv models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-3_batch"
malloc(): unsorted double linked list corrupted
Fatal Python error: Aborted
Thread 0x00007fc8c8f79700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 306 in wait
File "/usr/lib/python3.8/threading.py", line 558 in wait
File "/home/cglagovich/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
~@DuongQLee Your commit has caused a regression in a llama model test.~
~Please investigate and fix, or we'll have to look into reverting your commit by end of week.~
Sorry that I wasn't clear - this is not a bisect, but just a repro. I don't know yet which commit introduced the error, but I am definitely able to see the failure locally as late as https://github.com/tenstorrent/tt-metal/commit/fe1129c4e0f8d70173c9abe1aac59e5429ef8682.
Sounds good - I edited my comment - my apologies @DuongQLee
Are you able to go back further to see? @cglagovichTT
As a tip you could try using the tt_bisect script and target a machine, let's say on cloud, to do this work for you
I'll launch that bisect. However, I suspect the cause of the failure has existed since before we merged this test.
Hmm... and this was the llama refactor you guys did a couple weeks ago?
Yeah, we just added this test a few weeks ago. Since the failure is non-deterministic (ND), it may or may not have existed since we first added the test.
This failure is very tough to repro. I have found that it is slightly more reproducible if I stress the system with something like stress --cpu 32 --vm 2 --vm-bytes 128M --timeout 120s. I have twice reproduced this as a memory corruption error during decode iterations, when tt-metal is deallocating tensors.
2024-10-31 15:00:27.120 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:439 - Iteration 49: 33ms @ 30.5 tok/s/user (30.5 tok/s throughput)
free(): corrupted unsorted chunks
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7de9859 in __GI_abort () at abort.c:79
#2 0x00007ffff7e5426e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f7e298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7e5c2fc in malloc_printerr (str=str@entry=0x7ffff7f80718 "free(): corrupted unsorted chunks") at malloc.c:5347
#4 0x00007ffff7e5e042 in _int_free (av=0x7ff244000020, p=0x7ff831902030, have_lock=<optimized out>) at malloc.c:4356
#5 0x00007fff880f1e1f in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<3ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceHostStorage>(tt::tt_metal::MultiDeviceHostStorage&) const::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::owned_buffer::Buffer<unsigned char>, tt::tt_metal::owned_buffer<unsigned short>, tt::tt_metal::owned_buffer<int>, tt::tt_metal::owned_buffer<unsigned int>, tt::tt_metal::owned_buffer<float>, tt::tt_metal::owned_buffer<bfloat16> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceHostStorage>(tt::tt_metal::MultiDeviceHostStorage&) const::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::owned_buffer::Buffer<unsigned char>, tt::tt_metal::owned_buffer<unsigned short>, tt::tt_metal::owned_buffer<int>, tt::tt_metal::owned_buffer<unsigned int>, tt::tt_metal::owned_buffer<float>, tt::tt_metal::owned_buffer<bfloat16> >&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#6 0x00007fff880f1260 in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<3ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#7 0x00007fff880ec4cd in tt::tt_metal::Tensor::~Tensor() () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#8 0x00007fff88899416 in pybind11::class_<tt::tt_metal::Tensor>::dealloc(pybind11::detail::value_and_holder&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#9 0x00007fff8827362b in pybind11::detail::clear_instance(_object*) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#10 0x00007fff88273414 in pybind11_object_dealloc () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
Repro steps:
Branch: cglagovich/14475 (latest main + one commit which increases the length of the test)
pip install pytest-repeat
Env:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
LLAMA_DIR=/proj_sw/user_dev/llama32-data/Llama3.2-11B-Vision-Instruct/
Command:
FAKE_DEVICE=N300 pytest --count=100 -svv models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-100_batch"
Assigning @tt-asaigal.
Bisect was inconclusive. Commits as early as 0af26ed fail with a segfault, but I found that different commits could take anywhere between 5 and 45 minutes to fail. It is therefore tough to say whether a commit that passed 1 hour of testing is actually "good", so this failure may have been present for a long time.
The issue is with the readback logic in the demo:
tt_output_torch = ttnn.to_torch(
    tt_out_tok.cpu(blocking=False, cq_id=1),  # <<<<<< THIS NON-BLOCKING READ: THE DEVICE AND WORKER THREADS ARE NOT GUARANTEED TO POPULATE THIS TENSOR IMMEDIATELY
    mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1),
)[0, 0, 0, :batch_size]  # <<< INDEXING INTO A POTENTIALLY EMPTY TENSOR, WHICH WILL GET POPULATED ASYNCHRONOUSLY
ttnn.record_event(1, write_event)
# Save output token to print out later
for user in range(batch_size):
    user_tok = tt_output_torch[user].tolist()  # <<<<< NO GUARANTEE OUTPUT IS POPULATED HERE EITHER
    if user_tok != 28803 and user_done[user] == False:  # Stop saving the output after hitting the EOS token
        all_outputs[user].append(user_tok)
    else:
        user_done[user] = True
        logger.trace(f"[User {user}] Finished decoding at iteration {iteration}")
if all(user_done):
    users_decoding = False
Since reads are non-blocking, there is no guarantee that the output tensor is populated by the time it's read in this setup. It's up to the user to ensure the host has blocked on the device and worker threads before indexing into the output.
Changing the read to be blocking resolves the segfault (ran for almost 2 hours and saw no issues). @cglagovichTT I've pushed a change to https://github.com/tenstorrent/tt-metal/tree/refs/heads/asaigal/14475 and can make a PR if this solution is okay (seeing 90-93t/s/u with blocking reads).
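For reference, a minimal sketch of the blocking variant of the readback above (an illustration based on the demo snippet, not necessarily the exact diff on the branch):

# Same readback as the demo, but with blocking=True so the host waits for the
# tensor to be populated before it is indexed. Variables come from the snippet above.
tt_output_torch = ttnn.to_torch(
    tt_out_tok.cpu(blocking=True, cq_id=1),  # blocking read: safe to index immediately afterwards
    mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1),
)[0, 0, 0, :batch_size]
ttnn.record_event(1, write_event)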
A proper solution allowing non-blocking reads for LLMs would be as follows (a sketch follows this list):
1. Run the first n iterations in a non-blocking manner and append your outputs to a list. The outputs will not be synchronously populated.
2. Call ttnn.event_synchronize(event_list[iteration_count - n]). This is a host-to-device sync, where the host waits for output iteration_count - n to get populated (at iteration n this will be the first token).
3. Process the now-populated output and dispatch iteration n + 1.
This way, after the warmup iterations, you'll have inputs buffered up and will always be dispatching inference requests n tokens ahead. If n is large enough, the synchronize call will just be a nop that's there for safety.
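A rough sketch of that pipelined pattern. run_decode_iteration, num_iterations, and n are hypothetical placeholders, and ttnn.create_event is assumed as the event-creation call (the exact API may differ); the event/readback bookkeeping is the point.

# Pipelined non-blocking readback: always dispatch n iterations ahead of the host reads.
outputs = []      # non-blocking readbacks, populated asynchronously by the worker threads
read_events = []  # one event recorded per readback on cq 1

for iteration in range(num_iterations):
    tt_out_tok = run_decode_iteration(iteration)             # dispatch this iteration (non-blocking)
    outputs.append(tt_out_tok.cpu(blocking=False, cq_id=1))  # queue the readback, do not wait
    read_event = ttnn.create_event(mesh_device)              # assumed event-creation call
    ttnn.record_event(1, read_event)
    read_events.append(read_event)

    if iteration >= n:
        # Host-to-device sync: wait until output (iteration - n) is actually populated
        # before converting and indexing it; the device stays n tokens ahead.
        ttnn.event_synchronize(read_events[iteration - n])
        ready = ttnn.to_torch(
            outputs[iteration - n], mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1)
        )[0, 0, 0, :batch_size]
        # ... process the tokens in `ready` ...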
Thanks @tt-asaigal !
We'll measure perf with blocking and with the output buffer to make a decision. Depending on the number of iterations needed for proper buffering, the second option might be a problem, as it will affect the time to first token.
However, for an immediate fix before we complete a perf analysis I think we can push the blocking. Thoughts @cglagovichTT @yieldthought ?
I'm in favor of merging the blocking PR to solve the segfault, then implementing Aditya's proper solution.
I will include Aditya's fix in the llama-70b PR (which is now done and under revision).
why is this still a P0? @cglagovichTT @yieldthought
Changing to P1 since there's a resolution on branch. @mtairum please close once you merge https://github.com/tenstorrent/tt-metal/pull/14273
Blocking fix from Aditya was merged to main. We will monitor the state of CI in the coming days/weeks to see if the issues are still present after the fix.
Thanks bossmans @tt-asaigal @mtairum @cglagovichTT
https://github.com/tenstorrent/tt-metal/actions/runs/11592741854/job/32283217219#step:10:1136 https://github.com/tenstorrent/tt-metal/actions/runs/11584418516/job/32251570169#step:10:1169 https://github.com/tenstorrent/tt-metal/actions/runs/11587872413/job/32283211792#step:10:1136
@cglagovichTT says it could also be an ND segfault, present since last Wednesday.