tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html

Investigate failure of llama demo in single-card demos #14475

Closed: tt-rkim closed this issue 3 weeks ago

tt-rkim commented 1 month ago

https://github.com/tenstorrent/tt-metal/actions/runs/11592741854/job/32283217219#step:10:1136 https://github.com/tenstorrent/tt-metal/actions/runs/11584418516/job/32251570169#step:10:1169 https://github.com/tenstorrent/tt-metal/actions/runs/11587872413/job/32283211792#step:10:1136

@cglagovichTT says it could also be an ND (non-deterministic) segfault that has been occurring since last Wednesday.

cglagovichTT commented 1 month ago

I was able to repro a failure locally on commit fe1129c4e0f8d70173c9abe1aac59e5429ef8682.

Repro (1B weights):

FAKE_DEVICE=N300 pytest --count=100 -svv models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-3_batch"

malloc(): unsorted double linked list corrupted
Fatal Python error: Aborted

Thread 0x00007fc8c8f79700 (most recent call first):
  File "/usr/lib/python3.8/threading.py", line 306 in wait
  File "/usr/lib/python3.8/threading.py", line 558 in wait
  File "/home/cglagovich/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

tt-rkim commented 1 month ago

~@DuongQLee Your commit has caused a regression in a llama model test.~

~Please investigate and fix, or we'll have to look into reverting your commit by end of week.~

cglagovichTT commented 1 month ago

Sorry that I wasn't clear - this is not a bisect, but just a repro. I don't know yet which commit introduced the error, but I am definitely able to see the failure locally as late as https://github.com/tenstorrent/tt-metal/commit/fe1129c4e0f8d70173c9abe1aac59e5429ef8682.

tt-rkim commented 1 month ago

Sounds good - I edited my comment - my apologies @DuongQLee

Are you able to go back further to see? @cglagovichTT

tt-rkim commented 1 month ago

As a tip, you could try using the tt_bisect script and target a machine, say on cloud, to do this work for you.

cglagovichTT commented 1 month ago

I'll launch that bisect. However, I suspect that the cause of the failure has existed since before we merged this test.

tt-rkim commented 1 month ago

Hmm... and this was the llama refactor you guys did a couple weeks ago?

cglagovichTT commented 1 month ago

Yeah, we just added this test a few weeks ago. Since the failure is ND, it may or may not have existed since the time we first added the test.

cglagovichTT commented 1 month ago

This failure is very tough to repro. I have found that the failure is slightly more reproducible if I stress the system with something like stress --cpu 32 --vm 2 --vm-bytes 128M --timeout 120s. I have twice reproed this as a memory corruption error during decode iterations, when tt-metal is deallocating tensors.

2024-10-31 15:00:27.120 | INFO     | models.demos.llama3.demo.demo:run_llama3_demo:439 - Iteration 49: 33ms @ 30.5 tok/s/user (30.5 tok/s throughput)
free(): corrupted unsorted chunks

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7de9859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e5426e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f7e298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e5c2fc in malloc_printerr (str=str@entry=0x7ffff7f80718 "free(): corrupted unsorted chunks") at malloc.c:5347
#4  0x00007ffff7e5e042 in _int_free (av=0x7ff244000020, p=0x7ff831902030, have_lock=<optimized out>) at malloc.c:4356
#5  0x00007fff880f1e1f in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<3ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceHostStorage>(tt::tt_metal::MultiDeviceHostStorage&) const::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::owned_buffer::Buffer<unsigned char>, tt::tt_metal::owned_buffer<unsigned short>, tt::tt_metal::owned_buffer<int>, tt::tt_metal::owned_buffer<unsigned int>, tt::tt_metal::owned_buffer<float>, tt::tt_metal::owned_buffer<bfloat16> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0::operator()<tt::tt_metal::MultiDeviceHostStorage>(tt::tt_metal::MultiDeviceHostStorage&) const::{lambda(auto:1&&)#1}>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::owned_buffer::Buffer<unsigned char>, tt::tt_metal::owned_buffer<unsigned short>, tt::tt_metal::owned_buffer<int>, tt::tt_metal::owned_buffer<unsigned int>, tt::tt_metal::owned_buffer<float>, tt::tt_metal::owned_buffer<bfloat16> >&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#6  0x00007fff880f1260 in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<3ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::Tensor::deallocate(bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#7  0x00007fff880ec4cd in tt::tt_metal::Tensor::~Tensor() () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#8  0x00007fff88899416 in pybind11::class_<tt::tt_metal::Tensor>::dealloc(pybind11::detail::value_and_holder&) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#9  0x00007fff8827362b in pybind11::detail::clear_instance(_object*) () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so
#10 0x00007fff88273414 in pybind11_object_dealloc () from /home/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so

cglagovichTT commented 1 month ago

repro steps:

cglagovichTT commented 1 month ago

Assigning @tt-asaigal .

Bisect was inconclusive. Commits as early as 0af26ed fail with a segfault, but I found that different commits could take anywhere between 5 and 45 minutes to fail. It is therefore tough to say whether a commit which passed 1 hour of testing is actually "good", so this failure may have been present for a long time.
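
For reference, the inconclusiveness comes down to how long each commit has to be hammered before it can be trusted. Below is a minimal sketch of a time-boxed per-commit check, assuming the repro command earlier in this thread (this is an illustration, not the tt_bisect script itself):

import os
import subprocess
import time

# Repro command from earlier in this thread.
REPRO_CMD = [
    "pytest", "--count=100", "-svv",
    "models/demos/llama3/demo/demo.py::test_llama_demo",
    "-k", "instruct_weights-3_batch",
]

def commit_survives_budget(budget_s: int = 3600) -> bool:
    """Keep re-running the repro until it fails or the time budget runs out."""
    env = {**os.environ, "FAKE_DEVICE": "N300"}
    deadline = time.time() + budget_s
    while time.time() < deadline:
        result = subprocess.run(REPRO_CMD, env=env)
        if result.returncode != 0:
            return False  # reproduced the abort/segfault: mark the commit bad
    return True  # survived the budget: "good", but possibly a false negative

With failures taking 5-45 minutes to appear, a one-hour budget can easily mislabel a bad commit as good, which is why the bisect result is inconclusive.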

tt-asaigal commented 1 month ago

The issue is with the readback logic in the demo:

tt_output_torch = ttnn.to_torch(
    tt_out_tok.cpu(blocking=False, cq_id=1),  # <<<<<< THIS NON-BLOCKING READ: THE DEVICE AND WORKER THREADS ARE NOT GUARANTEED TO POPULATE THIS TENSOR IMMEDIATELY
    mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1),
)[0, 0, 0, :batch_size]  # <<< INDEXING INTO A POTENTIALLY EMPTY TENSOR, WHICH WILL GET POPULATED ASYNCHRONOUSLY
ttnn.record_event(1, write_event)

# Save output token to print out later
for user in range(batch_size):
    user_tok = tt_output_torch[user].tolist()  # <<<<< NO GUARANTEE OUTPUT IS POPULATED HERE EITHER
    if user_tok != 28803 and user_done[user] == False:  # Stop saving the output after hitting the EOS token
        all_outputs[user].append(user_tok)
    else:
        user_done[user] = True
        logger.trace(f"[User {user}] Finished decoding at iteration {iteration}")
        if all(user_done):
            users_decoding = False

Since reads are non-blocking, there is no guarantee that the output tensor is populated by the time it's read in this setup. It's up to the user to ensure that the device and worker threads have been blocked on before indexing into the output.

Changing the read to be blocking resolves the segfault (ran for almost 2 hours and saw no issues). @cglagovichTT I've pushed a change to https://github.com/tenstorrent/tt-metal/tree/refs/heads/asaigal/14475 and can make a PR if this solution is okay (seeing 90-93 t/s/u with blocking reads).
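
For illustration, the blocking variant of the readback would look something like the following sketch, based on the snippet above (not necessarily the exact contents of the asaigal/14475 branch):

# Blocking readback: .cpu(blocking=True) waits for the device and worker
# threads, so the host-side tensor is fully populated before it is indexed.
tt_output_torch = ttnn.to_torch(
    tt_out_tok.cpu(blocking=True, cq_id=1),
    mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1),
)[0, 0, 0, :batch_size]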

A proper solution allowing non-blocking reads for LLMs would be to keep the non-blocking readbacks but buffer their host-side outputs, and only index into an output once the device has been dispatched n tokens past it, synchronizing first as a safety net.

This way, after the warmup iterations, you'll have inputs buffered up and will always be dispatching inference requests n tokens ahead. If n is large enough, the synchronize call will just be a no-op that's there for safety.
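
A minimal sketch of that buffered scheme (the decode_step helper, the delay_n depth, and the use of ttnn.synchronize_device as the barrier are assumptions for illustration; the real implementation could equally use event-based waits):

from collections import deque

import ttnn

def decode_with_buffered_readback(decode_step, mesh_device, num_iterations, batch_size, delay_n=4):
    """Let the device run delay_n tokens ahead of the host-side token handling."""
    pending = deque()  # (iteration, host tensor from a non-blocking .cpu()) pairs
    outputs = {}

    def consume(old_iter, host_tensor):
        # Barrier before touching the data; after warmup this is effectively a no-op.
        ttnn.synchronize_device(mesh_device)
        outputs[old_iter] = ttnn.to_torch(
            host_tensor, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1)
        )[0, 0, 0, :batch_size]

    for iteration in range(num_iterations):
        tt_out_tok = decode_step(iteration)  # hypothetical: one decode iteration on device
        # Kick off the non-blocking readback, but do not index into it yet.
        pending.append((iteration, tt_out_tok.cpu(blocking=False, cq_id=1)))
        if len(pending) > delay_n:
            consume(*pending.popleft())

    # Drain whatever is still in flight once decoding finishes.
    while pending:
        consume(*pending.popleft())
    return outputs

The segfault discussed above happens precisely because the demo skips such a barrier and indexes into the tensor while the worker thread may still be writing it.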

mtairum commented 4 weeks ago

Thanks @tt-asaigal !

We'll measure perf with blocking and with the output buffer to make a decision. Depending on the necessary number of iterations for proper buffering, the second option might be a problem, as it will affect the time to first token.

However, for an immediate fix before we complete a perf analysis, I think we can push the blocking change. Thoughts, @cglagovichTT @yieldthought?

cglagovichTT commented 4 weeks ago

I'm in favor of merging the blocking PR to solve the segfault, then implementing Aditya's proper solution.

mtairum commented 4 weeks ago

I will include Aditya's fix in the llama-70b PR (which is now done and under revision).

jvasilje commented 4 weeks ago

why is this still a P0? @cglagovichTT @yieldthought

cglagovichTT commented 4 weeks ago

Changing to P1 since there's a resolution on branch. @mtairum please close once you merge https://github.com/tenstorrent/tt-metal/pull/14273

mtairum commented 3 weeks ago

Blocking fix from Aditya was merged to main. We will monitor the state of CI in the coming days/weeks to see if the issues are still present after the fix.

tt-rkim commented 3 weeks ago

Thanks bossmans @tt-asaigal @mtairum @cglagovichTT