tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Llama3] Segfault on decode after running prefill in the same pytest #9224

Open avoraTT opened 4 weeks ago

avoraTT commented 4 weeks ago

When running test_llama_model_t3000.py with the llama3 pytest parameter and the tests in the order "prefill_128", "decode", "prefill_2k", the decode test segfaults. The tiktoken module calls _PyObject_GC_New(), which leads to a Tensor::deallocate call on device and the following message (machine: sjc-snva-t3002):

Fatal Python error: Segmentation fault

Thread 0x00007f71e2ffd700 (most recent call first):
  File "/usr/lib/python3.8/threading.py", line 306 in wait
  File "/usr/lib/python3.8/threading.py", line 558 in wait
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f75757ca740 (most recent call first):
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/core.py", line 50 in __init__
  File "/home/avora/tt-metal/models/experimental/llama2_70b/reference/llama/llama/tokenizer3.py", line 78 in __init__
  File "/home/avora/tt-metal/models/experimental/llama2_70b/reference/llama/llama/generation.py", line 151 in build
  File "/home/avora/tt-metal/models/experimental/llama2_70b/tests/test_llama_model.py", line 80 in run_test_LlamaModel_inference
  File "/home/avora/tt-metal/models/experimental/llama2_70b/tests/test_llama_model_t3000.py", line 83 in test_LlamaModel_inference
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/avora/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/avora/tt-metal/python_env/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

Although the segfault takes place in the tokenizer, the gdb trace (gdb_trace_llama3_segfault_decode.txt) shows that there is a deallocation issue somewhere. We thought this was similar to #8965; however, after applying the tensor deallocation fix by @aliuTT, our issue still persists.

Note: if we switch the order of the tests from "prefill_128", "decode", "prefill_2k" to "decode", "prefill_128", "prefill_2k", the segfault goes away. This may point to an issue with state being carried over between different pytests.
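
One way the test order could matter: an object left in a reference cycle by an earlier test is not freed when that test ends, but at whatever later allocation triggers Python's cyclic garbage collector. The sketch below is a library-free illustration of that timing only. `FakeDevice` and `FakeTensor` are hypothetical stand-ins, not tt-metal classes, and the explicit `gc.collect()` stands in for a collection that happens to fire inside the tokenizer.

```python
import gc

events = []  # records the order in which things happen


class FakeDevice:
    """Hypothetical stand-in for a device handle; not a tt-metal class."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False
        events.append("device closed")


class FakeTensor:
    """Hypothetical stand-in for a tensor whose destructor touches its device."""

    def __init__(self, device):
        self.device = device
        self.cycle = self  # self-reference: refcounting alone never frees this

    def __del__(self):
        state = "open" if self.device.open else "CLOSED"
        events.append(f"tensor finalized while device {state}")


def first_test():
    device = FakeDevice()
    FakeTensor(device)  # dropped immediately, but kept alive by its cycle
    device.close()      # the "test" tears down its device


gc.disable()            # make the timing deterministic for this demo
first_test()
events.append("second test starts")
gc.collect()            # stands in for a collection triggered in the tokenizer
gc.enable()

print(events)
```

The finalizer runs only after "second test starts", by which point the fake device is already closed. Whether the real failure follows this pattern is an assumption that the gdb trace would need to confirm.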

avoraTT commented 4 weeks ago

@aliuTT @cglagovichTT

aliuTT commented 4 weeks ago

I'll take a look. Can you paste the test command and the commit/branch you're on?

avoraTT commented 4 weeks ago

ssh 10.230.36.208 (sjc-snva-t3002)

You can pull on main. The command is: pytest -svv models/experimental/llama2_70b/tests/test_llama_model_t3000.py

On these lines in models/experimental/llama2_70b/tests/test_llama_model_t3000.py, comment out ("llama2") and only use ("llama3") to speed things up.

Updated screenshot: [image]

aliuTT commented 4 weeks ago

Try this commit: e47709ed0. I wasn't able to get segfaults on sjc-snva-t3002 locally. Also, I'm done with the machine.

mikevin920 commented 4 weeks ago

Thanks! We will stress test this locally as well.

mikevin920 commented 4 weeks ago

After cherry-picking the commit e47709ed0 above, we still see this segfault locally on sjc-snva-t3002:

#0  0x0000000007b25a70 in ?? ()
#1  0x00007fff888244dc in tt::WorkExecutor::push_work(std::shared_ptr<std::function<void ()> >, bool) () from /home/avora/tt-metal/build/lib/libtt_metal.so
#2  0x00007fff8881af5c in tt::tt_metal::Device::push_work(std::shared_ptr<std::function<void ()> >, bool) () from /home/avora/tt-metal/build/lib/libtt_metal.so
#3  0x00007fff88e13fc9 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::Tensor::deallocate(bool)::$_0&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&)>, std::tuple<std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&>, std::integer_sequence<unsigned long, 4ul> >::__visit_invoke(tt::tt_metal::Tensor::deallocate(bool)::$_0&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage>&) () from /home/avora/tt-metal/build/lib/libtt_eager.so
#4  0x00007fff88e0c022 in tt::tt_metal::Tensor::~Tensor() () from /home/avora/tt-metal/build/lib/libtt_eager.so
#5  0x00007fff89229212 in pybind11::class_<tt::tt_metal::Tensor>::dealloc(pybind11::detail::value_and_holder&) () from /home/avora/tt-metal/tt_eager/tt_lib/_C.so
#6  0x00007fff8910d59b in pybind11::detail::clear_instance(_object*) () from /home/avora/tt-metal/tt_eager/tt_lib/_C.so
#7  0x00007fff8910d154 in pybind11_object_dealloc () from /home/avora/tt-metal/tt_eager/tt_lib/_C.so
#8  0x00000000005b030c in ?? ()
#9  0x000000000058738d in ?? ()
#10 0x00000000005b030c in ?? ()
#11 0x000000000058738d in ?? ()
#12 0x00000000005cc0cb in ?? ()
#13 0x00000000005b030c in ?? ()
#14 0x00000000005835c2 in ?? ()
#15 0x00000000004c518f in ?? ()
#16 0x00000000005dca27 in ?? ()
#17 0x0000000000515e6a in _PyObject_GC_New ()
#18 0x00000000006b0403 in ?? ()
#19 0x00000000004e9618 in PyObject_GetIter ()
#20 0x00007fff30e65217 in pyo3::types::iterator::PyIterator::from_object () from /home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/_tiktoken.cpython-38-x86_64-linux-gnu.so
#21 0x00007fff30e6c1aa in pyo3::types::any::PyAny::iter () from /home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/_tiktoken.cpython-38-x86_64-linux-gnu.so
#22 0x00007fff30e6060f in pyo3::types::sequence::extract_sequence () from /home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/_tiktoken.cpython-38-x86_64-linux-gnu.so
#23 0x00007fff30e5eff3 in pyo3::conversions::std::map::<impl pyo3::conversion::FromPyObject for std::collections::hash::map::HashMap<K,V,S>>::extract () from /home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/_tiktoken.cpython-38-x86_64-linux-gnu.so
#24 0x00007fff30e5139b in _tiktoken::_::<impl pyo3::impl_::pyclass::PyMethods<_tiktoken::CoreBPE> for pyo3::impl_::pyclass::PyClassImplCollector<_tiktoken::CoreBPE>>::py_methods::ITEMS::trampoline ()
   from /home/avora/tt-metal/python_env/lib/python3.8/site-packages/tiktoken/_tiktoken.cpython-38-x86_64-linux-gnu.so
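
Reading the trace bottom-up, the allocation in frame #17 (_PyObject_GC_New) appears to kick off a collection that reaches pybind11_object_dealloc (#7), Tensor::~Tensor (#4), and Device::push_work (#2). One possible diagnostic, sketched below under that assumption, is to force a collection at a known point so any leftover objects are finalized there instead of at an arbitrary later allocation. `drain_garbage` and `Node` are hypothetical names used only for illustration.

```python
import gc


class Node:
    """Hypothetical object that keeps itself alive through a reference cycle."""

    def __init__(self):
        self.me = self


def drain_garbage(label):
    """Run a full collection now and report how many unreachable objects it found."""
    found = gc.collect()
    print(f"[{label}] gc.collect() found {found} unreachable objects")
    return found


gc.collect()   # start from a clean slate
Node()         # leaves one unreachable cycle behind
leftover = drain_garbage("after prefill_128")
```

In a pytest suite, a helper like this could be called from an autouse fixture after its `yield`, while the device is still open. If the crash then moves to that call, or disappears, that would support the theory that tensors from the earlier test are being finalized late.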

cglagovichTT commented 3 weeks ago

@mikevin920 @avoraTT are you still seeing this segfault locally?