mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] InternalError: Check failed: sampled_index >= 0 (-1 vs. 0) #771

Closed · daniel-kukiela closed 1 year ago

daniel-kukiela commented 1 year ago

🐛 Bug

The compiled Llama-2-70b-chat-hf model breaks if the prompt is longer than 2k tokens:

Traceback (most recent call last):
  File "/mnt/data/psyber.io/tests/sample_mlc_chat.py", line 55, in <module>
    output = cm.generate(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlc_chat/chat_module.py", line 650, in generate
    self._prefill(prompt)
  File "/usr/local/lib/python3.11/dist-packages/mlc_chat/chat_module.py", line 819, in _prefill
    self._prefill_func(input, decode_next_token, place_in_prompt.value)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 262, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 251, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
tvm.error.InternalError: Traceback (most recent call last):
  8: TVMFuncCall
  7: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /workspace/mlc-llm/cpp/llm_chat.cc:1083
  6: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, mlc::llm::PlaceInPrompt)
        at /workspace/mlc-llm/cpp/llm_chat.cc:621
  5: mlc::llm::LLMChat::SampleTokenFromLogits(tvm::runtime::NDArray, float, float)
        at /workspace/mlc-llm/cpp/llm_chat.cc:776
  4: mlc::llm::LLMChat::SampleFromProbOnCPU()
        at /workspace/mlc-llm/cpp/llm_chat.cc:931
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int (*)(tvm::runtime::NDArray, double, double)>(int (*)(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 421
InternalError: Check failed: sampled_index >= 0 (-1 vs. 0) :

If the prompt is just under 2k tokens and generation pushes the total past 2k tokens, this error occurs instead:

Traceback (most recent call last):
  File "/mnt/data/psyber.io/tests/sample_mlc_chat.py", line 53, in <module>
    output = cm.generate(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/mlc_chat/chat_module.py", line 661, in generate
    self._decode()
  File "/usr/local/lib/python3.11/dist-packages/mlc_chat/chat_module.py", line 856, in _decode
    self._decode_func()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 331, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 262, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 251, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 181, in tvm._ffi._cy3.core.CHECK_CALL
tvm.error.InternalError: Traceback (most recent call last):
  7: TVMFuncCall
  6: mlc::llm::LLMChat::DecodeStep()
        at /workspace/mlc-llm/cpp/llm_chat.cc:640
  5: mlc::llm::LLMChat::SampleTokenFromLogits(tvm::runtime::NDArray, float, float)
        at /workspace/mlc-llm/cpp/llm_chat.cc:776
  4: mlc::llm::LLMChat::SampleFromProbOnCPU()
        at /workspace/mlc-llm/cpp/llm_chat.cc:931
  3: _ZN3tvm7runtime13PackedFun
  2: tvm::runtime::TypedPackedFunc<int (tvm::runtime::NDArray, double, double)>::AssignTypedLambda<int (*)(tvm::runtime::NDArray, double, double)>(int (*)(tvm::runtime::NDArray, double, double), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::relax_vm::SampleTopPFromProb(tvm::runtime::NDArray, double, double)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/relax_vm/lm_support.cc", line 421
InternalError: Check failed: sampled_index >= 0 (-1 vs. 0) :
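
Both tracebacks fail on the same check in SampleTopPFromProb (lm_support.cc:421): the sampler accumulates probability mass over the vocabulary and returns -1 if the running sum never reaches the drawn threshold, which is what happens when the probabilities coming out of the model are invalid (for example NaN) once the sequence overflows the 2k window. The sketch below is a simplified Python illustration of that sampling pattern, not the actual TVM implementation, just to show how sampled_index can end up as -1:

import random

def sample_top_p(prob, top_p=0.95):
    # Simplified top-p sampling loop (illustration only): walk the
    # distribution, accumulate mass, and return the first index whose
    # cumulative probability reaches a randomly drawn threshold.
    threshold = random.random() * top_p
    cumulative = 0.0
    sampled_index = -1
    for i, p in enumerate(prob):
        cumulative += p              # a NaN here poisons the running sum ...
        if cumulative >= threshold:  # ... so this branch is never taken
            sampled_index = i
            break
    return sampled_index

healthy = [0.7, 0.2, 0.1]
broken = [float("nan")] * 3          # what invalid probabilities can look like past 2k tokens
print(sample_top_p(healthy))         # a valid index in 0..2
print(sample_top_p(broken))          # -1, the value the InternalError check rejects
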

Llama 2 has a context length of 4k tokens, and I compiled the model with:

python3 -m mlc_llm.build --model dist/models/Llama-2-70b-chat-hf --target cuda --quantization q4f16_1 --max-seq-len 4096

To Reproduce

Steps to reproduce the behavior:

  1. Compile the model for 4k context length: python3 -m mlc_llm.build --model dist/models/Llama-2-70b-chat-hf --target cuda --quantization q4f16_1 --max-seq-len 4096
  2. Use a prompt longer than 2k tokens to trigger the first error, or a prompt just under 2k tokens (so generation crosses 2k during decoding) to trigger the second error; a minimal repro sketch follows these steps
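
For reference, a minimal repro sketch along the lines of the sample_mlc_chat.py script in the tracebacks; the ChatModule constructor arguments and the local model name are assumptions and may differ by mlc_chat version and build output layout:

from mlc_chat import ChatModule

# Assumed local model id produced by the build command above; point this at
# wherever mlc_llm.build placed the compiled Llama-2-70b-chat-hf artifacts.
cm = ChatModule(model="Llama-2-70b-chat-hf-q4f16_1")

# A prompt that tokenizes to more than ~2k tokens (but under the compiled 4k
# limit) reproduces the prefill crash; one just under 2k tokens reproduces the
# decode-time crash once generation pushes the total past 2k.
long_prompt = "Summarize the following text:\n" + "lorem ipsum dolor sit amet " * 250

output = cm.generate(prompt=long_prompt)
print(output)
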

Expected behavior

The code runs and generates output.

Environment

tqchen commented 1 year ago

We have updated the overflow support, so this should be OK now.
