zhangxiao-stack closed this issue 1 year ago.
Hi @zhangxiao-stack, thanks for reporting! This seems to be a weird issue... I wasn't able to replicate it on my end. Besides, this tutorial (runnable on Colab), which uses the most recent mlc packages, works fine as well.
Would you mind reinstalling the packages and retry? https://mlc.ai/package/
@CharlieFRuan, thanks for the reply. I reinstalled the mlc_chat* packages from source as follows:
step1:
git branch -v
* main 5790c74 [Docs] README revamp (#980)
mkdir build
cd build
python3 ../cmake/gen_cmake_config.py
cmake .. && cmake --build . --parallel $(nproc) && cd ..
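For reference, a minimal check (artifact names assumed from a typical mlc-llm cmake build; they may differ on your setup) that the build actually produced the CLI and runtime libraries:

```python
import os

# Assumed artifact names for a typical mlc-llm cmake build; adjust if your build differs.
for name in ("mlc_chat_cli", "libmlc_llm.so", "libmlc_llm_module.so"):
    path = os.path.join("build", name)
    print(path, "exists" if os.path.exists(path) else "MISSING")
```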
step2:
python3 build.py --model Llama-2-7b-chat-hf --hf-path ./dist/models/Llama-2-7b-chat-hf --quantization q0f16 --target cuda
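A quick sanity check (assuming the default dist/&lt;model&gt;-&lt;quantization&gt;/ output layout of build.py; paths may differ) that the q0f16 build produced the model library and weight shards before running the CLI:

```python
import os

# Assumed default output layout of build.py; adjust model_dir if yours differs.
model_dir = "dist/Llama-2-7b-chat-hf-q0f16"
lib = os.path.join(model_dir, "Llama-2-7b-chat-hf-q0f16-cuda.so")
params = os.path.join(model_dir, "params")

print(lib, "exists" if os.path.exists(lib) else "MISSING")
shards = [f for f in os.listdir(params) if f.endswith(".bin")] if os.path.isdir(params) else []
print(f"{len(shards)} weight shard(s) under {params}")
```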
step3:
./build/mlc_chat_cli --model Llama-2-7b-chat-hf-q0f16 --device cuda
[/INST]: [02:55:57] /root/mlc-llm/cpp/llm_chat.cc:791: InternalError: Check failed: (!output_ids_.empty()) is false:
I also reinstalled the mlc_chat* packages from the pip wheel as follows:
step1:
download mlc_chat_nightly_cu118-0.1.dev476-cp38-cp38-manylinux_2_28_x86_64.whl
pip install mlc_chat_nightly_cu118-0.1.dev476-cp38-cp38-manylinux_2_28_x86_64.whl
step2:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# From the mlc-llm directory, run
# $ python sample_mlc_chat.py
# Create a ChatModule instance
cm = ChatModule(model="Llama-2-7b-chat-hf-q0f16")
# You can change to other models that you downloaded, for example,
# cm = ChatModule(model="Llama-2-13b-chat-hf-q4f16_1") # Llama2 13b model
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=5),
)
print(output)
# Print prefill and decode performance statistics
print(f"Statistics: {cm.stats()}\n")
output = cm.generate(
    prompt="How many points did you list out?",
    progress_callback=StreamToStdout(callback_interval=5),
)
print(f"Statistics: {cm.stats()}\n")
print(f"Generated text:\n{output}\n")
errors:
Statistics: prefill: 58.6 tok/s, decode: 85.4 tok/s
Traceback (most recent call last):
File "chat.py", line 22, in <module>
output = cm.generate(
File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 663, in generate
self._decode()
File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 900, in _decode
self._decode_func()
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.8/dist-packages/tvm-0.12.dev1610+gceaf7b015-py3.8-linux-x86_64.egg/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
raise py_err
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 837, in mlc::llm::LLMChat::DecodeStep()
tvm.error.InternalError: Traceback (most recent call last):
0: mlc::llm::LLMChat::DecodeStep()
at /workspace/mlc-llm/cpp/llm_chat.cc:837
File "/workspace/mlc-llm/cpp/llm_chat.cc", line 837
InternalError: Check failed: (!output_ids_.empty()) is false:
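The failed check `(!output_ids_.empty())` in `DecodeStep()` appears to mean that decode was entered before prefill had produced any token. A minimal narrowing-down sketch (same ChatModule API as the script above): if this single-turn call already fails, the problem is in the prefill path rather than in multi-turn decode.

```python
# Minimal repro sketch: one prompt, one generate() call.
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q0f16")
print(cm.generate(prompt="hi"))
print(cm.stats())
```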
The repo does not seem up to date here:
git branch -v
* main 5790c74 [Docs] README revamp (#980)
You can pull again, and build from source.
For the prebuilt, mlc_chat_nightly_cu118-0.1.dev476-cp38-cp38-manylinux_2_28_x86_64.whl
does not seem up to date either. Could you try:
pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels
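After reinstalling, a quick way to confirm the fresh nightlies are the ones actually being imported (this reuses the same `tvm.support.libinfo()` call as in the environment section below):

```python
# Confirm the freshly installed nightlies are the ones Python resolves.
import tvm
import mlc_chat

print(tvm.__file__)        # should point into the new mlc-ai-nightly-cu118 install
print(mlc_chat.__file__)   # should point into the new mlc-chat-nightly-cu118 install
print(tvm.support.libinfo()["GIT_COMMIT_HASH"])
```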
@CharlieFRuan I updated the mlc-llm source code:
git branch -v
* main 20131fb Update README.md (#1045)
step1: reinstall mlc-llm from source
step2: build and run Llama-2-7b-chat-hf
python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target cuda --quantization q0f16 --use-cache=0
./build/mlc_chat_cli --model Llama-2-7b-chat-hf-q0f16 --device cuda
[INST]: hi
[/INST]: [02:07:08] /root/mlc-llm.latest/cpp/llm_chat.cc:818: InternalError: Check failed: (!output_ids_.empty()) is false:
step3: build and run vicuna-7b-v1.1-q0f16 in the same way; no errors occurred:
python3 -m mlc_llm.build --model vicuna-7b-v1.1 --target cuda --quantization q0f16 --use-cache=0
./build/mlc_chat_cli --model vicuna-7b-v1.1-q0f16 --device cuda
USER: hello
ASSISTANT: Hello! How can I help you today? Is there something you would like to talk about or ask me a question about? I'm here to assist you with any information or guidance you may need.
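One way to narrow this down further: compare the generated chat configs of the failing Llama-2 build and the working vicuna build, to see whether fields such as `conv_template` differ in a way that explains it. A sketch, assuming the default `dist/<model>/params/mlc-chat-config.json` layout; paths and keys may differ across versions:

```python
import json
import os

# Assumed default dist/ layout and config keys; adjust to your setup.
for name in ("Llama-2-7b-chat-hf-q0f16", "vicuna-7b-v1.1-q0f16"):
    cfg_path = os.path.join("dist", name, "params", "mlc-chat-config.json")
    with open(cfg_path) as f:
        cfg = json.load(f)
    print(name, "->", {k: cfg.get(k) for k in ("conv_template", "temperature", "top_p")})
```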
🐛 Bug
File "/public/llm_chat.cc", line 791: InternalError: Check failed: (!output_ids_.empty()) is false:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Tesla V100S-PCIE-32GB
How you installed MLC-LLM (conda, source): source
How you installed TVM-Unity (pip, source): source
Python version (e.g. 3.10): 3.8.10
GPU driver version (if applicable):
CUDA/cuDNN version (if applicable): v11.8
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
USE_NVTX: OFF USE_GTEST: AUTO SUMMARIZE: OFF USE_IOS_RPC: OFF USE_MSC: OFF USE_ETHOSU: OFF CUDA_VERSION: 11.8 USE_LIBBACKTRACE: AUTO DLPACK_PATH: 3rdparty/dlpack/include USE_TENSORRT_CODEGEN: OFF USE_THRUST: OFF USE_TARGET_ONNX: OFF USE_AOT_EXECUTOR: ON BUILD_DUMMY_LIBTVM: OFF USE_CUDNN: OFF USE_TENSORRT_RUNTIME: OFF USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF USE_CCACHE: AUTO USE_ARM_COMPUTE_LIB: OFF USE_CPP_RTVM: OFF USE_OPENCL_GTEST: /path/to/opencl/gtest USE_MKL: OFF USE_PT_TVMDSOOP: OFF MLIR_VERSION: NOT-FOUND USE_CLML: OFF USE_STACKVM_RUNTIME: OFF USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF ROCM_PATH: /opt/rocm USE_DNNL: OFF USE_VITIS_AI: OFF USE_MLIR: OFF USE_RCCL: OFF USE_LLVM: llvm-config --ignore-libllvm --link-static USE_VERILATOR: OFF USE_TF_TVMDSOOP: OFF USE_THREADS: ON USE_MSVC_MT: OFF BACKTRACE_ON_SEGFAULT: OFF USE_GRAPH_EXECUTOR: ON USE_NCCL: OFF USE_ROCBLAS: OFF GIT_COMMIT_HASH: ceaf7b0156524d30537a3de5fa30764eaff4edb8 USE_VULKAN: OFF USE_RUST_EXT: OFF USE_CUTLASS: ON USE_CPP_RPC: OFF USE_HEXAGON: OFF USE_CUSTOM_LOGGING: OFF USE_UMA: OFF USE_FALLBACK_STL_MAP: OFF USE_SORT: ON USE_RTTI: ON GIT_COMMIT_TIME: 2023-09-18 20:10:22 -0400 USE_HEXAGON_SDK: /path/to/sdk USE_BLAS: none USE_ETHOSN: OFF USE_LIBTORCH: OFF USE_RANDOM: ON USE_CUDA: ON USE_COREML: OFF USE_AMX: OFF BUILD_STATIC_RUNTIME: OFF USE_CMSISNN: OFF USE_KHRONOS_SPIRV: OFF USE_CLML_GRAPH_EXECUTOR: OFF USE_TFLITE: OFF USE_HEXAGON_GTEST: /path/to/hexagon/gtest PICOJSON_PATH: 3rdparty/picojson USE_OPENCL_ENABLE_HOST_PTR: OFF INSTALL_DEV: OFF USE_PROFILER: ON USE_NNPACK: OFF LLVM_VERSION: 15.0.7AOMP USE_OPENCL: OFF COMPILER_RT_PATH: 3rdparty/compiler-rt RANG_PATH: 3rdparty/rang/include USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF USE_OPENMP: none USE_BNNS: OFF USE_CUBLAS: OFF USE_METAL: OFF USE_MICRO_STANDALONE_RUNTIME: OFF USE_HEXAGON_EXTERNAL_LIBS: OFF USE_ALTERNATIVE_LINKER: AUTO USE_BYODT_POSIT: OFF USE_HEXAGON_RPC: OFF USE_MICRO: OFF DMLC_PATH: 3rdparty/dmlc-core/include INDEX_DEFAULT_I64: ON USE_RELAY_DEBUG: OFF USE_RPC: ON USE_TENSORFLOW_PATH: none TVM_CLML_VERSION: USE_MIOPEN: OFF USE_ROCM: OFF USE_PAPI: OFF USE_CURAND: OFF TVM_CXX_COMPILER_PATH: /usr/bin/c++ HIDE_PRIVATE_SYMBOLS: ON
mlc-llm config:
set(TVM_HOME 3rdparty/tvm)
set(CMAKE_BUILD_TYPE RelWithDebInfo)
set(USE_CUDA ON)
set(USE_CUTLASS OFF)
set(USE_CUBLAS OFF)
set(USE_ROCM OFF)
set(USE_VULKAN OFF)
set(USE_METAL OFF)
set(USE_OPENCL OFF)
Additional context