mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Mixtral not able to run on Nvidia Jetson #1752

Closed: raj-khare closed this issue 8 months ago

raj-khare commented 8 months ago

🐛 Bug

I'm trying to run the Mixtral 8x7B model on a Jetson AGX (aarch64, sm_87), but I'm getting the following error:

root@tegra-ubuntu:/# python3 /opt/mlc-llm/benchmark.py --model /data/models/mlc/dist/mixtral-4bit/ --prompt /data/prompts/completion_16.json --max-new-tokens 128
Namespace(model='/data/models/mlc/dist/mixtral-4bit/', prompt=['/data/prompts/completion_16.json'], chat=False, streaming=False, max_new_tokens=128, max_num_prompts=None, save='')
-- loading /data/models/mlc/dist/mixtral-4bit/
[2024-02-14 10:49:26] INFO auto_device.py:76: Found device: cuda:0
[2024-02-14 10:49:27] INFO auto_device.py:85: Not found device: rocm:0
[2024-02-14 10:49:28] INFO auto_device.py:85: Not found device: metal:0
[2024-02-14 10:49:29] INFO auto_device.py:85: Not found device: vulkan:0
[2024-02-14 10:49:30] INFO auto_device.py:85: Not found device: opencl:0
[2024-02-14 10:49:30] INFO auto_device.py:33: Using device: cuda:0
[2024-02-14 10:49:30] INFO chat_module.py:370: Using model folder: /data/models/mlc/dist/mixtral-4bit
[2024-02-14 10:49:30] INFO chat_module.py:371: Using mlc chat config: /data/models/mlc/dist/mixtral-4bit/mlc-chat-config.json
[2024-02-14 10:49:30] INFO chat_module.py:513: Using library model: /data/models/mlc/dist/mixtral-4bit/None.so
[2024-02-14 10:49:31] INFO model_metadata.py:95: Total memory usage: 26206.65 MB (Parameters: 25053.70 MB. KVCache: 0.00 MB. Temporary buffer: 1152.95 MB)
[2024-02-14 10:49:31] INFO model_metadata.py:104: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`

PROMPT:  Once upon a time, there was a little girl who loved to read.

Traceback (most recent call last):
  File "/opt/mlc-llm/benchmark.py", line 127, in <module>
    print(cm.benchmark_generate(prompt=prompt, generate_length=args.max_new_tokens).strip())
  File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 977, in benchmark_generate
    self._prefill(prompt)
  File "/usr/local/lib/python3.10/dist-packages/mlc_chat/chat_module.py", line 1078, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x1f0) [0xffff6bb0c050]
  [bt] (7) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x208) [0xffff6bb0bc68]
  [bt] (6) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x68c) [0xffff6bb0d2bc]
  [bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x84) [0xffff6bb09fb4]
  [bt] (4) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(+0x30b8ca4) [0xffff6bac8ca4]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(+0x307ac6c) [0xffff6ba8ac6c]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(+0x307aa68) [0xffff6ba8aa68]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff69c336a8]
  [bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff6ba8d050]
  File "/opt/mlc-llm/3rdparty/tvm/src/runtime/library_module.cc", line 78
TVMError: Assert fail: T.tvm_struct_get(indptr_handle, 0, 5, "uint8") == T.uint8(0) and T.tvm_struct_get(indptr_handle, 0, 6, "uint8") == T.uint8(32) and T.tvm_struct_get(indptr_handle, 0, 7, "uint16") == T.uint16(1), dequantize_group_gemm.indptr_handle.dtype is expected to be int32
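
For context, the assert is just a dtype guard on the group-GEMM's indptr argument: under TVM's DLTensor struct field layout (field 5 = dtype code, 6 = dtype bits, 7 = dtype lanes), the expected values 0 / 32 / 1 mean int32, so some buffer upstream is being created with a different dtype. A small sketch of the mismatch it guards against (not from the issue; assumes TVM and NumPy are installed):

import numpy as np
import tvm

# Minimal illustration (not from the issue): the assert above checks the
# DLTensor dtype fields of `indptr` (field 5 = dtype code, 6 = bits,
# 7 = lanes), i.e. it requires an int32 tensor; an int64 indptr would fail.
indptr_ok = tvm.nd.array(np.arange(9, dtype="int32"))   # passes the check
indptr_bad = tvm.nd.array(np.arange(9, dtype="int64"))  # would trip the assert

print(indptr_ok.dtype)   # int32
print(indptr_bad.dtype)  # int64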

My chat config

from mlc_chat import ChatConfig, ChatModule

cfg = ChatConfig(max_gen_len=args.max_new_tokens, context_window_size=4096,
                 prefill_chunk_size=4096, sliding_window_size=1024)

if not args.chat:
    cfg.conv_template = 'LM'

cm = ChatModule(model="/data/models/mlc/dist/mixtral-4bit",
                model_lib_path="/data/models/mlc/dist/mixtral-4bit/None.so",
                chat_config=cfg)
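
(Not related to the assert itself, but following the model_metadata.py hint above: a hypothetical variant of this config with a smaller prefill_chunk_size would shrink the ~1.1 GB temporary prefill buffer; the value below is illustrative, not from the issue.)

# Hypothetical tweak, illustrative value only: a smaller prefill chunk
# reduces the temporary buffer reported by model_metadata.py.
cfg_small = ChatConfig(max_gen_len=128, context_window_size=4096,
                       prefill_chunk_size=1024, sliding_window_size=1024)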

To Reproduce

Steps to reproduce the behavior:

I compiled MLC LLM with the following CMake flags:

cmake -G Ninja \
    -DCMAKE_CXX_STANDARD=17 \
    -DCMAKE_CUDA_STANDARD=17 \
    -DCMAKE_CUDA_ARCHITECTURES=${CUDAARCHS} \
    -DUSE_CUDA=ON \
    -DFLASHINFER_CUDA_ARCHITECTURES=87 \
    -DUSE_FLASHINFER=ON \
    -DUSE_CUDNN=ON \
    -DUSE_CUBLAS=ON \
    -DUSE_CURAND=ON \
    -DUSE_CUTLASS=ON \
    -DUSE_THRUST=ON \
    -DUSE_GRAPH_EXECUTOR_CUDA_GRAPH=ON \
    -DUSE_STACKVM_RUNTIME=ON \
    -DUSE_LLVM="/usr/bin/llvm-config --link-static" \
    -DHIDE_PRIVATE_SYMBOLS=ON \
    -DSUMMARIZE=ON
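
(As a quick sanity check, not part of the original report: TVM can report the device's compute capability, which on this Jetson should match the sm_87 value passed via FLASHINFER_CUDA_ARCHITECTURES / CMAKE_CUDA_ARCHITECTURES above.)

import tvm

# Assumes the TVM runtime built above with CUDA enabled; a Jetson AGX with
# sm_87 should report "8.7" here.
dev = tvm.cuda(0)
print(dev.exist)            # True if the CUDA runtime sees the GPU
print(dev.compute_version)  # e.g. "8.7"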

Expected behavior

The model should run without any issues.

Environment

- Platform: NVIDIA Jetson AGX (aarch64, sm_87), Ubuntu (tegra-ubuntu)
- Python: 3.10
- Device: CUDA (cuda:0)
- MLC LLM / TVM: built from source with the flags above

Any help is highly appreciated!

anibohara2000 commented 8 months ago

Fixed by PR #1778