mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Inference with llava throws an error #3039

Open HoiM opened 4 days ago

HoiM commented 4 days ago

🐛 Bug

I am trying to run LLaVA with mlc-llm. On both a Linux server and a local macOS machine, I encountered the error below.

(Run export RUST_BACKTRACE=full before running the inference program to get the full backtrace.)

[2024-11-21 14:48:31] INFO auto_device.py:88: Not found device: cuda:0
[2024-11-21 14:48:31] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-21 14:48:32] INFO auto_device.py:79: Found device: metal:0
[2024-11-21 14:48:32] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-21 14:48:33] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-21 14:48:33] INFO auto_device.py:35: Using device: metal:0
[2024-11-21 14:48:33] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-11-21 14:48:33] INFO jit.py:158: Using cached model lib: /Users/yuhaiming/.cache/mlc_llm/model_lib/844b459aad26bf51753183241229d8bb.dylib
[2024-11-21 14:48:33] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-11-21 14:48:33] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-11-21 14:48:33] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
thread '<unnamed>' panicked at src/lib.rs:26:50:
called `Result::unwrap()` on an `Err` value: Error("data did not match any variant of untagged enum ModelWrapper", line: 277157, column: 1)
stack backtrace:
   0:        0x104d41e78 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hade97c44b56fc870
   1:        0x104d8f6fc - core::fmt::write::h81cbefbffc581dab
   2:        0x104d51d64 - std::io::Write::write_fmt::h125c60058ebfe43c
   3:        0x104d41cb4 - std::sys_common::backtrace::print::hfa54be0dd0cf5860
   4:        0x104d61794 - std::panicking::default_hook::{{closure}}::h4235e0929057f079
   5:        0x104d61514 - std::panicking::default_hook::hcf67171e7c25be94
   6:        0x104d61c54 - std::panicking::rust_panic_with_hook::h1767d40d669aa9fe
   7:        0x104d42668 - std::panicking::begin_panic_handler::{{closure}}::h83ff281d56dc913c
   8:        0x104d42090 - std::sys_common::backtrace::__rust_end_short_backtrace::h2f399e8aa761a4f1
   9:        0x104d619e8 - _rust_begin_unwind
  10:        0x104dc460c - core::panicking::panic_fmt::hc32404f2b732859f
  11:        0x104dc44b4 - core::result::unwrap_failed::h2ea3b6e22f1f6a7c
  12:        0x104b0448c - _tokenizers_new_from_str
  13:        0x104af69d4 - __ZN10tokenizers9Tokenizer12FromBlobJSONERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE
  14:        0x104aea804 - __ZN3mlc3llm9Tokenizer8FromPathERKN3tvm7runtime6StringENSt3__18optionalINS0_13TokenizerInfoEEE
  15:        0x104af1b98 - __ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_15TypedPackedFuncIFN3mlc3llm9TokenizerERKNS0_6StringEEE17AssignTypedLambdaINS6_3$_3EEEvT_NSt3__112basic_stringIcNSG_11char_traitsIcEENSG_9allocatorIcEEEEEUlRKNS0_7TVMArgsEPNS0_11TVMRetValueEE_EEE4CallEPKS1_SN_SR_
  16:        0x10e01e710 - _TVMFuncCall
  17:        0x101f37910 - __ZL39__pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCallPvP7_objectP8TVMValuePi
  18:        0x101f370dc - __ZL76__pyx_pw_3tvm_4_ffi_4_cy3_4core_10ObjectBase_3__init_handle_by_constructor__P7_objectS0_S0_
  19:        0x1018ce11c - _PyCFunction_GetFlags
  20:        0x10188ac34 - __PyObject_MakeTpCall
  21:        0x10195edec - __PyEval_EvalFrameDefault
  22:        0x10195be48 - __PyEval_EvalFrameDefault
  23:        0x10188b474 - __PyFunction_Vectorcall
  24:        0x10188aa5c - __PyObject_FastCallDictTstate
  25:        0x10188b7c0 - __PyObject_Call_Prepend
  26:        0x1018efa9c - __PyType_Lookup
  27:        0x1018e70f8 - __PyType_Lookup
  28:        0x10188ac34 - __PyObject_MakeTpCall
  29:        0x10195edec - __PyEval_EvalFrameDefault
  30:        0x10195bec4 - __PyEval_EvalFrameDefault
  31:        0x10195fb40 - __PyEval_EvalFrameDefault
  32:        0x10188b3bc - __PyFunction_Vectorcall
  33:        0x10188d58c - _PyMethod_New
  34:        0x10195ed8c - __PyEval_EvalFrameDefault
  35:        0x10195bf40 - __PyEval_EvalFrameDefault
  36:        0x10195fb40 - __PyEval_EvalFrameDefault
  37:        0x10188b3bc - __PyFunction_Vectorcall
  38:        0x10188a9d8 - __PyObject_FastCallDictTstate
  39:        0x10188b7c0 - __PyObject_Call_Prepend
  40:        0x1018efa9c - __PyType_Lookup
  41:        0x1018e70f8 - __PyType_Lookup
  42:        0x10188ac34 - __PyObject_MakeTpCall
  43:        0x10195edec - __PyEval_EvalFrameDefault
  44:        0x10195bf40 - __PyEval_EvalFrameDefault
  45:        0x10195fb40 - __PyEval_EvalFrameDefault
  46:        0x101956504 - _PyEval_EvalCode
  47:        0x10199a7d8 - _PyParser_ASTFromStringObject
  48:        0x10199a9ac - _PyRun_FileExFlags
  49:        0x101998ad4 - _PyRun_SimpleFileExFlags
  50:        0x1019b5d08 - _Py_RunMain
  51:        0x1019b6178 - _Py_Main
  52:        0x1019b6218 - _Py_BytesMain
fatal runtime error: failed to initiate panic, error 5
zsh: abort      /usr/bin/python3 run-llava.py
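
The panic originates in the Rust tokenizers crate bundled with mlc-llm: the backtrace shows Tokenizer::FromBlobJSON failing to deserialize the converted model's tokenizer.json ("data did not match any variant of untagged enum ModelWrapper"). As a rough sanity check, the same file can be parsed with the Python tokenizers package; this is only a sketch with a placeholder path, and the Python package may be a different version than the crate bundled with mlc-llm, so a successful parse here does not rule out the crash.

    # Hypothetical diagnostic: try to parse the tokenizer.json that mlc-llm loads.
    # If this raises an exception, the file uses a tokenizer variant that this
    # tokenizers version cannot deserialize.
    from tokenizers import Tokenizer

    tok = Tokenizer.from_file("/path/to/llava-1.5-7b-hf-mlc/tokenizer.json")
    print(tok.get_vocab_size())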

To Reproduce

Steps to reproduce the behavior:

  1. Install packages:

On macOS:

pip install mlc_ai_cpu-0.17.1-cp39-cp39-macosx_13_0_arm64.whl
pip install mlc_llm_cpu-0.17.1-cp39-cp39-macosx_13_0_arm64.whl 

On Linux:

pip install mlc_ai_cu123-0.17.2-cp310-cp310-manylinux_2_28_x86_64.whl 
pip install mlc_llm_cu123-0.17.2-cp310-cp310-manylinux_2_28_x86_64.whl
  2. Convert and compile the model:

mlc_llm convert_weight --model-type llava ../hub/llava-hf/llava-1.5-7b-hf --quantization q4f16_1 -o llava-1.5-7b-hf-mlc
mlc_llm gen_config ../hub/llava-hf/llava-1.5-7b-hf --quantization q4f16_1 --conv-template llava -o llava-1.5-7b-hf-mlc
mlc_llm compile llava-1.5-7b-hf-mlc/mlc-chat-config.json --device cuda -o llava-1.5-7b-hf-mlc/llava-1.5-7b-q4f16_1-cuda.so  # or, for macOS, e.g. --device metal with a .dylib output
  3. Run the model:
    
    from mlc_llm import MLCEngine
    import PIL.Image
    from io import BytesIO
    import base64

    model = "/path/to/llava-1.5-7b-hf-mlc"
    model_lib = "/path/to/llava-1.5-7b-hf-mlc/llava-1.5-7b-q4f16_1-cuda.so"
    image_path = "/path/to/image.jpg"
    engine = MLCEngine(model=model, model_lib=model_lib)

    img = PIL.Image.open(image_path)
    img_resized = img.resize((336, 336))

    img_byte_arr = BytesIO()
    img_resized.save(img_byte_arr, format="JPEG")
    img_byte_arr = img_byte_arr.getvalue()

    new_url = f"data:image/jpeg;base64,{base64.b64encode(img_byte_arr).decode('utf-8')}"

    for response in engine.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": new_url},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content, end="", flush=True)

    engine.terminate()
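
Incidentally (unrelated to the crash), the engine log above suggests selecting the "interactive" mode when there are no concurrent requests. A minimal sketch of that tweak, reusing the placeholder paths from the script above and assuming the mode the log refers to is exposed as a keyword argument on the MLCEngine constructor:

    # Sketch: pick the "interactive" engine mode hinted at by engine_base.py
    # (placeholder paths; the mode argument is assumed from the log hints).
    from mlc_llm import MLCEngine

    engine = MLCEngine(
        model="/path/to/llava-1.5-7b-hf-mlc",
        model_lib="/path/to/llava-1.5-7b-hf-mlc/llava-1.5-7b-q4f16_1-cuda.so",
        mode="interactive",
    )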

MasterJH5574 commented 4 days ago

Hi @HoiM, could you try installing the latest nightly packages of mlc-llm and mlc-ai? We fixed this issue last week in https://github.com/mlc-ai/mlc-llm/commit/347407375474e99dcb14647299853c7e1263c008, and the fix is not yet included in a stable release. You can find the nightly package installation instructions at https://llm.mlc.ai/docs/install/mlc_llm.html.

HoiM commented 3 days ago

On my macOS machine, reinstalling with the following two wheels solved the problem.

pip install mlc_ai_nightly_cpu-0.18.dev249-cp39-cp39-macosx_13_0_arm64.whl
pip install mlc_llm_nightly_cpu-0.18.dev71-cp39-cp39-macosx_13_0_arm64.whl

THX!