mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Failure to load model on macOS with error about TVM #2882

Closed. vlbosch closed this issue 1 month ago

vlbosch commented 1 month ago

🐛 Bug

After converting Mistral-Large-Instruct-2407 and trying to load the model for chat or serving, the following error appears:

"(mlc-llm) USER@MBPM3MVLB ~ % mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --port 9999 [2024-09-05 13:46:03] INFO auto_device.py:88: Not found device: cuda:0 [2024-09-05 13:46:04] INFO auto_device.py:88: Not found device: rocm:0 [2024-09-05 13:46:05] INFO auto_device.py:79: Found device: metal:0 [2024-09-05 13:46:05] INFO auto_device.py:88: Not found device: vulkan:0 [2024-09-05 13:46:06] INFO auto_device.py:88: Not found device: opencl:0 [2024-09-05 13:46:06] INFO auto_device.py:35: Using device: metal:0 [2024-09-05 13:46:06] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY [2024-09-05 13:46:06] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/3826dfed383847636248c8e5e540102b.dylib [2024-09-05 13:46:06] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory. [2024-09-05 13:46:06] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive". [2024-09-05 13:46:06] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server". [13:46:06] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:687: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048. [13:46:06] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:687: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048. [13:46:06] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:687: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048. [13:46:06] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:768: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048. [13:46:06] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:773: Estimated total single GPU memory usage: 70063.542 MB (Parameters: 65776.148 MB. KVCache: 2969.526 MB. Temporary buffer: 1317.867 MB). The actual usage might be slightly larger than the estimated number. 
Exception in thread Thread-1: Traceback (most recent call last): File "/opt/homebrew/Caskroom/miniconda/base/envs/mlc-llm/lib/python3.12/threading.py", line 1073, in _bootstrap_inner self.run() File "/opt/homebrew/Caskroom/miniconda/base/envs/mlc-llm/lib/python3.12/threading.py", line 1010, in run self._target(*self._args, **self._kwargs) File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.call File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3 File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL File "/opt/homebrew/Caskroom/miniconda/base/envs/mlc-llm/lib/python3.12/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error raise py_err tvm.error.InternalError: Traceback (most recent call last): File "/Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/include/tvm/runtime/packed_func.h", line 649 InternalError: Check failed: typecode == kTVMPackedFuncHandle (0 vs. 10) : expected FunctionHandle but got int"

To Reproduce

Steps to reproduce the behavior:

  1. Try to load the model with mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --port 9999
  2. The model fails to load and the command exits with the error above.
  3. The Python process must be terminated manually; Control+C does not quit the program (see the note after this list).
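
Since Control+C does not stop the hung process, it has to be killed manually. A minimal way to do that (standard macOS/Unix process tools, nothing specific to mlc_llm):

```shell
# Find and terminate the stuck serve process; `pkill -f` matches against the full command line.
pkill -f "mlc_llm serve"

# If it ignores SIGTERM, look up the PID and force-kill it:
#   ps aux | grep mlc_llm
#   kill -9 <PID>
```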

Expected behavior

The model loads and can be served.
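
For reference, mlc_llm serve exposes an OpenAI-compatible REST API, so a successful start should be verifiable with a simple request. A minimal sketch, assuming the server came up on port 9999 as in the command above; using the local model path as the "model" identifier is an assumption, so adjust it to whatever identifier the server actually expects:

```shell
# Hypothetical smoke test against the OpenAI-compatible chat endpoint of `mlc_llm serve`.
# Assumes the server started successfully on 127.0.0.1:9999.
curl -s http://127.0.0.1:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/Users/USER/LLM/Mistral-Large-Instruct-2407-MLC",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```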

Environment

Additional context

I converted the model myself with:

mlc_llm convert_weight /Users/USER/LLM/Mistral-Large-Instruct-2407 --quantization q4f16_1 --output /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC

followed by:

mlc_llm gen_config /Users/USER/LLM/Mistral-Large-Instruct-2407 --quantization q4f16_1 --output /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --conv-template mistral_default

MasterJH5574 commented 1 month ago

Apologies for the inconvenience. The latest nightly packages have fixed the issue. You may need to set the environment variable MLC_JIT_POLICY=REDO (e.g. MLC_JIT_POLICY=REDO python -m mlc_llm chat ...) to force automatic model recompilation after upgrading.
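
For the serve command from this report, the same policy applies. A minimal sketch reusing the reporter's model path and port, to be run after upgrading to the latest nightly packages:

```shell
# Force a one-off recompilation of the cached model library, then start the server.
# MLC_JIT_POLICY=REDO comes from the maintainer's comment above; the path and port are from the original report.
MLC_JIT_POLICY=REDO mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --port 9999
```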

vlbosch commented 1 month ago

Thanks for the quick solution! I can confirm that the chat and serve commands do work with the latest nightly.