mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Speculative decoding small draft doesn't work on macOS #2907

Open vlbosch opened 5 days ago

vlbosch commented 5 days ago

🐛 Bug

I tried to use Mistral 7B Instruct v0.3 as a small draft model for Mistral Large 2407. When the engine is not started with "--mode server", the models never respond; I suspect this is because only the CPU is used instead of the GPU. When serving with "--mode server", the first token is streamed in the frontend, but then I get the following error: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false.

To Reproduce

Steps to reproduce the behavior:

  1. Download Mistral Large 2407
  2. Quantize model and gen config
  3. Run Mistral Large to see if it works standalone
  4. Run the speculative decoding with: python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
  5. The first token is streamed, then the error message below appears (see the request sketch after this list)
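
For context, here is a minimal client sketch (not part of the original report) that exercises the OpenAI-compatible /v1/chat/completions endpoint visible in the log below. The prompt is made up, and the "model" field is assumed to be the local model path from step 4; adjust both for your setup.

```python
# Hypothetical reproduction client: stream a chat completion from the locally
# served engine started with the command in step 4 (port 9999).
import json
import requests

payload = {
    "model": "/Users/USER/LLM/Mistral-Large-Instruct-2407-MLC",  # placeholder path
    "messages": [{"role": "user", "content": "Write a short poem about the sea."}],
    "stream": True,
}

with requests.post(
    "http://127.0.0.1:9999/v1/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines():
        # Server-sent events: each "data: ..." line carries a JSON chunk with a delta.
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```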

USER@MBPM3MVLB ~ % python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
[2024-09-16 08:50:13] INFO auto_device.py:79: Found device: metal:0
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/3826dfed383847636248c8e5e540102b.dylib
[2024-09-16 08:50:13] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO download_cache.py:166: Weights already downloaded: /Users/USER/.cache/mlc_llm/model_weights/hf/mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/7bbcaf068957bbf173dbd8ad644faea6.dylib
[2024-09-16 08:50:13] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-09-16 08:50:13] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-09-16 08:50:13] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 80, max KV cache token capacity is 32768, prefill chunk size is 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 86697.674 MB (Parameters: 69664.656 MB. KVCache: 15602.123 MB. Temporary buffer: 1430.894 MB). The actual usage might be slightly larger than the estimated number.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
INFO: Started server process [69315]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:9999 (Press CTRL+C to quit)
INFO: 127.0.0.1:58406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [08:50:41] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine_actions/batch_draft.cc:151: InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false: Stack trace:

zsh: abort python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC

Expected behavior

The model streams the output to the provided prompt.

Environment

Additional context

Both models work fine separately.

MasterJH5574 commented 5 days ago

Thank you @vlbosch. We also ran into this and got it fixed in #2906. The nightly packages are currently being built and will be ready in a few hours. I'll report back when the nightly build is done.

MasterJH5574 commented 4 days ago

Hi @vlbosch, the nightly wheel has been updated. Could you please try upgrading?
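
(For anyone following along: upgrading to the nightly wheels on macOS typically means re-running the installer command from the MLC LLM installation docs, roughly `python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly`, and then restarting the serve command. The exact package names depend on your platform, so check the install page if in doubt.)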

vlbosch commented 4 days ago

@MasterJH5574 Thanks for the quick response! I just updated to the latest nightly and retried. Small-draft mode does work now; however, serving with the small draft model is slower than running Mistral Large alone. I thought the baseline would be the regular speed of the large model? Or does that only hold for the other speculative modes like EAGLE and Medusa?
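
One rough way to quantify such a slowdown (a sketch under assumptions, not from the thread) is to time the same non-streaming request against the server once with and once without the --additional-models/--speculative-mode flags and compare decode throughput. The model path, prompt, and the presence of an OpenAI-style "usage" field in the response are assumptions here.

```python
# Hypothetical throughput check against the locally served engine on port 9999.
# Run it against each server configuration and compare the printed tok/s numbers.
import time
import requests

def measure(prompt: str, max_tokens: int = 256) -> float:
    """Return completion tokens per second for one non-streaming request."""
    payload = {
        "model": "/Users/USER/LLM/Mistral-Large-Instruct-2407-MLC",  # placeholder path
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    start = time.perf_counter()
    resp = requests.post("http://127.0.0.1:9999/v1/chat/completions", json=payload)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    # Assumes the response carries the usual OpenAI-compatible "usage" block.
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    print(f"{measure('Summarize the plot of Moby-Dick.'):.1f} tok/s")
```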