mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Speculative decoding with 2 additional models #2801

Closed bethalianovike closed 2 months ago

bethalianovike commented 2 months ago

πŸ› Bug

❓ General Questions

Based on https://llm.mlc.ai/docs/deploy/rest.html#id5, we can use more than one additional model when running in speculative decoding mode. But when I request a completion via a REST API POST, I get the following error message.

[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so
INFO:     Started server process [1353236]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO:     127.0.0.1:39658 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
    background_engine_->Step();
              ^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/cpp/serve/engine.cc", line 629, in mlc::llm::serve::EngineImpl::Step()
    CHECK(request_stream_callback_ != nullptr)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
tvm.error.InternalError: Traceback (most recent call last):
  1: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /home/mlc-llm/cpp/serve/threaded_engine.cc:182
  0: mlc::llm::serve::EngineImpl::Step()
        at /home/mlc-llm/cpp/serve/engine.cc:629
  File "/home/mlc-llm/cpp/serve/engine.cc", line 640
InternalError: Check failed: (estate_->running_queue.empty()) is false: Internal assumption violated: It is expected that an engine step takes at least one action (e.g. prefill, decode, etc.) but it does not.

To Reproduce

Steps to reproduce the behavior:

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
    "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so" \
  --mode "server" --speculative-mode "small_draft" --port 8001
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model":  "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16",
        "additional-models": ["/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1", "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16"],
        "messages": [
            {"role": "user", "content": "What is Alaska famous of? Please elaborate in detail."}
        ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions

Expected behavior

Generate the response.

Environment

Additional context

MasterJH5574 commented 2 months ago

Thank you @bethalianovike for reporting. Though the interface supports passing in multiple additional models, we only support one additional model for spec decoding right now. We will update the documentation to avoid this confusion.
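For reference, a single-draft-model launch adapted from the reproduction command above (same model paths, with the second TinyLlama entry dropped) should match what is currently supported; this is a sketch of that configuration rather than an officially documented example:

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
  --mode "server" --speculative-mode "small_draft" --port 8001

The chat completion request can then name only the base model in the "model" field; since the draft model is selected at serve time, it should not need to appear in the request body.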

MasterJH5574 commented 2 months ago

Updated the docs in #2841. Support for multiple additional models is planned as a future feature.