mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Speculative decoding with 2 additional models #2801

Closed bethalianovike closed 2 months ago

bethalianovike commented 2 months ago

πŸ› Bug

❓ General Questions

Based on https://llm.mlc.ai/docs/deploy/rest.html#id5, we can use more than one additional model when running in speculative decoding mode. But when I request a completion via a REST API POST, I get the following error message.

[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so
INFO:     Started server process [1353236]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO:     127.0.0.1:39658 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
    background_engine_->Step();
              ^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/cpp/serve/engine.cc", line 629, in mlc::llm::serve::EngineImpl::Step()
    CHECK(request_stream_callback_ != nullptr)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
tvm.error.InternalError: Traceback (most recent call last):
  1: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /home/mlc-llm/cpp/serve/threaded_engine.cc:182
  0: mlc::llm::serve::EngineImpl::Step()
        at /home/mlc-llm/cpp/serve/engine.cc:629
  File "/home/mlc-llm/cpp/serve/engine.cc", line 640
InternalError: Check failed: (estate_->running_queue.empty()) is false: Internal assumption violated: It is expected that an engine step takes at least one action (e.g. prefill, decode, etc.) but it does not.

To Reproduce

Steps to reproduce the behavior:

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
    "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so" \
  --mode "server" --speculative-mode "small_draft" --port 8001
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model":  "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16",
        "additional-models": ["/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1", "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16"],
        "messages": [
            {"role": "user", "content": "What is Alaska famous of? Please elaborate in detail."}
        ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions

Expected behavior

Generate the response.

Environment

Additional context

MasterJH5574 commented 2 months ago

Thank you @bethalianovike for reporting. Though the interface supports passing in multiple additional models, we only support one additional model for spec decoding right now. We will update the documentation to avoid this confusion.
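For reference, a single-draft-model launch adapted from the reproduction command above (same model paths, with the second TinyLlama entry dropped) should match what is currently supported; this is a sketch of that configuration rather than an officially documented example:

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
  --mode "server" --speculative-mode "small_draft" --port 8001

The chat completion request can then name only the base model in the "model" field; since the draft model is selected at serve time, it should not need to appear in the request body.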

MasterJH5574 commented 2 months ago

Updated the docs in #2841. Support for multiple additional models is planned as a future feature.