Hi @0xDEADFED5, sorry for the late reply. Speculative decoding works with two models, so only changing `--speculative-mode` to `small_draft` won't work. Thanks for bringing this up, and we'll improve the error message to avoid the confusion here.
Here's an example command you could use to enable speculative decoding, which uses the 4-bit quantized Llama 3 8B model as the draft model to speculate for the unquantized 8B model:
```bash
mlc_llm serve "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC" \
  --additional-models "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" \
  --speculative-mode "small_draft"
```
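Once the server is up, you can sanity-check it through the OpenAI-compatible REST API that `mlc_llm serve` exposes. A minimal sketch, assuming the default host and port (`127.0.0.1:8000`) and that the served model id matches the model passed on the command line; check the server logs or `/v1/models` for the exact id on your setup:

```bash
# Smoke test against the OpenAI-compatible chat completions endpoint.
# Host, port, and the "model" value below are assumptions based on the
# defaults; adjust them to match what your server actually reports.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```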
Interesting! Thanks for the reply.
I'm using the nightly wheels. I can serve just fine with `--speculative-mode disable`, but all the other options give me this:

Does `--speculative-mode` have other requirements? OS: Windows 11, HW: Intel Arc A770. Thanks for the great project, btw.