Hi @0xDEADFED5, sorry for the late reply. Speculative decoding works with two models, so only changing `--speculative-mode` to `small_draft` won't work. Thanks for bringing this up, and we'll improve the error message to avoid the confusion here.
Here's an example command you could use to enable speculative decoding, which uses the 4-bit quantized Llama 3 8B model as the draft model to speculate for the unquantized 8B model:
```bash
mlc_llm serve "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC" \
  --additional-models "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" \
  --speculative-mode "small_draft"
```
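Once the server is up, you can sanity-check it through the OpenAI-compatible REST API that `mlc_llm serve` exposes. A minimal sketch, assuming the default host and port (`127.0.0.1:8000`) and that the served model id matches the model passed on the command line; check the server logs or `/v1/models` for the exact id on your setup:

```bash
# Smoke test against the OpenAI-compatible chat completions endpoint.
# Host, port, and the "model" value below are assumptions based on the
# defaults; adjust them to match what your server actually reports.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://mlc-ai/Llama-3-8B-Instruct-q0f16-MLC",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```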
Interesting! Thanks for the reply.
I'm using the nightly wheels. I can serve just fine with `--speculative-mode disable`, but all the other options give me this:

Does `--speculative-mode` have other requirements? OS: Windows 11, HW: Intel Arc A770. Thanks for the great project, btw.