mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] instructions for speculative decoding usage #2973

Closed · zhyncs closed this issue 1 month ago

zhyncs commented 1 month ago

❓ General Questions

@MasterJH5574 @vinx13 Hi MLC LLM developers, I see that several speculative decoding algorithms, such as a small draft model, Medusa, and EAGLE, have been implemented in MLC LLM. If I want to use them now, are there corresponding instructions? I couldn't find the relevant section at https://llm.mlc.ai/docs/. Looking forward to your reply, thanks!

vinx13 commented 1 month ago

You can build the two models separately. For EAGLE or Medusa, download the corresponding EAGLE or Medusa head and specify --model-type eagle / --model-type medusa when building that model. After that, pass the --speculative-mode option when starting the server.
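For reference, a minimal sketch of that build step for an EAGLE head, assuming the standard mlc_llm convert_weight / gen_config / compile workflow; the local weight path, quantization, conversation template, and output names below are placeholders, not values taken from this thread:

# Convert the downloaded EAGLE head weights (path/to/EAGLE-llama-3-8b is a placeholder)
mlc_llm convert_weight path/to/EAGLE-llama-3-8b/ --model-type eagle --quantization q0f16 -o dist/eagle-llama-3-8b-f16
# Generate the chat config for the draft head
mlc_llm gen_config path/to/EAGLE-llama-3-8b/ --model-type eagle --quantization q0f16 --conv-template llama-3 -o dist/eagle-llama-3-8b-f16
# Compile the model library for the target device
mlc_llm compile dist/eagle-llama-3-8b-f16/mlc-chat-config.json --device cuda -o dist/eagle-llama-3-8b-f16/eagle-llama-3-8b-f16.so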

For example, serving with a separate small draft model:

python3 -m mlc_llm serve dist/llama-3-8b-chat-hf-f16-1gpu/ \
  --mode server \
  --model-lib dist/llama-3-8b-chat-hf-f16-1gpu/llama-3-8b-chat-hf-f16.so \
  --additional-models dist/llama-3-1b-chat-hf-f16-1gpu,dist/llama-3-1b-chat-hf-f16-1gpu/llama-3-1b-chat-hf.so \
  --speculative-mode small_draft \
  --overrides "spec_draft_length=1"
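For EAGLE or Medusa, the serve command has the same shape; as a sketch (reusing the placeholder output names from the build step above), swap in the compiled head and set the matching mode:

python3 -m mlc_llm serve dist/llama-3-8b-chat-hf-f16-1gpu/ \
  --mode server \
  --model-lib dist/llama-3-8b-chat-hf-f16-1gpu/llama-3-8b-chat-hf-f16.so \
  --additional-models dist/eagle-llama-3-8b-f16,dist/eagle-llama-3-8b-f16/eagle-llama-3-8b-f16.so \
  --speculative-mode eagle

Speculative decoding is transparent to clients: once the server is up (by default at 127.0.0.1:8000), requests go through the usual OpenAI-compatible endpoint unchanged. A sketch, assuming the "model" field mirrors the path the server was started with:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dist/llama-3-8b-chat-hf-f16-1gpu/", "messages": [{"role": "user", "content": "Hello!"}]}'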