❓ General Questions

@MasterJH5574 @vinx13 Hi MLC LLM developers, I see that several speculative decoding algorithms, such as small draft model, Medusa, and EAGLE, have been implemented in MLC LLM. If I want to use them now, are there any corresponding instructions? I couldn't find a relevant section at https://llm.mlc.ai/docs/. Looking forward to your reply, thanks!

**Answer:**
You can build the two models separately. For EAGLE or Medusa, download the corresponding EAGLE or Medusa head and specify `--model-type eagle` or `--model-type medusa` when building the model, as in the sketch below.
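For concreteness, here is a minimal sketch of that build step using the standard `mlc_llm convert_weight` / `gen_config` / `compile` flow. The head checkpoint path, output directory, quantization, and conv template are placeholder assumptions; only `--model-type eagle` comes from the answer itself:

```shell
# Hypothetical paths and quantization; --model-type eagle is the key flag.
mlc_llm convert_weight /path/to/EAGLE-head-checkpoint \
    --model-type eagle \
    --quantization q0f16 \
    -o dist/eagle-llama-3-8b-f16

mlc_llm gen_config /path/to/EAGLE-head-checkpoint \
    --model-type eagle \
    --quantization q0f16 \
    --conv-template llama-3 \
    -o dist/eagle-llama-3-8b-f16

mlc_llm compile dist/eagle-llama-3-8b-f16/mlc-chat-config.json \
    --device cuda \
    -o dist/eagle-llama-3-8b-f16/eagle-llama-3-8b-f16.so
```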
After that, pass the `--speculative-mode` parameter when starting the server, e.g.:
python3 -m mlc_llm serve dist/llama-3-8b-chat-hf-f16-1gpu/ \
    --mode server \
    --model-lib dist/llama-3-8b-chat-hf-f16-1gpu/llama-3-8b-chat-hf-f16.so \
    --additional-models dist/llama-3-1b-chat-hf-f16-1gpu,dist/llama-3-1b-chat-hf-f16-1gpu/llama-3-1b-chat-hf.so \
    --speculative-mode small_draft \
    --overrides "spec_draft_length=1"
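The command above is the small-draft-model case (`--speculative-mode small_draft`). For EAGLE or Medusa, the invocation should be analogous: pass the compiled head via `--additional-models` and switch the mode. A sketch, assuming `eagle` (and likewise `medusa`) are accepted mode names by analogy with `small_draft`, reusing the hypothetical paths from the build sketch above:

```shell
# Hypothetical paths; assumes --speculative-mode accepts "eagle" / "medusa".
python3 -m mlc_llm serve dist/llama-3-8b-chat-hf-f16-1gpu/ \
    --mode server \
    --model-lib dist/llama-3-8b-chat-hf-f16-1gpu/llama-3-8b-chat-hf-f16.so \
    --additional-models dist/eagle-llama-3-8b-f16,dist/eagle-llama-3-8b-f16/eagle-llama-3-8b-f16.so \
    --speculative-mode eagle
```

Speculative decoding is transparent to clients: once the server is up, you query the usual OpenAI-compatible endpoint. A quick check, assuming the default `127.0.0.1:8000` and using the served model's path as the model name:

```shell
curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "dist/llama-3-8b-chat-hf-f16-1gpu",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
```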