mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

Speculative mode for the LLaMA 3.1 70B model #3022

Closed · shahizat closed this issue 3 days ago

shahizat commented 2 weeks ago

Greetings to all, I need your advice. I've been experimenting with the eagle and medusa speculative decoding modes to increase throughput during the decoding phase, trying out different combinations of models. Could you please suggest which draft (assistant) model to choose for the LLaMA 3.1 70B model, given a VRAM limit of 48 GB (24 GB per GPU) in tensor parallelism mode?

Thanks in advance for your help.

shahizat commented 1 week ago

I'm also interested in how to solve the problem of draft and target models having different vocabulary sizes. Does MLC have a parameter similar to vLLM's 'speculative-draft-tensor-parallel-size'?

MasterJH5574 commented 1 week ago

Hi @shahizat, thanks for the question! Here is a related issue covering the eagle and medusa speculative decoding modes: https://github.com/mlc-ai/mlc-llm/issues/2973#issuecomment-2408259291. You can use https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B as the eagle model for the 70B model; I'm not yet aware of a corresponding medusa model.
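
For reference, here is a minimal sketch of wiring up eagle mode through the Python API. The `EngineConfig` fields below (`additional_models`, `speculative_mode`, `spec_draft_length`) reflect my reading of the current interface, and the model paths are placeholders for your own compiled MLC artifacts:

```python
# Minimal sketch of eagle-mode speculative decoding via the MLC Python API.
# EngineConfig field names are my understanding of the current interface;
# model paths below are placeholders for your own compiled artifacts.
from mlc_llm import MLCEngine
from mlc_llm.serve import EngineConfig

target_model = "dist/Llama-3.1-70B-Instruct-q4f16_1-MLC"    # placeholder path
eagle_model = "dist/EAGLE-LLaMA3-Instruct-70B-q4f16_1-MLC"  # placeholder path

engine = MLCEngine(
    target_model,
    engine_config=EngineConfig(
        additional_models=[eagle_model],  # the eagle head serves as the draft model
        speculative_mode="eagle",
        spec_draft_length=4,              # tokens proposed per speculation step
    ),
)

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
engine.terminate()
```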

> I'm also interested in how to solve the problem of draft and target models having different vocabulary sizes.

In my understanding, speculative decoding requires both models to share the same vocabulary. Please correct me if I'm wrong, but MLC currently requires the draft and target models to have the same vocabulary size.
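
If it helps, a quick way to check whether a candidate draft/target pair shares a vocabulary size is to compare their Hugging Face configs. This sketch uses `transformers` (not part of MLC) and assumes both repos expose a standard config.json; the 70B repo may require gated access:

```python
# Sketch: compare vocab sizes of a target/draft pair via their HF configs.
# Requires `pip install transformers`; repo IDs are from the thread above.
from transformers import AutoConfig

target_cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft_cfg = AutoConfig.from_pretrained("yuhuili/EAGLE-LLaMA3-Instruct-70B")

print("target vocab:", target_cfg.vocab_size)
print("draft vocab:", draft_cfg.vocab_size)
assert target_cfg.vocab_size == draft_cfg.vocab_size, "vocabulary sizes differ"
```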

> Does MLC have a parameter similar to vLLM's 'speculative-draft-tensor-parallel-size'?

Thanks! We don't yet support specifying different TP degrees for the draft and target models, so for now both need to use the same TP degree (2 in your setup). This is a great feature, and we will follow up with support for it.
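
In the meantime, a simple sanity check is to confirm both compiled models were built with the same TP degree by comparing the `tensor_parallel_shards` field in each model's mlc-chat-config.json. A sketch, with placeholder directory names:

```python
# Sketch: confirm the draft and target were compiled with the same TP degree
# by reading tensor_parallel_shards from each model's mlc-chat-config.json.
# Directory names are placeholders for your compiled model folders.
import json
from pathlib import Path

def tp_shards(model_dir: str) -> int:
    """Read tensor_parallel_shards from a compiled model's chat config."""
    cfg = json.loads((Path(model_dir) / "mlc-chat-config.json").read_text())
    return cfg.get("tensor_parallel_shards", 1)

target_tp = tp_shards("dist/Llama-3.1-70B-Instruct-q4f16_1-MLC")
draft_tp = tp_shards("dist/EAGLE-LLaMA3-Instruct-70B-q4f16_1-MLC")
assert target_tp == draft_tp == 2, f"TP mismatch: target={target_tp}, draft={draft_tp}"
```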