mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Feature Request] Medusa support #2319

Closed. EmilioZhao closed this issue 1 month ago.

EmilioZhao commented 2 months ago

šŸš€ Feature

Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices. See: https://github.com/FasterDecoding/Medusa/tree/main. Medusa adds extra "heads" to an LLM to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training.
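For context, here is a minimal PyTorch-style sketch of that idea. The residual-block-plus-linear structure loosely follows the Medusa paper, but the class names and shapes here are illustrative, not the FasterDecoding/Medusa code:

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """One residual MLP block, roughly one per Medusa head (illustrative)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))


class MedusaHeads(nn.Module):
    """K extra heads on top of the frozen base LM. Head k reads the base
    model's final hidden state and predicts the token k+1 positions ahead,
    so one forward pass proposes several future tokens at once. Only these
    heads are trained; the base model's weights stay untouched."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                ResBlock(hidden_size),
                nn.Linear(hidden_size, vocab_size, bias=False),
            )
            for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: [batch, hidden_size] from the base model's final layer.
        # Returns num_heads logit tensors of shape [batch, vocab_size],
        # one per future token offset.
        return [head(last_hidden) for head in self.heads]
```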

Motivation

Medusa is an excellent solution for speeding up LLM decoding by 2.2~3.6x without affecting the original model's output quality. It addresses known problems of standard speculative decoding, such as the need for a good draft model, overall system complexity, and inefficiency when using sampling-based generation.

TVM and MLC-LLM aim to deploy models everywhere, especially on mobile devices, which demands excellent memory management and cost-efficient inference under extremely limited resources. Implementing such a speedup monster would therefore greatly enhance the impact and visibility of MLC-LLM.

Alternatives

Speculative decoding with a draft model (e.g., EAGLE), but that approach requires careful training of the draft model to perform well.

Additional context

We've tried to implement it on MLC-LLM ourselves, but found it rather difficult to implement the "tree-based attention" and the KV-cache update within MLC-LLM's current, complicated code structure. We therefore turn to the community.
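For readers following along: the "tree-based attention" is essentially ordinary attention with a custom mask over the speculated candidate tree. Each candidate token may attend to the committed prefix and to its own ancestors, but never to tokens from competing sibling branches. A minimal sketch of the mask construction, assuming candidates arrive as a parent-pointer array (the encoding and function name are illustrative, not MLC-LLM's or Medusa's API):

```python
import torch


def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build the boolean attention mask for a candidate tree.

    parents[i] is the index of node i's parent, or -1 for a root that
    hangs directly off the committed prefix. Node i may attend only to
    itself and its ancestors; competing branches stay invisible to
    each other.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking self and ancestors
            mask[i, j] = True
            j = parents[j]
    return mask


# Example: two competing 2-token continuations with no shared prefix.
# Nodes 0 and 2 are roots; node 1 is a child of 0; node 3 is a child of 2.
print(build_tree_attention_mask([-1, 0, -1, 2]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [0, 0, 1, 0],
#         [0, 0, 1, 1]])
```

The KV-cache update is the complementary difficulty: all tree nodes write KV entries during the verification pass, but afterwards only the entries along the single accepted branch may be kept, and entries from rejected branches have to be rolled back or compacted, which is why tree decoding also needs dedicated KV-cache kernel support.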

jpf888 commented 2 months ago

+1

vinx13 commented 2 months ago

Iā€™m working on this and will hopefully upstream it next week. We have eagle spec decoding now without tree decoding. To support tree decoding we also need the kernel support

jpf888 commented 2 months ago

Iā€™m working on this and will hopefully upstream it next week. We have eagle spec decoding now without tree decoding. To support tree decoding we also need the kernel support

hi @vinx13

Will the tree decoding kernel be released next week or will it take longer?

EmilioZhao commented 2 months ago

Iā€™m working on this and will hopefully upstream it next week. We have eagle spec decoding now without tree decoding. To support tree decoding we also need the kernel support

Glad to hear that, @vinx13, and thanks a bunch for your quick reply! Looking forward to seeing your pull request. Are you also working on the tree-based attention?

vinx13 commented 2 months ago

Initial support for Medusa was added in #2337; tree decoding is not yet supported as more work is required.

EmilioZhao commented 2 months ago

Thanks a lot! We'll try Medusa list decoding first.

josephrocca commented 1 month ago

Hi @vinx13, I'm wondering whether you've done any tests with your Medusa and EAGLE implementations to gauge the expected performance improvement? If anyone else here has done tests, I'd love to hear about any results, especially for larger models.

This leaderboard indicates that EAGLE currently gives the biggest speed boost, but there are some newer (though not necessarily better) approaches which aren't listed there.

MasterJH5574 commented 1 month ago

Gonna close this issue as the initial Medusa support is there. Please open new issues for further questions :-)