Closed EmilioZhao closed 5 months ago
+1
I'm working on this and will hopefully upstream it next week. We have EAGLE spec decoding now without tree decoding. To support tree decoding we also need kernel support.
> I'm working on this and will hopefully upstream it next week. We have EAGLE spec decoding now without tree decoding. To support tree decoding we also need kernel support.
Hi @vinx13,
Will the tree decoding kernel be released next week, or will it take longer?
> I'm working on this and will hopefully upstream it next week. We have EAGLE spec decoding now without tree decoding. To support tree decoding we also need kernel support.
Glad to hear that, @vinx13, and thanks a bunch for your quick reply! Looking forward to seeing your pull request. Are you also working on the tree-based attention?
Initial support for Medusa was added in #2337; tree decoding is not yet supported as more work is required.
Thanks a lot! We'll try Medusa list decoding first.
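For context, "list decoding" here presumably means verifying a single candidate chain (the top-1 token from each Medusa head) without tree attention. Below is a hedged sketch of greedy chain verification under that assumption; the names `verify_chain`, `candidate`, and `base_argmax` are hypothetical and not part of MLC-LLM.

```python
def verify_chain(candidate: list[int], base_argmax: list[int]) -> list[int]:
    """Greedy verification of one Medusa candidate chain (no tree).

    candidate[k]: token proposed by Medusa head k.
    base_argmax[k]: the base model's greedy next token at that position,
    obtained by scoring the whole chain in one forward pass
    (len(base_argmax) == len(candidate) + 1, including the "bonus" token).

    Accept candidates while they agree with the base model, then always
    append one token from the base model itself, so at least one token is
    produced per verification step.
    """
    accepted = []
    k = 0
    while k < len(candidate) and candidate[k] == base_argmax[k]:
        accepted.append(candidate[k])
        k += 1
    accepted.append(base_argmax[k])  # base-model token: mismatch fix or bonus token
    return accepted

# Example: the first two candidates are accepted, the third is replaced.
print(verify_chain([5, 7, 9], [5, 7, 2, 8]))  # -> [5, 7, 2]
```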
Hi @vinx13, I'm wondering whether you've done any tests with your Medusa and EAGLE implementations to gauge the expected performance improvement. If anyone else here has done tests, I'd love to hear about the results, especially for larger models.
This leaderboard indicates that EAGLE currently gives the biggest speed boost, but there are some newer (though not necessarily better) approaches which aren't listed there.
Gonna close this issue as the initial Medusa support is there. Please open new issues for further questions :-)
🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices. Reference: https://github.com/FasterDecoding/Medusa/tree/main. Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training.
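For illustration, here is a minimal PyTorch sketch of the head structure described above: small residual heads attached to the frozen base model's last hidden state, where head k predicts the token k+1 steps ahead. The class and parameter names (`ResBlock`, `MedusaHeads`, `num_heads`, ...) are assumptions for this sketch, not MLC-LLM or Medusa APIs.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block used inside each Medusa head."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """Extra decoding heads: head k predicts the token at position t+k+1
    from the base model's last hidden state at position t. The base model
    stays frozen; only these heads are fine-tuned."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(ResBlock(hidden_size),
                          nn.Linear(hidden_size, vocab_size, bias=False))
            for _ in range(num_heads)
        ])

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: [batch, hidden_size] -> logits: [num_heads, batch, vocab]
        return torch.stack([head(hidden_state) for head in self.heads], dim=0)
```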
Motivation
Medusa is an excellent solution for speeding up LLM decoding by roughly 2.2~3.6x without affecting the original model's performance. It addresses the problems of conventional speculative decoding, such as the need for a good draft model, overall system complexity, and inefficiency when using sampling-based generation.
TVM and MLC-LLM aim to deploy models everywhere, especially on mobile devices, which requires excellent memory management and cost-efficient inference under extremely limited resources. Therefore, implementing such a speedup monster would greatly enhance the impact and visibility of MLC-LLM.
Alternatives
Speculative decoding with a draft model, such as EAGLE, but that requires careful training of the draft model to perform well.
Additional context
We've tried to implement it in MLC-LLM but found it rather difficult to build the "tree-based attention" and the KV-cache update logic on top of MLC-LLM's current, fairly involved code structure. Therefore, we turn to the community.
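As a rough illustration of what the tree-based attention part involves, here is a hedged sketch of how an attention mask over a Medusa-style candidate tree can be built from parent indices. This is illustrative only, assumes the tree is given as a flat `parents` array, and is not MLC-LLM's actual attention or KV-cache implementation.

```python
import numpy as np

def build_tree_attention_mask(parents: list[int]) -> np.ndarray:
    """Return an [n, n] boolean mask where mask[i, j] is True iff candidate
    token i may attend to candidate token j, i.e. j lies on i's path to the
    root of the candidate tree. parents[i] is the parent index of node i
    (-1 for the root, which attends only to itself and the committed prefix).
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking all ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Example tree: root 0 with children 1 and 2; node 3 is a child of node 1.
print(build_tree_attention_mask([-1, 0, 0, 1]).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 0 1 0]
#  [1 1 0 1]]
```

In an actual engine this mask would be combined with causal attention over the already-committed prefix, and the KV cache entries of rejected branches would be rolled back after verification, which is the part we found hard to express in the current code base.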