Open · MrRace opened 1 month ago
🚀 Feature
Please add Lookahead Decoding to mlc-llm in C++; we need it to speed up LLM decoding on mobile devices. Reference: https://github.com/hao-ai-lab/LookaheadDecoding
Motivation
Lookahead decoding provides a substantial latency reduction, ranging from 1.5x to 2.3x, with negligible computational overhead.
TVM and MLC-LLM aim to deploy models widely, particularly on mobile devices, which requires superb memory management and cost-effective inference under highly constrained resources. Implementing such a speedup would significantly boost the influence and prominence of MLC-LLM.
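For context, here is a heavily simplified Python sketch of lookahead decoding's core guess-and-verify loop. This is not the actual LookaheadDecoding implementation nor any mlc-llm API; `next_token` is a toy stand-in for one greedy LM step, and the real algorithm generates guess n-grams via parallel Jacobi iterations and verifies all candidates inside a single batched forward pass, which is where the speedup comes from.

```python
# Heavily simplified sketch of lookahead decoding (illustrative only).
# Assumption: `next_token` stands in for one greedy LM decoding step.
# The real algorithm batches guess generation and verification into a
# single forward pass; this toy version loops sequentially for clarity.
from collections import defaultdict

N = 4          # n-gram size: anchor token + (N - 1) guess tokens
POOL_CAP = 8   # max cached n-grams per anchor token


def next_token(ids):
    """Toy deterministic 'model': replace with a real LM's greedy argmax."""
    return (sum(ids) * 31 + len(ids)) % 1000


def lookahead_decode(prompt, max_new):
    ids = list(prompt)
    pool = defaultdict(list)   # anchor token -> candidate (N-1)-gram guesses
    produced = 0
    while produced < max_new:
        # Verification branch: test pooled n-grams anchored on the last
        # accepted token; keep the longest verified run of guess tokens.
        best = []
        for gram in pool[ids[-1]]:
            accepted, ctx = [], list(ids)
            for guess in gram:
                tok = next_token(ctx)
                if tok != guess:
                    break
                accepted.append(tok)
                ctx.append(tok)
            if len(accepted) > len(best):
                best = accepted
        if not best:
            best = [next_token(ids)]   # fall back to ordinary decoding
        ids.extend(best)
        produced += len(best)
        # Lookahead branch (simplified): harvest a fresh n-gram for the
        # pool. The real method derives guesses from cheap parallel Jacobi
        # updates in the same forward pass, not a sequential rollout.
        ctx, gram = list(ids), []
        for _ in range(N - 1):
            tok = next_token(ctx)
            gram.append(tok)
            ctx.append(tok)
        bucket = pool[ids[-1]]
        if tuple(gram) not in bucket:
            bucket.append(tuple(gram))
            del bucket[:-POOL_CAP]     # evict oldest entries past the cap
    return ids[:len(prompt) + max_new]


print(lookahead_decode([1, 2, 3], max_new=16))
```

Because verification only accepts tokens that match the model's own greedy choices, the output is identical to standard greedy decoding; the win is fewer sequential decoding steps, with no extra draft model in memory.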
Alternatives
Speculative decoding with a draft model: memory on mobile devices is extremely limited and cannot accommodate an additional draft model.

Likely self-speculating models like EAGLE would help in this case.

@tqchen How can we use EAGLE inference acceleration on an Android phone with MLC-LLM? Thanks a lot!

@tqchen When using EAGLE, it seems we need to train a draft model.