Open · MrRace opened 1 month ago
🚀 Feature
Please add Lookahead Decoding to mlc-llm in C++; we need it to speed up LLM decoding on mobile devices. Reference: https://github.com/hao-ai-lab/LookaheadDecoding
Motivation
Lookahead decoding provides a substantial latency reduction, ranging from 1.5x to 2.3x, with negligible computational overhead.
TVM and MLC-LLM aim to deploy models widely, particularly on mobile devices, which requires superb memory management and cost-effective inference under highly constrained resources. Implementing such a speedup would significantly boost the influence and prominence of MLC-LLM.
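For context, here is a heavily simplified Python sketch of lookahead decoding's core guess-and-verify loop. This is not the actual LookaheadDecoding implementation nor any mlc-llm API; `next_token` is a toy stand-in for one greedy LM step, and the real algorithm generates guess n-grams via parallel Jacobi iterations and verifies all candidates inside a single batched forward pass, which is where the speedup comes from.

```python
# Heavily simplified sketch of lookahead decoding (illustrative only).
# Assumption: `next_token` stands in for one greedy LM decoding step.
# The real algorithm batches guess generation and verification into a
# single forward pass; this toy version loops sequentially for clarity.
from collections import defaultdict

N = 4          # n-gram size: anchor token + (N - 1) guess tokens
POOL_CAP = 8   # max cached n-grams per anchor token


def next_token(ids):
    """Toy deterministic 'model': replace with a real LM's greedy argmax."""
    return (sum(ids) * 31 + len(ids)) % 1000


def lookahead_decode(prompt, max_new):
    ids = list(prompt)
    pool = defaultdict(list)   # anchor token -> candidate (N-1)-gram guesses
    produced = 0
    while produced < max_new:
        # Verification branch: test pooled n-grams anchored on the last
        # accepted token; keep the longest verified run of guess tokens.
        best = []
        for gram in pool[ids[-1]]:
            accepted, ctx = [], list(ids)
            for guess in gram:
                tok = next_token(ctx)
                if tok != guess:
                    break
                accepted.append(tok)
                ctx.append(tok)
            if len(accepted) > len(best):
                best = accepted
        if not best:
            best = [next_token(ids)]   # fall back to ordinary decoding
        ids.extend(best)
        produced += len(best)
        # Lookahead branch (simplified): harvest a fresh n-gram for the
        # pool. The real method derives guesses from cheap parallel Jacobi
        # updates in the same forward pass, not a sequential rollout.
        ctx, gram = list(ids), []
        for _ in range(N - 1):
            tok = next_token(ctx)
            gram.append(tok)
            ctx.append(tok)
        bucket = pool[ids[-1]]
        if tuple(gram) not in bucket:
            bucket.append(tuple(gram))
            del bucket[:-POOL_CAP]     # evict oldest entries past the cap
    return ids[:len(prompt) + max_new]


print(lookahead_decode([1, 2, 3], max_new=16))
```

Because verification only accepts tokens that match the model's own greedy choices, the output is identical to standard greedy decoding; the win is fewer sequential decoding steps, with no extra draft model in memory.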
Alternatives
Speculative decoding with a draft model: memory on mobile devices is extremely limited and cannot accommodate an additional draft model.

Likely self-speculating models like EAGLE would help in this case.

@tqchen How can we use EAGLE inference acceleration on an Android phone with MLC-LLM? Thanks a lot!

@tqchen When using EAGLE, it seems we need to train a draft model.