First of all, in the "Decoding Algorithms" branch of Inference, most decoding algorithms target efficiency in specific scenarios.
Based on this, we can have three sub-branches for "Decoding Algorithms" according to the scenario:

1. Long-context scenarios, e.g., Streaming LLM and Infinite LLM.
2. Structured-interaction scenarios, e.g., my latest work DeFT targets tree-based decoding efficiency, as does SGLang from Berkeley with its efficient memory management for tree-based decoding. This could be extended to graphs in multi-agent scenarios.
3. Non-autoregressive / parallel / speculative decoding, which generates more than one token per decoding step; there are dozens of papers in this line. (A toy sketch of the speculative variant follows below.)
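To make sub-branch 3 concrete, here is a minimal, self-contained sketch of the draft-then-verify loop behind speculative decoding. It is a toy illustration only: `draft_propose` and `target_accepts` are hypothetical stand-ins for a small draft model and the large target model, not any real API, and the acceptance rule here is a random placeholder rather than the actual rejection-sampling criterion.

```python
# Toy sketch of speculative decoding (sub-branch 3): a cheap draft model
# proposes k tokens per step and the expensive target model verifies them,
# so each step can emit up to k + 1 tokens instead of exactly one.
# All model calls below are hypothetical stubs, not a real library API.
import random

def draft_propose(prefix, k):
    # Stand-in for the small draft model: propose k candidate token ids.
    return [random.randrange(100) for _ in range(k)]

def target_accepts(prefix, token):
    # Stand-in for target-model verification of one draft token; real
    # speculative decoding uses a rejection-sampling acceptance test here.
    return random.random() < 0.7

def speculative_step(prefix, k=4):
    # Accept the longest verified prefix of the draft, then let the
    # target model contribute one token of its own, guaranteeing that
    # every step makes progress even if all draft tokens are rejected.
    draft = draft_propose(prefix, k)
    accepted = []
    for tok in draft:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break
    accepted.append(random.randrange(100))  # target model's own token
    return prefix + accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```

The structural point is that verifying all k draft tokens can be batched into a single target-model forward pass, which is where the speedup over one-token-per-step autoregressive decoding comes from.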
Hi @Monstertail,
Thanks for your suggestion on the sub-branches for "decoding algorithm". In fact, the paper covers most, if not all, of the works you suggested, although not all of them are listed here due to space constraints (and for clarity). But I do think your suggested classification makes sense, and I plan to incorporate it into the next version. Thanks for the nice feedback.
Best, Jingyu
I notice that the classification of LLM inference is somewhat coarse-grained, so I am opening this issue to keep collecting suggestions.