lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Is there a plan to handle the inference slowness? e.g. KV Cache #170

Closed · liuzhuang1024 closed this 1 year ago

lucidrains commented 1 year ago

oh yup, can add this

was going to play around with speculative and contrastive decoding soon too
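(For readers landing on this thread: below is a minimal sketch of what per-step KV caching looks like in an attention layer. It is illustrative only, not the repository's implementation; the projection callables `to_q` / `to_k` / `to_v` and the single-head simplification are assumptions.)

```python
import torch

def cached_attention_step(x_new, to_q, to_k, to_v, cache = None):
    """
    x_new          : (batch, 1, dim) embedding of the newly decoded position
    to_q/to_k/to_v : the layer's query / key / value projections (e.g. nn.Linear)
    cache          : optional (past_k, past_v), each (batch, seq_so_far, dim)
    """
    # only the new token is projected; past keys / values are reused as-is
    q, k, v = to_q(x_new), to_k(x_new), to_v(x_new)

    if cache is not None:
        past_k, past_v = cache
        k = torch.cat((past_k, k), dim = 1)
        v = torch.cat((past_v, v), dim = 1)

    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-1, -2)) * scale   # (batch, 1, seq_so_far + 1)
    out = attn.softmax(dim = -1) @ v           # (batch, 1, dim)

    # hand the grown cache back to the caller for the next decoding step
    return out, (k, v)
```

Since the query is only the last position, no causal mask is needed; each step costs attention over the cached prefix rather than recomputing the whole sequence.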

lucidrains commented 1 year ago

unless you want to give it a shot with a PR

liuzhuang1024 commented 1 year ago

> unless you want to give it a shot with a PR

Maybe, when I have free time.

lucidrains commented 1 year ago

k no prob, should be able to get this done by this week's end, and play around with speculative decoding too
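(For context on the speculative decoding mentioned above, here is a rough sketch of the accept/reject rule from Leviathan et al. 2023: a small draft model proposes a few tokens cheaply, and the large target model verifies them in a single forward pass. This is illustrative only, not what ends up in the library; the final bonus-token sample on full acceptance is omitted.)

```python
import torch

@torch.no_grad()
def speculative_accept(target_probs, draft_probs, draft_tokens):
    """
    One accept / reject pass of speculative decoding (Leviathan et al., 2023).

    target_probs : (k, vocab) next-token distributions from the large target model
    draft_probs  : (k, vocab) distributions the small draft model sampled from
    draft_tokens : (k,) tokens proposed by the draft model
    Returns the tokens kept this round.
    """
    kept = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]

        # accept the drafted token with probability min(1, p / q)
        if torch.rand(()) < torch.clamp(p / q, max = 1.):
            kept.append(tok)
            continue

        # on rejection, resample from the residual distribution max(0, p - q) and stop
        residual = torch.clamp(target_probs[i] - draft_probs[i], min = 0.)
        residual = residual / residual.sum().clamp(min = 1e-10)
        kept.append(torch.multinomial(residual, 1).item())
        break

    return kept
```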

lucidrains commented 1 year ago

@liuzhuang1024 hey, started playing around with spec decoding, and decided to circle back to this issue

https://github.com/lucidrains/x-transformers/commit/87a0f13d7730869d1e3b4af384f6813e08fd0021

let me know if it works ok
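(Hedged usage sketch: this is just the standard README-style generation call. Assuming the commit above wires the KV cache into `AutoregressiveWrapper.generate`, no change to calling code should be needed; exact keyword arguments may differ from what the commit actually adds.)

```python
import torch
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

model = AutoregressiveWrapper(TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(dim = 512, depth = 6, heads = 8)
))

prompt = torch.randint(0, 20000, (1, 4))

# sampling loop is unchanged; if the commit above enables caching internally,
# each step only processes the newly generated token instead of the full prefix
sampled = model.generate(prompt, seq_len = 256)
```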

lucidrains commented 1 year ago

also, if anyone knows any paper with interesting (and unimplemented) ideas for speeding up causal transformer sampling, do share it now while my attention is on this

LouChao98 commented 1 year ago

Some papers propose pruning the KV cache to speed up long-sequence generation, based on the attention scores each cached position has received so far.

https://arxiv.org/abs/2306.14048

https://arxiv.org/abs/2305.17118
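(A rough sketch of the "heavy hitter" pruning idea behind the papers linked above: keep the most recent positions plus the positions that have accumulated the most attention, and evict the rest. The function name, budgets, and bookkeeping here are illustrative assumptions, not either paper's exact algorithm.)

```python
import torch

def prune_kv_cache(keys, values, attn_score_sum, keep_recent = 64, keep_heavy = 64):
    """
    keys, values   : (batch, seq, dim) cached key / value projections
    attn_score_sum : (batch, seq) total attention each cached position has received
    Keeps the last `keep_recent` positions plus the `keep_heavy` older positions
    with the largest accumulated attention ("heavy hitters"); drops everything else.
    """
    b, t, _ = keys.shape
    if t <= keep_recent + keep_heavy:
        return keys, values, attn_score_sum

    # always keep the most recent window
    recent_idx = torch.arange(t - keep_recent, t, device = keys.device).expand(b, -1)

    # among the older positions, keep the ones with the highest accumulated attention
    older_scores = attn_score_sum[:, : t - keep_recent]
    heavy_idx = older_scores.topk(keep_heavy, dim = -1).indices

    keep_idx, _ = torch.sort(torch.cat((heavy_idx, recent_idx), dim = -1), dim = -1)

    gather = lambda x, idx: x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return gather(keys, keep_idx), gather(values, keep_idx), attn_score_sum.gather(1, keep_idx)
```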

lucidrains commented 1 year ago

@LouChao98 ah nice, but i don't know if i believe in that route

lucidrains commented 1 year ago

i think this should be complete now

i'll get around to some savings with absolute positional embedding at a later date