thunlp / InfLLM

Code for our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

Hello, have you tried doing lookup only in certain layers? That would reduce the number of cached KV entries. #17

Closed MrJiangZhongZheng closed 8 months ago

MrJiangZhongZheng commented 8 months ago

https://arxiv.org/pdf/2203.08913.pdf

MrJiangZhongZheng commented 8 months ago

https://arxiv.org/pdf/2203.08913.pdf — this paper suggests that doing lookup only at certain layers may actually work better.

guyan364 commented 8 months ago

We haven't tried this yet. You can modify the patch so that different layers use different streaming attention, for example have layers 0-7 and 24-31 use infinite lm. We plan to add support for this feature later.
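As a rough illustration of that suggestion (not the repo's actual patch code), a per-layer dispatch could look like the sketch below. The helper name `attention_type_for_layer`, the type strings, and the commented patch loop are assumptions for illustration only; the layer ranges follow the example in the comment above.

```python
# Hypothetical sketch: choose an attention type per decoder layer when patching
# the model, so that only some layers keep the block-level KV cache for lookup.
# Names and type strings here are illustrative, not InfLLM's real API.

INFINITE_LM_LAYERS = set(range(0, 8)) | set(range(24, 32))  # layers 0-7 and 24-31

def attention_type_for_layer(layer_idx: int) -> str:
    """Use the lighter "infinite-lm" streaming attention on the chosen layers,
    and keep "inf-llm" lookup attention on the remaining layers."""
    return "infinite-lm" if layer_idx in INFINITE_LM_LAYERS else "inf-llm"

# Inside a patch loop over the model's decoder layers one could then do
# something like (hypothetical patch interface):
#
# for idx, layer in enumerate(model.model.layers):
#     layer.self_attn.forward = build_forward(attention_type_for_layer(idx), config)
```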