hasanar1f / HiRED

HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.
https://www.arxiv.org/abs/2408.10945
MIT License
9 stars, 3 forks
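
To make the idea of dropping visual tokens under a fixed budget concrete, here is a minimal sketch in plain Python. It is not HiRED's actual algorithm: the function name `select_token_indices` and the per-token importance `scores` are hypothetical, and the description above does not say how the repository computes scores or distributes the budget. The sketch only shows the generic step of keeping the highest-scoring tokens while preserving their original order.

```python
from typing import List, Sequence


def select_token_indices(scores: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring tokens.

    `scores` is an assumed per-token importance value (hypothetical; the
    real scoring method is not described here). Indices are returned in
    ascending order so the kept tokens retain their original sequence order.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")

    # Never keep more tokens than exist.
    budget = min(budget, len(scores))

    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Keep the top `budget` indices, restored to positional order.
    return sorted(ranked[:budget])


if __name__ == "__main__":
    example_scores = [0.1, 0.9, 0.3, 0.7]
    kept = select_token_indices(example_scores, budget=2)
    print(kept)
```

With a budget of 2, only the two tokens with the largest scores survive, and everything else is dropped before the language model sees the sequence.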