hasanar1f / HiRED

HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.
https://www.arxiv.org/abs/2408.10945

pre-trained models. #1


Han-jiaxin commented 1 week ago

Hello,

I am very interested in your work and was wondering when you plan to release the training scripts and pre-trained models.

Thank you!

hasanar1f commented 1 week ago

Hello. Thank you for your interest. There are no training scripts since this is a training-free approach. The pre-trained models can be found on the Hugging Face Hub, e.g., LLaVA-Next-7B: https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf. Running the evaluation scripts will automatically download the weights and datasets.
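
For reference, loading that base checkpoint with the standard Hugging Face `transformers` LLaVA-Next classes looks roughly like the sketch below; the first call downloads the weights into your local cache, which is what the evaluation scripts rely on. This is a minimal illustration, not the repo's evaluation code, and exact dtype/device settings may differ from what HiRED uses.

```python
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Checkpoint referenced above; downloaded and cached automatically on first use.
model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"

# Processor handles image preprocessing and prompt tokenization.
processor = LlavaNextProcessor.from_pretrained(model_id)

# Load the model itself (half precision, spread across available GPUs).
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

Subsequent runs reuse the cached weights, so no manual download step is needed before running the evaluation scripts.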

Thanks