Wouldn't a 2048 max length run out of context if 6 images are put in?

thunlp / LLaVA-UHD

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images


Wouldn't a 2048 max length run out of context if 6 images are put in? #16

Open OpenJarvisAI opened 5 months ago

OpenJarvisAI commented 5 months ago

As far as I know, CLIP ViT 336 produces 576 visual tokens per image, and UHD stacks 6 of them, which is 3000+ visual tokens (6 × 576 = 3456). How can that many tokens be sent to the LLM when the max length is 2048?
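To make the arithmetic behind the question concrete, here is a minimal sketch. It assumes a patch size of 14 (the patch size that yields 576 tokens at 336 px), takes the 2048 limit and the 6 images from the issue, and counts only raw patch tokens with no compression or resampling. The function and variable names are hypothetical and are not taken from the LLaVA-UHD code.

```python
def visual_tokens_per_image(image_size: int = 336, patch_size: int = 14) -> int:
    """Count the patch tokens a ViT emits for one square image (CLS token excluded).

    patch_size=14 is an assumption; it is the value that gives 576 tokens at 336 px.
    """
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    return patches_per_side * patches_per_side   # 24 * 24 = 576


# Numbers taken from the issue text.
num_images = 6
max_length = 2048

tokens_per_image = visual_tokens_per_image()
total_visual_tokens = tokens_per_image * num_images

# Positive overflow means the raw visual tokens alone exceed the context,
# before any text tokens are added.
overflow = total_visual_tokens - max_length

print(f"tokens per image:    {tokens_per_image}")
print(f"total visual tokens: {total_visual_tokens}")
print(f"over the limit by:   {overflow}")
```

Under these assumptions the raw visual tokens alone exceed the 2048 limit, which is the concern raised above; the sketch says nothing about how the model actually handles them.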