Open OpenJarvisAI opened 5 months ago
As far as I know, CLIP ViT-336 produces 576 visual tokens per image, and UHD stacks 6 of them, which comes to 3000+ visual tokens. How can that many tokens be sent to the LLM?
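
For reference, the arithmetic behind these numbers can be sketched as below. This is only a rough check, assuming the encoder is CLIP ViT-L/14 at 336 px (the patch size of 14 is an assumption, since the post only says "clip vit 336"), that each of the 6 slices is encoded at full resolution, and that no token compression is applied before the LLM. The function name `vit_token_count` is hypothetical.

```python
def vit_token_count(image_size: int = 336, patch_size: int = 14) -> int:
    """Patch tokens a ViT emits for a square image (CLS token not counted).

    patch_size=14 is an assumption (ViT-L/14); the post does not state it.
    """
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# 336 / 14 = 24 patches per side, so 24 * 24 tokens per slice.
tokens_per_slice = vit_token_count()

# The post describes 6 stacked slices; assume each is encoded independently.
num_slices = 6
total_tokens = tokens_per_slice * num_slices

print(f"tokens per slice: {tokens_per_slice}")
print(f"total visual tokens: {total_tokens}")
```

Under these assumptions the total is 6 × 576 = 3456, which matches the "3000+" figure in the question.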