thunlp / LLaVA-UHD

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

[Question] Proof about Range of Slice Aspect Ratios #15

Open JJJYmmm opened 7 months ago

JJJYmmm commented 7 months ago

[Two images attached]

It seems that $|\log r|$ should be $\left|\log \frac{W_I}{H_I} + \log \frac{n}{m}\right|$.
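A short sketch of where this expression would come from. It assumes the image of size $W_I \times H_I$ is split into $m$ columns and $n$ rows, so that each slice is $W_I/m$ wide and $H_I/n$ tall. That convention is an assumption, chosen to match the $\frac{n}{m}$ factor in the expression; with the roles of $m$ and $n$ swapped, the second term changes sign.

$$
r = \frac{W_I / m}{H_I / n} = \frac{W_I}{H_I} \cdot \frac{n}{m}
\quad\Longrightarrow\quad
|\log r| = \left| \log \frac{W_I}{H_I} + \log \frac{n}{m} \right|
$$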

JJJYmmm commented 7 months ago

Another question: the README says the model can be trained efficiently in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). Why is the training more efficient?

It seems that the image modularization strategy would cost more time or memory in the image encoding stage, since one image is divided into several parts.

So is the efficiency due to fewer visual tokens (a perceiver rather than an MLP projection)? Looking forward to your reply :)
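To make the token-count hypothesis concrete, here is a rough back-of-the-envelope sketch. The helper `visual_tokens` is hypothetical, and the numbers are assumptions for illustration, not values confirmed in this thread: 576 tokens for a single 336×336 view with patch size 14 (24 × 24 patches), and 64 tokens per view after a perceiver-style compression, with up to 6 slices plus one overview image.

```python
def visual_tokens(num_slices: int, tokens_per_view: int, include_overview: bool = True) -> int:
    """Total visual tokens fed to the LLM for one image (hypothetical helper)."""
    # Each slice is one encoded view; the optional overview adds one more view.
    views = num_slices + (1 if include_overview else 0)
    return views * tokens_per_view


# Assumed numbers, for illustration only.
# Single view, all 24 x 24 patch tokens kept (MLP-projection style).
single_view = visual_tokens(num_slices=0, tokens_per_view=576)
# 6 slices + 1 overview, each compressed to 64 tokens (perceiver style).
sliced_compressed = visual_tokens(num_slices=6, tokens_per_view=64)

print(single_view, sliced_compressed)
```

Under these assumed numbers, the sliced and compressed input is shorter for the LLM than the single uncompressed view, even though the vision encoder runs on more views. Whether that accounts for the reported training time is the open question.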