JJJYmmm opened this issue 7 months ago
Another question: the README says the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). Why is the training more efficient?
It seems that the image modularization strategy should cost more time or memory in the image encoding stage, since one image is divided into several slices.
So is the efficiency gain mainly due to fewer visual tokens (the perceiver resampler rather than an MLP projection)?
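For concreteness, here is a rough back-of-envelope sketch of the token counts I have in mind; the slice count and per-slice query count are placeholder assumptions on my part, not numbers from the paper:

```python
# Rough token-count comparison (illustrative numbers only).

def mlp_projection_tokens(image_size=336, patch_size=14):
    """LLaVA-1.5-style MLP projection: one visual token per ViT patch."""
    per_side = image_size // patch_size
    return per_side * per_side  # 24 * 24 = 576 tokens per image

def perceiver_tokens(num_slices=6, queries_per_slice=64):
    """Resampler-style compression: a fixed number of query tokens per slice,
    regardless of how many patches each slice produces.
    num_slices and queries_per_slice are hypothetical values."""
    return num_slices * queries_per_slice

if __name__ == "__main__":
    print("MLP projection :", mlp_projection_tokens())  # 576
    print("Perceiver-style:", perceiver_tokens())       # 384 (hypothetical)
```

If the per-slice compression is aggressive enough, the total visual token count could stay below the 576 tokens of a plain MLP projection even with several slices, which would explain the overall speedup despite the extra encoding work.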
Also, a separate note: it seems that $|\log r|$ should be $\left|\log \frac{W_I}{H_I} + \log \frac{n}{m}\right|$. Looking forward to your reply :)