sm745052 opened this issue 4 days ago
Hey @sm745052 !
Indeed, the paper says that multiple images should not be divided into patches, but the LLaVA-OV model was shipped following the inference pipeline from their demo notebook here. In the notebook, the images are divided into patches in multi-image cases.
I think they might have made a mistake in the notebook, and thus we shipped the wrong inference pipeline; or the paper meant using only the base image when tuning the model with many images. Can you also open an issue in the LLaVA-VL repo to clarify this? We can make the appropriate changes if they confirm that the inference notebook is not correct :)
I was trying to find the implementation of where the patches are being created. My understanding from the paper is that when there are multiple images, the complete images should be used instead of creating patches for each image. However, I could not find where this is implemented.
I was looking here https://github.com/huggingface/transformers/blob/33868a057c02f0368ba63bd1edb746be38fe3d90/src/transformers/models/llava_onevision/image_processing_llava_onevision.py#L680
I am referring to this excerpt. Any help would be appreciated.
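To make the two strategies under discussion concrete, here is a minimal sketch of the difference between patching each image and using only the base image in multi-image cases. This is an illustrative toy, not the `transformers` `LlavaOnevisionImageProcessor` code; the function names, the assumed base resolution, and the plain grid tiling are all simplifications of the anyres scheme.

```python
import numpy as np

# Assumed base resolution of the vision tower (illustrative value).
BASE_SIZE = 384

def split_into_patches(image: np.ndarray, patch_size: int = BASE_SIZE):
    """Divide an (H, W, C) image into non-overlapping patch_size tiles.

    This mimics, in simplified form, the anyres-style patching applied
    to single images.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

def preprocess(images, multi_image: bool):
    """Sketch of the behavior described in the paper.

    multi_image=True : keep only the base image per input (no patching).
    multi_image=False: keep the base image plus its grid patches.
    """
    if multi_image:
        return [[img] for img in images]
    return [[img] + split_into_patches(img) for img in images]
```

For a 768x768 input, single-image preprocessing yields the base image plus four 384x384 tiles, while multi-image preprocessing keeps one image per input. The bug being reported is, in these terms, that the shipped pipeline takes the `multi_image=False` branch even for multi-image inputs.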