Open tomyoung903 opened 2 months ago
Yes, we've conducted all our experiments with cropping.
Padding would retain all the information, but it would lower the effective resolution and doesn't align with the training strategy of the image models (CLIP and llava-next).
I notice that the image processor crops rectangular images into square images, which inevitably loses some information.
It seems that cropping is also used during training.
What if we want to caption rectangular videos without losing edges to cropping?
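To make the crop-versus-pad trade-off in this thread concrete, here is a minimal sketch of the geometry only. It is not the repository's image processor: the function names are hypothetical, and the 336-pixel target side is an assumption chosen for illustration. It shows how much of a frame a center crop discards, and how few pixels the real content occupies once a padded square is resized to the model's input size.

```python
# Illustrative geometry only; names and the 336 target are assumptions,
# not taken from the project's code.

def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def cropped_fraction(width, height):
    """Fraction of the frame's area discarded by a center square crop."""
    side = min(width, height)
    return 1 - (side * side) / (width * height)


def padded_content_size(width, height, target):
    """Size of the real content after padding the frame to a square
    (side = longer edge) and resizing that square to target x target."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    w, h = 1920, 1080  # a typical 16:9 video frame
    print("crop box:", center_crop_box(w, h))
    print("area lost to cropping:", cropped_fraction(w, h))
    print("content size after pad + resize:", padded_content_size(w, h, 336))
```

For a 16:9 frame, cropping keeps full resolution in the center but drops the left and right edges, while padding keeps everything but leaves the content with far fewer rows than the square input offers. This is the resolution downgrade mentioned in the reply.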