mshukor / ViCHA

[BMVC22] Official Implementation of ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment"
MIT License

Visual Genome #3

Closed by kimihailv 1 year ago

kimihailv commented 1 year ago

Hello! Do I understand correctly that, in the case of Visual Genome, you train on the region crops together with their descriptions?

mshukor commented 1 year ago

Hi, we treat each region description as a caption for the image; we didn't use the image crops. In total we have ~5M pairs for Visual Genome.
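
For readers who want to reproduce this kind of pairing, here is a minimal sketch of how region descriptions could be flattened into image-caption pairs. It is not taken from the ViCHA code. It assumes the annotations follow a `region_descriptions.json`-style layout (a list of per-image entries, each with a `regions` list whose items carry `image_id`, `phrase`, and bounding-box fields); the function name `regions_to_caption_pairs` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List


def regions_to_caption_pairs(entries: Iterable[dict]) -> List[Dict[str, object]]:
    """Flatten per-image region lists into (image_id, caption) pairs.

    Assumed layout (not confirmed by the thread): each entry has a
    "regions" list, and each region has "image_id" and "phrase".
    Bounding-box fields (x, y, width, height) are ignored on purpose:
    every phrase is paired with the whole image, not with a crop.
    """
    pairs = []
    for entry in entries:
        for region in entry.get("regions", []):
            phrase = region.get("phrase", "").strip()
            if not phrase:
                continue  # skip regions with an empty description
            pairs.append({"image_id": region["image_id"], "caption": phrase})
    return pairs


# Hypothetical sample data in the assumed layout.
sample = [
    {
        "id": 1,
        "regions": [
            {"image_id": 1, "phrase": "a red car parked on the street",
             "x": 10, "y": 20, "width": 100, "height": 50},
            {"image_id": 1, "phrase": "a tree with green leaves",
             "x": 200, "y": 5, "width": 80, "height": 150},
        ],
    },
    {
        "id": 2,
        "regions": [
            {"image_id": 2, "phrase": "  ", "x": 0, "y": 0, "width": 5, "height": 5},
            {"image_id": 2, "phrase": "a man holding an umbrella",
             "x": 30, "y": 40, "width": 60, "height": 120},
        ],
    },
]

pairs = regions_to_caption_pairs(sample)
for pair in pairs:
    print(pair["image_id"], pair["caption"])
```

Because the same image appears once per region description, a single image contributes many pairs, which is how a dataset of this kind can reach millions of pairs from far fewer images.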

kimihailv commented 1 year ago

Thanks