salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License
1.45k stars 193 forks

About Dataset #105

Closed celestialxevermore closed 1 year ago

celestialxevermore commented 1 year ago

Dear Author, thank you for the great paper, and also for releasing your code as open source.

I want to train and evaluate your model, but how do I download the Flickr30k or COCO datasets?

Also, I would like to know what `idx` is exactly: https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_retrieval.py#L73 https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_pretrain.py#L88

Finally, I found that the `_dequeue_and_enqueue` function takes different input parameters in model_pretrain.py than in model_retrieval.py. Can you explain why? https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_pretrain.py#L130 https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_retrieval.py#L107

I'm a complete novice in multimodal learning, deep learning, and Python, so I would really appreciate your help.

Sincerely,

LiJunnan1992 commented 1 year ago

Hi, you may check out our latest vision-language library, which includes scripts for automatic dataset downloading: https://github.com/salesforce/LAVIS

`idx` refers to the image index. It is used in retrieval because Flickr and COCO both have multiple texts per image, so `idx` tells whether two texts refer to the same image.
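To make this concrete, here is a minimal sketch (not ALBEF's actual code, and the variable names are illustrative) of how comparing image indices can mark caption pairs that share an image as positives for a contrastive loss, instead of wrongly treating them as negatives:

```python
import numpy as np

# Hypothetical image indices for a batch of 3 captions:
# captions 0 and 1 both describe image 7, caption 2 describes image 3.
idx = np.array([7, 7, 3])

# pos_mask[i, j] is True when captions i and j refer to the same image,
# so that pair should count as a positive in the contrastive loss.
pos_mask = idx[:, None] == idx[None, :]

# Normalize each row into soft targets that sum to 1, spreading the
# probability mass over all captions of the same image.
sim_targets = pos_mask.astype(float) / pos_mask.sum(axis=1, keepdims=True)
```

Here `sim_targets[0]` becomes `[0.5, 0.5, 0.0]`: both captions of image 7 share the target mass, while the caption of image 3 stays a negative.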

celestialxevermore commented 1 year ago

Thanks a lot. Your response was very helpful.