Closed celestialxevermore closed 1 year ago
Hi, you may checkout our latest vision-language library which include script for automatic dataset downloading: https://github.com/salesforce/LAVIS
idx refers to the image index. The reason it is used in retrieval is because Flickr and COCO both have multiple text per image, hence idx is used to tell if two texts refer to the same image.
Thanks a lot. Your response gave me help.
Dear Author, Thank you for the great paper, and also distributing your code as open source.
I want to train and evaluate your model, But, How to download flickr30k dataset, or COCO dataset?
Plus, I want to know what is idx exactly, https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_retrieval.py#L73 https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_pretrain.py#L88
Finally, I found that the function _dequeue_and_enque input parameters in model_pretrain.py are different from model_retrieval.py, so can you give me reason why? https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_pretrain.py#L130 https://github.com/salesforce/ALBEF/blob/b9727e43c3040491774d1b22cc27718aa7772fac/models/model_retrieval.py#L107
I'm the very novice for multimodal and Deep Learning and Python, so I will really feel thankful if you help me.
Sincerly,