salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

About dataset #99

Closed: celestialxevermore closed this issue 1 year ago

celestialxevermore commented 1 year ago

Dear author, I hope you have a good Thanksgiving!

I am really fascinated by your paper, and I am thankful that you open-sourced your code.

I have a question about what exactly 'idx' means.

for i,(image, text, idx) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):

I have only worked with the MSVD and MSRVTT datasets, which contain

text_ids, text_attention_mask, and text_token_type_ids for text, and the raw image and image_mask for the vision modality,

but I cannot make sense of what 'idx' means in the MSCOCO dataset.

My guess is that 'idx' is the index of the video and text pair, but

I cannot find its exact meaning.
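
To make my guess concrete, here is a minimal sketch of what I imagine the loader yields. This is not ALBEF's actual code: ToyPairDataset, batches, and the file names are hypothetical, and I am assuming that 'idx' is an integer id shared by all captions of the same image. Whether that assumption matches the real dataset is exactly what I am unsure about.

```python
# Hypothetical toy dataset, NOT ALBEF's code: each item is (image, text, idx).
class ToyPairDataset:
    def __init__(self, pairs):
        # pairs: list of (image_name, caption) tuples
        self.pairs = pairs
        # Assumption: each distinct image gets one integer id, so several
        # captions of the same image share the same idx.
        self.img_ids = {}
        for image_name, _ in pairs:
            if image_name not in self.img_ids:
                self.img_ids[image_name] = len(self.img_ids)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        image_name, caption = self.pairs[i]
        return image_name, caption, self.img_ids[image_name]


def batches(dataset, batch_size):
    """Yield (images, texts, idxs) lists, mimicking a collated data loader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[j] for j in range(start, stop)]
        images, texts, idxs = zip(*items)
        yield list(images), list(texts), list(idxs)


pairs = [("a.jpg", "a dog"), ("a.jpg", "a puppy"), ("b.jpg", "a cat")]
data_loader = batches(ToyPairDataset(pairs), batch_size=2)

seen = []
# Same unpacking pattern as the training loop quoted above.
for i, (image, text, idx) in enumerate(data_loader):
    seen.append(idx)
```

Under this assumption, the two captions of a.jpg would carry the same idx while b.jpg would get a different one.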

Thank you for reading my question.