salesforce / ALPRO

Align and Prompt: Video-and-Language Pre-training with Entity Prompts
BSD 3-Clause "New" or "Revised" License

An academic issue with your paper #14

Closed chenhaishun closed 2 years ago

chenhaishun commented 2 years ago

In the video encoder, the output is {v_cls, v_1, ..., v_k}, so its dimension is (k+1)*d. The dimension of the multi-modal video-text encoder should therefore be (k+N_t+1)*d. However, the paper states that the dimension of the multi-modal video-text encoder is (N_v+N_t+1)*d. I'm confused about this...

dxli94 commented 2 years ago

Thanks for spotting this. There does indeed seem to be a typo here: the multi-modal encoder output should have (K + N_t + 1) tokens, with K the number of video patches, N_t the number of text tokens, and 1 for the class token.
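
To make the count concrete, here is a minimal sketch of the arithmetic. It assumes the multi-modal encoder simply receives the video tokens and the text tokens concatenated along the sequence axis. The values of `K`, `N_t`, and `d` and the helper `make_tokens` are hypothetical and are not taken from the paper or the ALPRO code.

```python
# Illustrative values only: K, N_t and d are hypothetical placeholders,
# not the settings used in the paper or the repository.
K = 4      # number of video patch tokens
N_t = 3    # number of text tokens
d = 8      # embedding dimension


def make_tokens(count, dim):
    """Return `count` placeholder embeddings, each a list of `dim` zeros."""
    return [[0.0] * dim for _ in range(count)]


# Video encoder output {v_cls, v_1, ..., v_K}: K patch tokens plus one class token.
video_tokens = make_tokens(K + 1, d)

# Text side contributes N_t tokens.
text_tokens = make_tokens(N_t, d)

# Assumption: the multi-modal encoder sees the two sequences concatenated.
multimodal_tokens = video_tokens + text_tokens

seq_len = len(multimodal_tokens)
embed_dim = len(multimodal_tokens[0])

# The sequence length is K + N_t + 1, and each token keeps dimension d.
assert seq_len == K + N_t + 1
assert embed_dim == d
print(seq_len, embed_dim)
```

Under these assumptions the sequence length depends on K, the number of patch tokens actually emitted by the video encoder, which is why (K + N_t + 1) rather than (N_v + N_t + 1) is the consistent count.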

Thanks.

chenhaishun commented 2 years ago

Thanks for your reply!