microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

Questions on retrieval result and "Info: Weight doesn't exsits" #24

Closed: HenryHZY closed this issue 2 years ago

HenryHZY commented 2 years ago

Hi @ArrowLuo, thanks for your great project! I would like to ask some questions about the retrieval results and the "Info: Weight doesn't exsits" message.

  1. I have finished fine-tuning on MSR-VTT retrieval & captioning with the pre-trained model. Why is my retrieval result with FT-Align a bit lower than the FT-Joint result in the README? By the way, I directly used the default settings and the commands from the README (a sketch of how these metrics are computed follows at the end of this list).

    retrieval, FT-Joint, 8 A100 GPUs
    R@1: 0.2560 - R@5: 0.5510 - R@10: 0.6860 - Median R: 4.0

    retrieval, FT-Align, 8 A100 GPUs
    R@1: 0.2620 - R@5: 0.5500 - R@10: 0.6920 - Median R: 4.0

    For reference, the README's FT-Joint numbers, which mine are close to: R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0


2. I have also finished extracting video features, pre-training on HowTo100M, and fine-tuning on MSR-VTT retrieval & captioning.
I am curious about the "INFO: Weight doesn't exsits." message that appears during both pre-training and fine-tuning.
It seems these lines remind me to load pre-trained weights for the video encoder, cross encoder, and decoder, respectively (see the loader sketch after the log lines below).
Have you conducted any experiments with these pre-trained modules?

    INFO: Weight doesn't exsits. /nvme/UniVL/modules/visual-base/visual_pytorch_model.bin
    INFO: Weight doesn't exsits. /nvme/UniVL/modules/cross-base/cross_pytorch_model.bin
    INFO: Weight doesn't exsits. /nvme/UniVL/modules/decoder-base/decoder_pytorch_model.bin
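
My reading of those lines (a guess from the message text, not a quote of the repository's code) is that each optional module checkpoint is checked with something like the following, and the module falls back to its config-based initialization when the file is missing; `load_optional_weight` is a hypothetical helper:

```python
import os
import logging

import torch

logger = logging.getLogger(__name__)

def load_optional_weight(weight_path):
    """Return a state dict if the optional checkpoint exists, otherwise None.

    Hypothetical helper; it only illustrates where the INFO line could come from.
    """
    if weight_path is not None and os.path.exists(weight_path):
        return torch.load(weight_path, map_location="cpu")
    logger.info("Weight doesn't exsits. %s", weight_path)  # message kept verbatim, typo included
    return None
```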
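
And for the metrics in question 1: a minimal sketch (not the repository's evaluation code) of how R@1/R@5/R@10 and Median R are typically computed from a text-to-video similarity matrix; `sim` is a hypothetical square array whose diagonal entries score the ground-truth pairs:

```python
import numpy as np

def retrieval_metrics(sim):
    """R@K and Median R for a square similarity matrix whose diagonal
    entries correspond to the ground-truth text-video pairs."""
    order = np.argsort(-sim, axis=1)                             # videos sorted by descending similarity
    ranks = np.where(order == np.arange(len(sim))[:, None])[1]   # rank of the correct video per text query
    return {
        "R@1": float(np.mean(ranks < 1)),
        "R@5": float(np.mean(ranks < 5)),
        "R@10": float(np.mean(ranks < 10)),
        "Median R": float(np.median(ranks) + 1),                 # 1-based median rank
    }
```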

ArrowLuo commented 2 years ago

Hi @HenryHZY,

  1. You can test on 4 GPUs instead of 8 GPUs, or double the --batch_size when using 8 GPUs. Then we can discuss the results. I am not sure what affects the performance now.
  2. The log of these three lines is redundant and does not affect pre-training, training, or inference. Just ignore them, or regard them as dirty information. Thanks.
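
For anyone reproducing this, here is the batch-size arithmetic behind that suggestion, as a sketch under one assumption (not verified against the training script): that --batch_size is the global batch split evenly across the GPUs. All concrete numbers below are hypothetical.

```python
def per_gpu_batch(batch_size, n_gpu):
    # Assumption: --batch_size is the global batch divided across GPUs by the script.
    return batch_size // n_gpu

# A 4-GPU reference run vs. an 8-GPU run with --batch_size doubled:
print(per_gpu_batch(128, 4))   # -> 32 samples per GPU
print(per_gpu_batch(256, 8))   # -> 32 samples per GPU, i.e. the same per-GPU batch
```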
HenryHZY commented 2 years ago

> Hi @HenryHZY,
>
> 1. You can test on 4 GPUs instead of 8 GPUs, or double the --batch_size when using 8 GPUs. Then we can discuss the results. I am not sure what affects the performance now.
> 2. The log of these three lines is redundant and does not affect pre-training, training, or inference. Just ignore them, or regard them as dirty information. Thanks.

@ArrowLuo Thanks for your quick reply! Actually, I have also tested with 4 A100 GPUs; I will run the doubled batch_size experiment with 8 A100 GPUs later.

retrieval, FT-Align, 4 A100 GPUs
R@1: 0.2510 - R@5: 0.5780 - R@10: 0.7010 - Median R: 4.0

Maybe I need to change some hyperparameters, such as the epochs, batch_size, and lr, to obtain a better result?

Do you have any other experience to share on the fine-tuning experiments? For example, as in your answer in https://github.com/microsoft/UniVL/issues/18, should I increase the batch_size as much as my GPUs allow?

ArrowLuo commented 2 years ago

Hi @HenryHZY, yes, the epochs, batch_size, and lr are important for the retrieval tasks. I cannot remember other fine-tuning details/tricks now, since it has been a long time since I worked on this.

HenryHZY commented 2 years ago

Hi @ArrowLuo, I would like to ask whether the input of UniVL is video-sentences, clip-sentence, or clip-sentences.

Following your instructions, I obtained the video features and text features. Given a video_id_x that covers the time interval [0, m-1] seconds, after feature extraction video_id_x.npy is an np.array with shape [m, 1024].

Suppose that video_id_x has n video clips with n corresponding sentences (as defined in caption.pickle):

"video_id_x":{
        "start":[s_1, s_2, ..., s_n],
        "end":[e_1, e_2, ..., e_n],
        "text":["t_1", "t_2", ..., "t_n"]
    }
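
For concreteness, this is how I picture the pairing of the feature file with the caption dictionary (a sketch under my assumptions about the format; the paths are placeholders):

```python
import pickle

import numpy as np

# Placeholder paths; the real files come from the feature-extraction step.
features = np.load("video_id_x.npy")            # expected shape: [m, 1024], one feature per second
with open("caption.pickle", "rb") as f:
    captions = pickle.load(f)

entry = captions["video_id_x"]
# n clips and n sentences for this video.
assert len(entry["start"]) == len(entry["end"]) == len(entry["text"])
```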


Then, what is the shape of the original input tokens to UniVL? Is it a single video clip with its single sentence? Take the time interval [s_1, e_1] of the first video clip as an example:

video tokens: [e_1-s_1+1, 1024]
text tokens: [tokens_sum_of_t_1, word_token_embedding_size]

Are all the above data formats correct, including [m, 1024], [e_1-s_1+1, 1024] and [tokens_sum_of_t_1, word_token_embedding_size]?
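
Stated as code, this is what I assume the per-clip input looks like (a sketch; the tokenizer below is the standard BERT tokenizer from Hugging Face Transformers, not necessarily the exact wrapper used in this repo, and the clip bounds are hypothetical):

```python
import numpy as np
from transformers import BertTokenizer

features = np.load("video_id_x.npy")        # [m, 1024] second-level video features
s_1, e_1 = 0, 9                             # hypothetical bounds of the first clip, in seconds
clip_feat = features[s_1:e_1 + 1]           # video tokens for the clip: [e_1 - s_1 + 1, 1024]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer.encode("a placeholder caption sentence t_1")  # tokens_sum_of_t_1 ids (plus special tokens)
# Inside the model these ids are looked up in the word embedding table, giving text
# tokens of shape [tokens_sum_of_t_1, word_token_embedding_size].
print(clip_feat.shape, len(token_ids))
```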

Thanks for your time!