Hi @HenryHZY,
- You can test on 4 GPUs instead of 8 GPUs, or double the `--batch_size` when using 8 GPUs (see the sketch after this reply). Then we can discuss the results. I am not sure what affects the performance now.
- The log of these three lines is redundant and does not affect pretraining, training, or inference. Just ignore them, or regard them as dirty information. Thanks.
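For reference, here is the batch-size arithmetic behind that suggestion as a minimal Python sketch. It assumes, as is common in DistributedDataParallel setups (this is an assumption, not verified against the repo's scripts), that `--batch_size` is a global batch divided evenly across GPUs, so doubling it on 8 GPUs keeps the per-GPU batch the same as in the 4-GPU run:

```python
# Minimal sketch of the batch-size scaling suggested above.
# Assumption (not verified against the repo): --batch_size is a
# *global* batch that the training script splits evenly across GPUs.

def per_gpu_batch(global_batch_size: int, n_gpu: int) -> int:
    return global_batch_size // n_gpu

# Doubling the global batch when moving from 4 to 8 GPUs keeps the
# per-GPU batch (here 32) unchanged.
assert per_gpu_batch(128, 4) == per_gpu_batch(256, 8) == 32
```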
@ArrowLuo Thanks for your quick reply! Actually, I have also tested with 4 A100 GPUs. The doubled-batch_size experiment with 8 A100 GPUs will be conducted later.
retrieval, FT-Align, 4 A100 GPUs
R@1: 0.2510 - R@5: 0.5780 - R@10: 0.7010 - Median R: 4.0
Maybe I need to change some parameters, such as epochs, batch_size, and lr, to obtain a better result?
Do you have any other experience to share on the fine-tuning experiments? For example, following your answer in https://github.com/microsoft/UniVL/issues/18, should I increase the batch_size as much as possible to fully use my GPUs?
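As background for comparing the numbers above, R@K and Median R can be computed from a text-to-video similarity matrix roughly as follows. This is a minimal numpy sketch, not the repo's evaluation code; `sim` is a hypothetical [n_text, n_video] score matrix whose ground-truth pairs are assumed to lie on the diagonal:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """R@1/5/10 and Median R for a square similarity matrix whose
    ground-truth text-video pairs lie on the diagonal (a common
    convention; an assumption here, not taken from the repo)."""
    order = np.argsort(-sim, axis=1)  # videos sorted by descending score per text
    # Position of the ground-truth video in each row's ranking (0 = best).
    ranks = np.where(order == np.arange(sim.shape[0])[:, None])[1]
    return {
        "R@1": float(np.mean(ranks < 1)),
        "R@5": float(np.mean(ranks < 5)),
        "R@10": float(np.mean(ranks < 10)),
        "Median R": float(np.median(ranks) + 1),  # ranks reported 1-indexed
    }
```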
Hi @HenryHZY, yes, the epochs, batch_size, and lr are important for the retrieval tasks. I cannot remember other fine-tuning details/tricks now, since it has been a long time since I worked on this.
Hi @ArrowLuo, I would like to ask: is the input of UniVL video-sentences, clip-sentence, or clip-sentences?
Following your instructions, I obtained the video features and text features. Given a video_id_x that spans the time interval [0, m-1] seconds, after feature extraction video_id_x.npy is a np.array with shape [m, 1024].
Suppose that video_id_x has n video clips with n corresponding sentences (as defined in caption.pickle):
"video_id_x":{
"start":[s_1, s_2, ..., s_n],
"end":[e_1, e_2, ..., e_n],
"text":["t_1", "t_2", ..., "t_n"]
}
Then, what is the shape of the original input tokens to UniVL? Is each input a single video clip paired with its single sentence? Take the time interval [s_1, e_1] of the first video clip as an example:
video tokens: [e_1 - s_1 + 1, 1024]
text tokens: [tokens_sum_of_t_1, word_token_embedding_size]
Are all of the above data formats correct, i.e., [m, 1024], [e_1 - s_1 + 1, 1024], and [tokens_sum_of_t_1, word_token_embedding_size]?
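For concreteness, here is a minimal sketch of how those shapes line up, assuming one feature vector per second (so the clip [s_1, e_1] maps to rows s_1..e_1 of the feature array). The file names follow this thread; the int() rounding of the clip boundaries is an assumption, not necessarily the repo's exact preprocessing:

```python
import pickle
import numpy as np

# Per-second video features: shape [m, 1024] for a video spanning [0, m-1] seconds.
video_feat = np.load("video_id_x.npy")
m, dim = video_feat.shape
assert dim == 1024

# Clip boundaries and sentences, laid out as in the caption.pickle snippet above.
with open("caption.pickle", "rb") as f:
    captions = pickle.load(f)
entry = captions["video_id_x"]

# First clip [s_1, e_1]: one feature row per second, inclusive on both ends,
# which yields the [e_1 - s_1 + 1, 1024] video-token shape asked about above.
s_1, e_1 = int(entry["start"][0]), int(entry["end"][0])
clip_feat = video_feat[s_1 : e_1 + 1]
assert clip_feat.shape == (e_1 - s_1 + 1, 1024)

# Text side: the sentence t_1 is tokenized into tokens_sum_of_t_1 ids; the
# [tokens_sum_of_t_1, word_token_embedding_size] shape only appears after
# the model's embedding lookup, not in the raw input.
t_1 = entry["text"][0]
```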
Thanks for your time!
Hi @ArrowLuo, thanks for your great project! I would like to ask some questions about the retrieval results and the "INFO: Weight doesn't exsits" messages.
retrieval, FT-Align, 8 A100 GPUs
R@1: 0.2620 - R@5: 0.5500 - R@10: 0.6920 - Median R: 4.0
These results are close to the FT-Joint results: R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0
INFO: Weight doesn't exsits. /nvme/UniVL/modules/visual-base/visual_pytorch_model.bin
INFO: Weight doesn't exsits. /nvme/UniVL/modules/cross-base/cross_pytorch_model.bin
INFO: Weight doesn't exsits. /nvme/UniVL/modules/decoder-base/decoder_pytorch_model.bin
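Those three INFO lines just report that no local checkpoint was found at the given paths; as noted in the reply above, they are harmless. The underlying check is presumably something like the following sketch (a guess at the logic, reproducing the log's own spelling; not the repo's verbatim code):

```python
import os
import logging

logger = logging.getLogger(__name__)

def load_optional_weights(weight_path):
    """Return a state dict if a local checkpoint exists, else None.
    A guess at the logic behind the INFO lines above."""
    if not os.path.exists(weight_path):
        # Spelling kept as in the actual log output.
        logger.info("Weight doesn't exsits. %s", weight_path)
        return None
    import torch
    return torch.load(weight_path, map_location="cpu")
```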