Closed cty8998 closed 3 months ago
Thank you for your interest. I'm sorry, the batch size should be 256 (the original version used 128). I retrained the model with the official code on another reconfigured server (4 x V100); the training log file can be found at PAU/log/trainlog.out. The result seems reasonable.
The best model checkpoint from this run can be downloaded at url. The learned $\beta_1$: -0.10314916992187637, $\beta_2$: 0.1449316406249987.
I also re-evaluated my released model "MSRVTT-pytorch_model.bin.0". The test log file is as follows.
If there is a large performance discrepancy, I suspect it comes from an environmental difference, such as the torch version. Are you using a torch version greater than 2.0? Could you please provide more detailed environment information for further analysis?
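To make it easier to share the environment details requested above, here is a minimal helper sketch (not part of the PAU repo) that collects the Python/torch/CUDA versions most likely to explain reproduction discrepancies; it only uses standard PyTorch attributes and degrades gracefully if torch is not installed:

```python
import sys

def environment_info():
    """Collect version info relevant to reproducing the results.

    Returns a dict so the caller can print or post it as-is.
    """
    info = {"python": sys.version.split()[0]}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda          # CUDA toolkit torch was built with
        if torch.cuda.is_available():
            info["gpu"] = "%d x %s" % (
                torch.cuda.device_count(),
                torch.cuda.get_device_name(0),
            )
    except ImportError:
        info["torch"] = "not installed"
    return info

if __name__ == "__main__":
    for key, value in environment_info().items():
        print(f"{key}: {value}")
```

Posting this output alongside the evaluation log would make it much easier to compare the two setups.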
01/24/2024 20:52:07 - INFO - Effective parameters:
01/24/2024 20:52:07 - INFO - <<< K: 8
01/24/2024 20:52:07 - INFO - <<< batch_size: 256
01/24/2024 20:52:07 - INFO - <<< batch_size_val: 40
01/24/2024 20:52:07 - INFO - <<< cache_dir:
01/24/2024 20:52:07 - INFO - <<< coef_lr: 0.001
01/24/2024 20:52:07 - INFO - <<< cross_model: cross-base
01/24/2024 20:52:07 - INFO - <<< cross_num_hidden_layers: 4
01/24/2024 20:52:07 - INFO - <<< data_path: ../data/VR_Dataset/MSRVTT/msrvtt_data/MSRVTT_data.json
01/24/2024 20:52:07 - INFO - <<< datatype: msrvtt
01/24/2024 20:52:07 - INFO - <<< do_eval: True
01/24/2024 20:52:07 - INFO - <<< do_lower_case: False
01/24/2024 20:52:07 - INFO - <<< do_pretrain: False
01/24/2024 20:52:07 - INFO - <<< do_train: False
01/24/2024 20:52:07 - INFO - <<< epochs: 5
01/24/2024 20:52:07 - INFO - <<< eval_frame_order: 0
01/24/2024 20:52:07 - INFO - <<< expand_msrvtt_sentences: True
01/24/2024 20:52:07 - INFO - <<< feature_framerate: 1
01/24/2024 20:52:07 - INFO - <<< features_path: ../data/VR_Dataset/MSRVTT/videos/all
01/24/2024 20:52:07 - INFO - <<< fp16: False
01/24/2024 20:52:07 - INFO - <<< fp16_opt_level: O1
01/24/2024 20:52:07 - INFO - <<< freeze_layer_num: 0
01/24/2024 20:52:07 - INFO - <<< gradient_accumulation_steps: 1
01/24/2024 20:52:07 - INFO - <<< hard_negative_rate: 0.5
01/24/2024 20:52:07 - INFO - <<< init_model: /home/lihao/PAU/log/MSRVTT-pytorch_model.bin.0
01/24/2024 20:52:07 - INFO - <<< lambda1: 1
01/24/2024 20:52:07 - INFO - <<< lambda2: 100
01/24/2024 20:52:07 - INFO - <<< lambda3: 0.025
01/24/2024 20:52:07 - INFO - <<< linear_patch: 2d
01/24/2024 20:52:07 - INFO - <<< local_rank: 0
01/24/2024 20:52:07 - INFO - <<< loose_type: True
01/24/2024 20:52:07 - INFO - <<< lr: 0.0001
01/24/2024 20:52:07 - INFO - <<< lr_decay: 0.9
01/24/2024 20:52:07 - INFO - <<< margin: 0.1
01/24/2024 20:52:07 - INFO - <<< max_frames: 12
01/24/2024 20:52:07 - INFO - <<< max_words: 32
01/24/2024 20:52:07 - INFO - <<< n_display: 10
01/24/2024 20:52:07 - INFO - <<< n_gpu: 1
01/24/2024 20:52:07 - INFO - <<< n_pair: 1
01/24/2024 20:52:07 - INFO - <<< negative_weighting: 1
01/24/2024 20:52:07 - INFO - <<< num_thread_reader: 8
01/24/2024 20:52:07 - INFO - <<< output_dir: log
01/24/2024 20:52:07 - INFO - <<< precision: fp16
01/24/2024 20:52:07 - INFO - <<< pretrained_clip_name: ViT-B/32
01/24/2024 20:52:07 - INFO - <<< rank: 0
01/24/2024 20:52:07 - INFO - <<< rerank_coe_t: 0.05
01/24/2024 20:52:07 - INFO - <<< rerank_coe_v: 0.05
01/24/2024 20:52:07 - INFO - <<< resume_opt: None
01/24/2024 20:52:07 - INFO - <<< sampled_use_mil: False
01/24/2024 20:52:07 - INFO - <<< seed: 42
01/24/2024 20:52:07 - INFO - <<< sim_header: seqTransf
01/24/2024 20:52:07 - INFO - <<< slice_framepos: 2
01/24/2024 20:52:07 - INFO - <<< task_type: retrieval
01/24/2024 20:52:07 - INFO - <<< tau: 5
01/24/2024 20:52:07 - INFO - <<< text_num_hidden_layers: 12
01/24/2024 20:52:07 - INFO - <<< train_csv: ../data/VR_Dataset/MSRVTT/msrvtt_data/MSRVTT_train.9k.csv
01/24/2024 20:52:07 - INFO - <<< train_frame_order: 0
01/24/2024 20:52:07 - INFO - <<< use_mil: False
01/24/2024 20:52:07 - INFO - <<< val_csv: ../data/VR_Dataset/MSRVTT/msrvtt_data/MSRVTT_JSFUSION_test.csv
01/24/2024 20:52:07 - INFO - <<< video_dim: 1024
01/24/2024 20:52:07 - INFO - <<< visual_num_hidden_layers: 12
01/24/2024 20:52:07 - INFO - <<< warmup_proportion: 0.1
01/24/2024 20:52:07 - INFO - <<< world_size: 1
01/24/2024 20:52:07 - INFO - device: cuda:0 n_gpu: 1
01/24/2024 20:52:09 - INFO - loading archive file /home/lihao/PAU/modules/cross-base
01/24/2024 20:52:09 - INFO - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 512,
"initializer_range": 0.02,
"intermediate_size": 2048,
"max_position_embeddings": 128,
"num_attention_heads": 8,
"num_hidden_layers": 4,
"type_vocab_size": 2,
"vocab_size": 512
}
01/24/2024 20:52:09 - INFO - Weight doesn't exsits. /home/lihao/PAU/modules/cross-base/cross_pytorch_model.bin
01/24/2024 20:52:09 - WARNING - Stage-One:True, Stage-Two:False
01/24/2024 20:52:09 - WARNING - Test retrieval by loose type.
01/24/2024 20:52:09 - WARNING - embed_dim: 512
01/24/2024 20:52:09 - WARNING - image_resolution: 224
01/24/2024 20:52:09 - WARNING - vision_layers: 12
01/24/2024 20:52:09 - WARNING - vision_width: 768
01/24/2024 20:52:09 - WARNING - vision_patch_size: 32
01/24/2024 20:52:09 - WARNING - context_length: 77
01/24/2024 20:52:09 - WARNING - vocab_size: 49408
01/24/2024 20:52:09 - WARNING - transformer_width: 512
01/24/2024 20:52:09 - WARNING - transformer_heads: 8
01/24/2024 20:52:09 - WARNING - transformer_layers: 12
01/24/2024 20:52:09 - WARNING - linear_patch: 2d
01/24/2024 20:52:09 - WARNING - cut_top_layer: 0
01/24/2024 20:52:10 - WARNING - sim_header: seqTransf
/home/lihao/miniconda3/envs/xclip/lib/python3.8/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
warnings.warn(warning.format(ret))
01/24/2024 20:52:17 - INFO - --------------------
01/24/2024 20:52:17 - INFO - Weights of PAU not initialized from pretrained model:
beta_1
beta_2
global_mat_weight
word_logit_weight
frame_logit_weight
local_mat_weight
frame_mat_weight
word_mat_weight
frame_mat_weight2
word_mat_weight2
01/24/2024 20:52:17 - INFO - Weights from pretrained model not used in PAU:
clip.input_resolution
clip.context_length
clip.vocab_size
01/24/2024 20:52:17 - INFO - ***** Running test *****
01/24/2024 20:52:17 - INFO - Num examples = 1000
01/24/2024 20:52:17 - INFO - Batch size = 40
01/24/2024 20:52:17 - INFO - Num steps = 25
01/24/2024 20:52:17 - INFO - ***** Running val *****
01/24/2024 20:52:17 - INFO - Num examples = 1000
0/25
1/25
2/25
3/25
4/25
5/25
6/25
7/25
8/25
9/25
10/25
11/25
12/25
13/25
14/25
15/25
16/25
17/25
18/25
19/25
20/25
21/25
22/25
23/25
24/25
01/24/2024 20:53:29 - INFO - sim matrix size: 1000, 1000
01/24/2024 20:53:29 - INFO - Length-T: 1000, Length-V:1000
01/24/2024 20:53:29 - INFO - Text-to-Video:
01/24/2024 20:53:29 - INFO - >>> R@1: 48.5 - R@5: 72.7 - R@10: 82.5 - Median R: 2.0 - Mean R: 14.0
01/24/2024 20:53:29 - INFO - Video-to-Text:
01/24/2024 20:53:29 - INFO - >>> V2T$R@1: 48.0 - V2T$R@5: 72.7 - V2T$R@10: 83.0 - V2T$Median R: 2.0 - V2T$Mean R: 9.9
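For reference when comparing numbers across machines, the metrics above (R@1/5/10, Median R, Mean R) are typically computed from the 1000 x 1000 similarity matrix as follows. This is a generic sketch of the standard text-to-video retrieval evaluation, not the repo's exact code:

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10 and median/mean rank from a (num_texts, num_videos)
    similarity matrix where sim[i, i] is the ground-truth pair."""
    # Indices of videos sorted by descending similarity for each text query.
    order = np.argsort(-sim, axis=1)
    # Rank (0-based) at which the ground-truth video appears for each query.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    return {
        "R@1": 100.0 * np.mean(ranks < 1),
        "R@5": 100.0 * np.mean(ranks < 5),
        "R@10": 100.0 * np.mean(ranks < 10),
        "MedianR": float(np.median(ranks) + 1),  # 1-based rank convention
        "MeanR": float(np.mean(ranks) + 1),
    }

# Toy check: an identity similarity matrix gives perfect retrieval.
m = retrieval_metrics(np.eye(4))
```

The video-to-text numbers in the log are the same computation applied to the transposed similarity matrix.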
I reproduced your code on 4 A800 GPUs. The final results are shown below:
The best model is "pytorch_model.bin.3", and its performance shows a large discrepancy with the results reported in the official paper.
Then I directly evaluated your released model "MSRVTT-pytorch_model.bin.0"; the results are shown below:
It also shows a large discrepancy with the performance reported in the official paper.
Could you please explain the reason? Thank you.