microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

Training problems when reproducing the results of the MSRVTT dataset #13

Open yaolinli opened 2 years ago

yaolinli commented 2 years ago

Hi, I want to reproduce the results on the MSRVTT dataset by training the model from scratch. Before doing that, I reproduced the MSRVTT results using the officially released checkpoint (CIDEr 54.7 on val, CIDEr 54.3 on test). Then I used the provided code to train the model. The problem is that the MLM accuracy suddenly drops after a few training epochs: in epoch 3, MLM accuracy drops to around 0.1 and the val-set CIDEr drops to 0.0. I trained with both apex O1 and apex O0. Training logs are the following:

05/26/2022 21:20:08 - INFO - __main__ -   Save checkpoint to ./experiments/output_msrvtt_new/checkpoint-2-10854
05/26/2022 21:20:08 - INFO - __main__ -   Perform evaluation at iteration 43416, global_step 10854
05/26/2022 21:21:27 - INFO - __main__ -   Inference model computing time: 0.9087514216641346 seconds per batch
05/26/2022 21:21:52 - INFO - __main__ -   evaluation result: {'Bleu_1': 0.7691577416525643, 'Bleu_2': 0.6165830281270223, 'Bleu_3': 0.47283724241898767, 'Bleu_4': 0.34628522979313686, 'METEOR': 0.25367347312377875, 'ROUGE_L': 0.5866882628069936, 'CIDEr': 0.3412248963946503, 'SPICE': 0.049780527329883785}
05/26/2022 21:21:52 - INFO - __main__ -   evaluation result saved to ./experiments/output_msrvtt_new/checkpoint-2-10854/pred.MSRVTT-v2.val_32frames.beam1.max20.eval.json
05/26/2022 21:22:30 - INFO - __main__ -   eta: 4 days, 20:42:24  iter: 43440  global_step: 10860  speed: 0.5 images/sec  loss: 4.2006 (4.8763)  loss_sparsity: 0.2658 (0.3960)  acc: 0.3333 (0.2884)  batch_time: 1.5384 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:24:35 - INFO - __main__ -   eta: 4 days, 20:50:39  iter: 43520  global_step: 10880  speed: 1.0 images/sec  loss: 4.2026 (4.8756)  loss_sparsity: 0.2658 (0.3957)  acc: 0.2647 (0.2885)  batch_time: 1.5414 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:26:41 - INFO - __main__ -   eta: 4 days, 20:43:37  iter: 43600  global_step: 10900  speed: 1.0 images/sec  loss: 4.6793 (4.8763)  loss_sparsity: 0.2660 (0.3955)  acc: 0.2727 (0.2883)  batch_time: 1.5409 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.78e-04  max mem: 38796
05/26/2022 21:28:46 - INFO - __main__ -   eta: 4 days, 20:38:21  iter: 43680  global_step: 10920  speed: 1.0 images/sec  loss: 3.9995 (4.8757)  loss_sparsity: 0.2661 (0.3953)  acc: 0.3214 (0.2883)  batch_time: 1.5404 (1.5709)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.39e-05  lr (LM): 2.77e-04  max mem: 38796
........
05/26/2022 22:39:55 - INFO - __main__ -   eta: 4 days, 19:14:52  iter: 46400  global_step: 11600  speed: 1.0 images/sec  loss: 4.2682 (4.8405)  loss_sparsity: 0.2615 (0.3876)  acc: 0.3429 (0.2903)  batch_time: 1.5413 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:42:01 - INFO - __main__ -   eta: 4 days, 19:00:52  iter: 46480  global_step: 11620  speed: 1.0 images/sec  loss: 5.2393 (4.8417)  loss_sparsity: 0.2613 (0.3874)  acc: 0.1944 (0.2901)  batch_time: 1.5403 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:44:06 - INFO - __main__ -   eta: 4 days, 19:05:06  iter: 46560  global_step: 11640  speed: 1.0 images/sec  loss: 5.4463 (4.8426)  loss_sparsity: 0.2615 (0.3872)  acc: 0.2000 (0.2899)  batch_time: 1.5401 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:46:11 - INFO - __main__ -   eta: 4 days, 18:53:37  iter: 46640  global_step: 11660  speed: 1.0 images/sec  loss: 5.8740 (4.8429)  loss_sparsity: 0.2616 (0.3870)  acc: 0.1081 (0.2899)  batch_time: 1.5391 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:48:17 - INFO - __main__ -   eta: 4 days, 18:51:16  iter: 46720  global_step: 11680  speed: 1.0 images/sec  loss: 6.3330 (4.8455)  loss_sparsity: 0.2613 (0.3868)  acc: 0.0741 (0.2895)  batch_time: 1.5399 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.74e-04  max mem: 38796
05/26/2022 22:50:22 - INFO - __main__ -   eta: 4 days, 18:46:13  iter: 46800  global_step: 11700  speed: 1.0 images/sec  loss: 6.1326 (4.8479)  loss_sparsity: 0.2611 (0.3866)  acc: 0.0526 (0.2892)  batch_time: 1.5395 (1.5708)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:52:28 - INFO - __main__ -   eta: 4 days, 18:34:46  iter: 46880  global_step: 11720  speed: 1.0 images/sec  loss: 6.0650 (4.8498)  loss_sparsity: 0.2600 (0.3864)  acc: 0.1143 (0.2889)  batch_time: 1.5386 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:54:33 - INFO - __main__ -   eta: 4 days, 18:36:20  iter: 46960  global_step: 11740  speed: 1.0 images/sec  loss: 5.8276 (4.8519)  loss_sparsity: 0.2579 (0.3861)  acc: 0.1250 (0.2886)  batch_time: 1.5383 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:56:38 - INFO - __main__ -   eta: 4 days, 18:56:10  iter: 47040  global_step: 11760  speed: 1.0 images/sec  loss: 5.7397 (4.8539)  loss_sparsity: 0.2557 (0.3859)  acc: 0.1064 (0.2883)  batch_time: 1.5380 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 22:58:44 - INFO - __main__ -   eta: 4 days, 19:09:41  iter: 47120  global_step: 11780  speed: 1.0 images/sec  loss: 6.1058 (4.8561)  loss_sparsity: 0.2536 (0.3857)  acc: 0.1111 (0.2880)  batch_time: 1.5382 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.37e-05  lr (LM): 2.73e-04  max mem: 38796
05/26/2022 23:00:50 - INFO - __main__ -   eta: 4 days, 18:31:12  iter: 47200  global_step: 11800  speed: 1.0 images/sec  loss: 6.1348 (4.8582)  loss_sparsity: 0.2515 (0.3855)  acc: 0.0857 (0.2877)  batch_time: 1.5371 (1.5707)  data_time: 0.0002 (0.0002)  lr (Visual Encoder): 1.36e-05  lr (LM): 2.73e-04  max mem: 38796
.........
05/27/2022 06:49:13 - INFO - __main__ -   ModelSaver save trial NO. 0
05/27/2022 06:49:17 - INFO - __main__ -   Save checkpoint to ./experiments/output_msrvtt_new/checkpoint-3-16281
05/27/2022 06:49:17 - INFO - __main__ -   Perform evaluation at iteration 65124, global_step 16281
05/27/2022 06:51:28 - INFO - __main__ -   Inference model computing time: 1.5383756045835564 seconds per batch
05/27/2022 06:51:47 - INFO - __main__ -   evaluation result: {'Bleu_1': 0.1594008495416768, 'Bleu_2': 0.008687056706128082, 'Bleu_3': 2.1171728390057318e-08, 'Bleu_4': 3.3589638000914805e-11, 'METEOR': 0.05546962152086046, 'ROUGE_L': 0.21322680785039416, 'CIDEr': 1.1242001133315049e-05, 'SPICE': 0.0}
05/27/2022 06:51:47 - INFO - __main__ -   evaluation result saved to ./experiments/output_msrvtt_new/checkpoint-3-16281/pred.MSRVTT-v2.val_32frames.beam1.max20.eval.json
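For reference, a collapse like the one visible in these logs can be caught early so training stops at the last good checkpoint. The sketch below is purely illustrative and not part of the SwinBERT code; all names are made up:

```python
# Hypothetical divergence watchdog -- not part of SwinBERT, purely a sketch.
# It tracks a running mean of per-step MLM accuracy and flags the sudden
# collapse seen in the logs above (acc falling from ~0.29 to ~0.1).
from collections import deque

class DivergenceWatchdog:
    def __init__(self, window=100, collapse_ratio=0.5):
        self.history = deque(maxlen=window)   # recent per-step accuracies
        self.best_mean = 0.0                  # best running mean seen so far
        self.collapse_ratio = collapse_ratio  # how far below the best counts as a collapse

    def update(self, step_acc):
        """Record one step's accuracy; return True once accuracy has collapsed."""
        self.history.append(step_acc)
        mean = sum(self.history) / len(self.history)
        self.best_mean = max(self.best_mean, mean)
        return (len(self.history) == self.history.maxlen
                and mean < self.collapse_ratio * self.best_mean)

# Sketch of use inside the training loop: when update() returns True, stop,
# restore the last good checkpoint, lower the learning rate, and resume.
```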
XIUXIUXIUBIUA commented 2 years ago

I had a similar problem.

Yuhan-Shen commented 2 years ago

Any updates? I had a similar problem when training on MSVD: the first five epochs looked good, but CIDEr dropped to zero after epoch 6. Is there a known cause and/or solution?

tes4j commented 2 years ago

I have the same problem: CIDEr drops to 0.0 in epoch 4 when reproducing the results on the MSRVTT dataset.

YoussefZiad commented 2 years ago

For anyone who may trip up on this in the future: what worked for me was reducing the learning rate in the training command. For my own custom dataset I set the learning rate to 0.00003 instead of 0.0003, and training ran without problems.
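For context, the logs above show two learning rates, with lr (Visual Encoder) ≈ 0.05 × lr (LM) (1.39e-05 vs 2.78e-04). Below is a sketch of how the 10× reduction might be applied while keeping that ratio, using plain PyTorch parameter groups; the module names and the 0.05 multiplier are inferred from the logs, not taken from the SwinBERT source:

```python
# Illustrative only: mirrors the "lr (Visual Encoder)" / "lr (LM)" split seen
# in the training logs (a 0.05 backbone multiplier). The module names are
# placeholders, not SwinBERT's actual attributes.
import torch

def build_optimizer(visual_encoder, lm_head, base_lr=3e-5, backbone_mult=0.05):
    param_groups = [
        {"params": visual_encoder.parameters(), "lr": base_lr * backbone_mult},
        {"params": lm_head.parameters(), "lr": base_lr},
    ]
    return torch.optim.AdamW(param_groups)

# base_lr=3e-5 applies the 10x reduction suggested above to both groups at once.
```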

Luckygyana commented 2 years ago

@YoussefZiad, can you help me reproduce the result? When I use the Docker environment I get a read-only file system error.

YoussefZiad commented 2 years ago

> @YoussefZiad, can you help me reproduce the result? When I use the Docker environment I get a read-only file system error.

Hi, can you show me what the error traceback looks like? I didn't use the Docker environment personally (I set it up locally), but maybe I can help if I ran into a similar issue.

Luckygyana commented 2 years ago

08/14/2022 14:19:24 - INFO - main - yaml_file: MSRVTT-v2/train_32frames.yaml
Traceback (most recent call last):
  File "src/tasks/run_caption_VidSwinBert.py", line 679, in <module>
    main(args)
  File "src/tasks/run_caption_VidSwinBert.py", line 657, in main
    train_dataloader = make_data_loader(args, args.train_yaml, tokenizer, args.distributed, is_train=True)
  File "/home/jupyter/mount/SwinBERT/src/datasets/vl_dataloader.py", line 87, in make_data_loader
    dataset = build_dataset(args, yaml_file, tokenizer, is_train=is_train)
  File "/home/jupyter/mount/SwinBERT/src/datasets/vl_dataloader.py", line 22, in build_dataset
    return dataset_class(args, yaml_file, tokenizer, tensorizer, is_train, args.on_memory)
  File "/home/jupyter/mount/SwinBERT/src/datasets/vision_language_tsv.py", line 365, in __init__
    args, yaml_file, tokenizer, tensorizer, is_train, on_memory)
  File "/home/jupyter/mount/SwinBERT/src/datasets/vision_language_tsv.py", line 44, in __init__
    self.visual_tsv = self.get_tsv_file(self.visual_file)
  File "/home/jupyter/mount/SwinBERT/src/datasets/vision_language_tsv.py", line 129, in get_tsv_file
    tsv_path = find_file_path_in_yaml(tsv_file, self.root)
  File "/home/jupyter/mount/SwinBERT/src/utils/load_files.py", line 74, in find_file_path_in_yaml
    errno.ENOENT, os.strerror(errno.ENOENT), op.join(root, fname)
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/MSRVTT-v2/frame_tsv/train_32frames.img.tsv'

When I run it in a local environment I get this error. Can you help me resolve it?

YoussefZiad commented 2 years ago

It looks like the default yaml file is using a different naming convention, which is why it's looking for the wrong filename. I faced a similar issue with the VATEX annotations.

Open the msrvtt_8frm_default.json file in src/config/VidSwinBert, find the "train_yaml" and "val_yaml" attributes, and remove the '_32frames' suffix from the filenames (so they become train.yaml and val.yaml). The loader should find the correct files afterwards.
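If you prefer to script the change, a small one-off helper along these lines should work (the config path is the one mentioned above; adjust it to your checkout):

```python
# One-off helper to rename the yaml entries in the default config.
import json

cfg_path = "src/config/VidSwinBert/msrvtt_8frm_default.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Drop the '_32frames' suffix so the loader looks for train.yaml / val.yaml instead.
for key in ("train_yaml", "val_yaml"):
    if key in cfg:
        cfg[key] = cfg[key].replace("_32frames", "")

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```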

tiesanguaixia commented 1 year ago

> For anyone who may trip up on this in the future: what worked for me was reducing the learning rate in the training command. For my own custom dataset I set the learning rate to 0.00003 instead of 0.0003, and training ran without problems.

Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the value of 'loss_sparse_w' in the command? I guess 'loss_sparse_w' is the regularization hyperparameter of $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$, so why is the default value of 'loss_sparse_w' in the command 0.5? Do I need to adjust it to 5? Thank you a lot!
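For what it's worth, the training logs above report loss and loss_sparsity as separate terms, which suggests the weight enters as a simple linear combination. A sketch under that assumption (the function and argument names are illustrative, not SwinBERT's actual code):

```python
# Sketch of how a sparsity weight typically enters the objective, assuming the
# linear combination implied by the separate "loss" / "loss_sparsity" fields in
# the logs above. loss_sparse_w plays the role of the lambda discussed here.
def total_loss(mlm_loss, sparsity_loss, loss_sparse_w=0.5):
    return mlm_loss + loss_sparse_w * sparsity_loss
```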

YoussefZiad commented 1 year ago

> Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the value of 'loss_sparse_w' in the command? I guess 'loss_sparse_w' is the regularization hyperparameter of $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$, so why is the default value of 'loss_sparse_w' in the command 0.5? Do I need to adjust it to 5? Thank you a lot!

Hello! I used the model on a custom dataset, so I haven't reproduced the paper's results myself. I used the default value for the sparse loss (0.5) in my case, but I'm honestly not sure what the optimal value would be.

(Also, I just noticed your previous comments, sorry about that 😅. I don't know if you solved those issues, but I didn't set up a conda environment myself, so I don't think I can be much help with that.)

tiesanguaixia commented 1 year ago

> Hello! I used the model on a custom dataset, so I haven't reproduced the paper's results myself. I used the default value for the sparse loss (0.5) in my case, but I'm honestly not sure what the optimal value would be.
>
> (Also, I just noticed your previous comments, sorry about that 😅. I don't know if you solved those issues, but I didn't set up a conda environment myself, so I don't think I can be much help with that.)

Hi! Thank you for your reply! I understand your setup now.

There is no need to apologize 😃, and I have set up a conda environment. But may I ask how you set it up locally without conda? 😮

From the checkpoint released by the author, I can see that the learning rate in his runs really was 0.0003, so it seems the command itself was not a mistake. But why do many of us have to adjust it to 0.00003, and how did you find this number? Do you have any idea about this? 😳

Looking forward to your reply! Thank you a lot!

YoussefZiad commented 1 year ago

> Hi! Thank you for your reply! I understand your setup now.
>
> There is no need to apologize 😃, and I have set up a conda environment. But may I ask how you set it up locally without conda? 😮
>
> From the checkpoint released by the author, I can see that the learning rate in his runs really was 0.0003, so it seems the command itself was not a mistake. But why do many of us have to adjust it to 0.00003, and how did you find this number? Do you have any idea about this? 😳
>
> Looking forward to your reply! Thank you a lot!

Hi! As for my environment, I just use pip to install my packages (that's the way I'm used to doing it; I never really tried conda before :p).

For the learning rate, I'm not sure why the author's 0.0003 worked for them (maybe other hyperparameters were adjusted?), but in our case I found this learning rate basically by trial and error. My thinking was that the model's accuracy dropped after a few epochs because the learning rate (which sets the size of the steps the model takes while searching for a good solution) was too big, so the model would take very large steps in some direction and be thrown off course. So I kept decreasing the learning rate until I found a value at which the model takes reasonably sized steps and can reach a good solution.

That's about it, hope I was able to explain it well :p
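To make the step-size intuition concrete, here is a tiny standalone example (nothing SwinBERT-specific): gradient descent on a simple quadratic diverges when the learning rate is 10× too large and converges once it is reduced:

```python
# Toy illustration of the step-size argument above -- nothing SwinBERT-specific.
# Gradient descent on f(x) = 5 * x^2, whose gradient is 10 * x.
def descend(lr, steps=10, x=1.0):
    for _ in range(steps):
        x -= lr * 10 * x  # x <- x - lr * f'(x)
    return x

print(descend(0.3))   # update factor 1 - 3.0 = -2: |x| doubles every step, diverges (~1024)
print(descend(0.03))  # update factor 0.7: x shrinks toward the minimum (~0.028)
```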

ahaahaahaahaa commented 1 year ago

> Hi! As for my environment, I just use pip to install my packages (that's the way I'm used to doing it; I never really tried conda before :p).
>
> For the learning rate, I'm not sure why the author's 0.0003 worked for them (maybe other hyperparameters were adjusted?), but in our case I found this learning rate basically by trial and error. My thinking was that the model's accuracy dropped after a few epochs because the learning rate was too big, so the model would take very large steps in some direction and be thrown off course. So I kept decreasing the learning rate until I found a value at which the model takes reasonably sized steps and can reach a good solution.
>
> That's about it, hope I was able to explain it well :p

Hi! I noticed that you set up an environment without Docker. Could you share the packages you used and the environment settings, such as the Python and torch versions?

Looking forward to your reply!!!

ahaahaahaahaa commented 1 year ago

> Hi! Thank you for your reply! I understand your setup now.
>
> There is no need to apologize 😃, and I have set up a conda environment. But may I ask how you set it up locally without conda? 😮
>
> From the checkpoint released by the author, I can see that the learning rate in his runs really was 0.0003, so it seems the command itself was not a mistake. But why do many of us have to adjust it to 0.00003, and how did you find this number? Do you have any idea about this? 😳
>
> Looking forward to your reply! Thank you a lot!

Hi! I noticed that you set up a conda environment instead of Docker. Could you share the packages you used and the environment settings, such as the Python and torch versions?

Looking forward to your reply!!!

tiesanguaixia commented 1 year ago

> Hi! I noticed that you set up a conda environment instead of Docker. Could you share the packages you used and the environment settings, such as the Python and torch versions?
>
> Looking forward to your reply!!!

Hi, I generated the requirements file from the Docker image and installed some additional packages whenever I ran into bugs.

tiesanguaixia commented 1 year ago

> It looks like the default yaml file is using a different naming convention, which is why it's looking for the wrong filename. I faced a similar issue with the VATEX annotations.
>
> Open the msrvtt_8frm_default.json file in src/config/VidSwinBert, find the "train_yaml" and "val_yaml" attributes, and remove the '_32frames' suffix from the filenames (so they become train.yaml and val.yaml). The loader should find the correct files afterwards.

Hi! Sorry to bother you, but may I ask how to download the raw videos of VATEX?

Alwen233 commented 1 year ago

I ran into the same problem: training on the MSVD dataset gives very low evaluation scores, and the captions predicted by my own trained model are also very poor. Have you managed to solve it?