microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

Reproduced results inconsistent with the results reported in the paper #12

Closed: QwCao closed this issue 2 years ago

QwCao commented 2 years ago

Hi, when I reproduced the test procedure, I found that the results were inconsistent with the results reported in your paper; on the MSVD dataset especially, there was a large gap. When I use test.yaml as the `--val_yaml`, I get CIDEr: 109.4, which matches the result reported on the GitHub page (https://github.com/microsoft/SwinBERT). When I use test_32frames.yaml, I get CIDEr: 120.6, which also matches the GitHub page.

However, both results are far from the result reported in the paper (CIDEr: 149.4). Are there any parameter settings or tricks used during testing? And why is there such a big performance gap?
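
For concreteness, this is roughly what I ran; a sketch only, since everything besides `--val_yaml` (script path, other flags, checkpoint directory) is from memory and should be checked against the README:

```bash
# Rough sketch of the two test-split runs (flag names and paths assumed).

# test.yaml -> CIDEr 109.4
python src/tasks/run_caption_VidSwinBert.py \
    --do_train false --do_eval true \
    --val_yaml MSVD/test.yaml \
    --eval_model_dir ./models/msvd/

# test_32frames.yaml -> CIDEr 120.6
python src/tasks/run_caption_VidSwinBert.py \
    --do_train false --do_eval true \
    --val_yaml MSVD/test_32frames.yaml \
    --eval_model_dir ./models/msvd/
```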

kevinlin311tw commented 2 years ago

The 149.4 CIDEr (in the main text, Table 5) is obtained by using 64 frames as input. In addition, the ablation study is conducted on the MSVD val split, not the test split.

You could evaluate the model on the val split, and it should produce results close to 149 CIDEr.
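
Something like the following should work; a rough sketch, analogous to the test-split commands above, where the yaml filename and flags are assumptions rather than copies from the README:

```bash
# Hypothetical val-split evaluation (yaml name assumed); with 64-frame
# inputs this should land near the 149 CIDEr reported in Table 5.
python src/tasks/run_caption_VidSwinBert.py \
    --do_train false --do_eval true \
    --val_yaml MSVD/val.yaml \
    --eval_model_dir ./models/msvd/
```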

QwCao commented 2 years ago

> The 149.4 CIDEr (in the main text, Table 5) is obtained by using 64 frames as input. In addition, the ablation study is conducted on the MSVD val split, not the test split.
>
> You could evaluate the model on the val split, and it should produce results close to 149 CIDEr.

Ohhhh! Thank you! I got it!

And are the experimental results in Tables 2 and 3 all on the val set?

kevinlin311tw commented 2 years ago

Tables 2 and 3 are comparisons with previous studies, so we follow the splits used in those studies for comparison. Each dataset has a different split setting. See below.

- MSVD: people use the test split for comparison, so we report test split results in Table 2.
- MSRVTT: people use the test split for comparison; Table 2 shows our test split results.
- VATEX: people mainly use the public test split for comparison; results in Table 3 are from the public test split.
- TVC: no test split is available; the results are from the val split.
- YouCook2: people mainly use the val split for comparison; Table 3 shows the val split results.

QwCao commented 2 years ago

> Tables 2 and 3 are comparisons with previous studies, so we follow the splits used in those studies for comparison. Each dataset has a different split setting. See below.
>
> - MSVD: people use the test split for comparison, so we report test split results in Table 2.
> - MSRVTT: people use the test split for comparison; Table 2 shows our test split results.
> - VATEX: people mainly use the public test split for comparison; results in Table 3 are from the public test split.
> - TVC: no test split is available; the results are from the val split.
> - YouCook2: people mainly use the val split for comparison; Table 3 shows the val split results.

Thank you for your reply! But I noticed that you also report CIDEr: 120.6 in the README, on the MSVD test set. I wonder what the difference is between the settings behind these two results (CIDEr: 120.6 in the README and CIDEr: 149.4 in your paper)?

kevinlin311tw commented 2 years ago

In the README, we report results on all splits. Please check our model card in the README: 120.6 CIDEr on the test set, and 160 on the val set. This model corresponds to Table 6(b) in the main text.

- 149.4 val CIDEr (main text, Table 5): from the val split, based on 64-frame inputs. We conduct the ablation study on the val set.
- 120.6 test CIDEr & 160 val CIDEr (main text, Table 6(b)): with VATEX --> MSVD transfer, based on 32-frame inputs.

QwCao commented 2 years ago

> In the README, we report results on all splits. Please check our model card in the README: 120.6 CIDEr on the test set, and 160 on the val set. This model corresponds to Table 6(b) in the main text.
>
> - 149.4 val CIDEr (main text, Table 5): from the val split, based on 64-frame inputs. We conduct the ablation study on the val set.
> - 120.6 test CIDEr & 160 val CIDEr (main text, Table 6(b)): with VATEX --> MSVD transfer, based on 32-frame inputs.

Thank you for your detailed explanation! And I have another question. The MSVD result (CIDEr: 149.4) reported in Table 2, which you mentioned above, is obtained on the test set with 64-frame sampling. The corresponding result in Table 5 is the same number (CIDEr: 149.4), but that experiment is run on the val set.

However, as we can see, there is a performance drop between test and val, so I don't know whether there is a typo in the recorded results or something else causing this.

kevinlin311tw commented 2 years ago

I think you are referring to the early arXiv version, which indeed has typos. Please check the latest arXiv version with the correct numbers. Thanks.

QwCao commented 2 years ago

> I think you are referring to the early arXiv version, which indeed has typos. Please check the latest arXiv version with the correct numbers. Thanks.

OMG, okay! Thanks for the reply!

QwCao commented 2 years ago

> I think you are referring to the early arXiv version, which indeed has typos. Please check the latest arXiv version with the correct numbers. Thanks.

OMG, I have seen the updated version. The corrected MSVD results were exactly the part that had puzzled me for a long time.

QwCao commented 2 years ago

> I think you are referring to the early arXiv version, which indeed has typos. Please check the latest arXiv version with the correct numbers. Thanks.

Sorry for disturbing you again. Are the experiments in Tables 2 and 3 based on 64-frame sampling?

kevinlin311tw commented 2 years ago

Table 2 results are based on 32 frames. Table 3 results are based on 64 frames.

BTW, I believe the Table 2 results could be greatly improved with 64 frames, but more parameter tuning would need to be done. We didn't run a full grid search, due to limited time during the paper submission.
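
If you want to try 64 frames yourself, something like this sketch might work; note that `--max_num_frames` is an assumed flag name here (the frame count may instead be baked into the `*_32frames.yaml` configs), so verify against the script's argument parser:

```bash
# Hypothetical: rerun evaluation with 64-frame inputs instead of 32.
# --max_num_frames is an assumed flag name; check the argparse setup.
python src/tasks/run_caption_VidSwinBert.py \
    --do_train false --do_eval true \
    --val_yaml MSVD/val.yaml \
    --max_num_frames 64 \
    --eval_model_dir ./models/msvd/
```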

QwCao commented 2 years ago

Thanks for your detailed explanation!

QwCao commented 2 years ago

> Table 2 results are based on 32 frames. Table 3 results are based on 64 frames.
>
> BTW, I believe the Table 2 results could be greatly improved with 64 frames, but more parameter tuning would need to be done. We didn't run a full grid search, due to limited time during the paper submission.

Hi, it's me again. I have a new puzzle and hope you can help me solve it.

I evaluated the best checkpoint released on your GitHub, with 32-frame inputs. However, I only got CIDEr: 109 (consistent with the 32-frame result in the README), which is far lower than the CIDEr: 120.6 recorded in Table 2 of the paper.

And as you mentioned above, Table 2 reports experiments based on 32 frames. So I wonder what caused this performance gap?

kevinlin311tw commented 2 years ago

The two provided MSVD checkpoints are both based on 32-frame inputs. The one with 120.6 CIDEr additionally uses VATEX --> MSVD transfer.

kevinlin311tw commented 2 years ago

Closing the issue, as the main problem has been addressed by referencing the latest arXiv version. The additional questions are mostly about other training configurations, which can also be found in the provided config files.

QwCao commented 2 years ago

> The two provided MSVD checkpoints are both based on 32-frame inputs. The one with 120.6 CIDEr additionally uses VATEX --> MSVD transfer.

OK. But do the other experimental results in Tables 2 and 3 also use data transfer? If yes, please mark it out. Thanks.

tiesanguaixia commented 1 year ago

> The two provided MSVD checkpoints are both based on 32-frame inputs. The one with 120.6 CIDEr additionally uses VATEX --> MSVD transfer.
>
> OK. But do the other experimental results in Tables 2 and 3 also use data transfer? If yes, please mark it out. Thanks.

Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the values of `loss_sparse_w` and `learning_rate` in the command? For `loss_sparse_w`, I guess it is the regularization hyperparameter of $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$, so why is the default value of `loss_sparse_w` in the command 0.5? Do I need to adjust it to 5? Thank you a lot!
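
In case it helps, this is the kind of override I have in mind; a sketch only, where `loss_sparse_w` and `learning_rate` are the flag names from the command, and the script path, yaml names, and the other values are guesses to be checked against the provided config files:

```bash
# Hypothetical: set the sparse-attention loss weight (the paper's lambda)
# to 5 instead of the default 0.5, as the appendix suggests for MSR-VTT.
# Script path, yaml names, and the learning rate value are assumptions.
python src/tasks/run_caption_VidSwinBert.py \
    --do_train true \
    --train_yaml MSRVTT/train_32frames.yaml \
    --val_yaml MSRVTT/val_32frames.yaml \
    --loss_sparse_w 5 \
    --learning_rate 0.0003
```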