thaolmk54 / hcrn-videoqa

Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)
Apache License 2.0

Training on TGIF-QA / FrameQA #3

Closed antoyang closed 4 years ago

antoyang commented 4 years ago

Hi,

Thanks for your great work. I have no problem using the code for MSVD-QA, MSRVTT-QA, and the three other TGIF-QA tasks, but when I train on the FrameQA subtask of TGIF-QA, the loss quickly becomes NaN (after about 80% of the first epoch) and the accuracy is 0. Do you have an idea why this happens?
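For anyone debugging a similar symptom, a small framework-free guard can make the failure visible at the exact step the loss diverges, instead of silently training on NaN for the rest of the epoch. The function name and its placement in the training loop are hypothetical, just a sketch:

```python
import math

def check_loss(loss_value, step):
    """Fail fast if the loss has diverged (hypothetical helper, not from the repo).

    Call this on the scalar loss each step; it raises with the step number
    so you can correlate the blow-up with a specific batch.
    """
    if math.isnan(loss_value) or math.isinf(loss_value):
        raise RuntimeError(f"Loss became {loss_value} at step {step}")
    return loss_value

# Example: a finite loss passes through unchanged.
ok = check_loss(0.7312, step=1200)
```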

thaolmk54 commented 4 years ago

Hi,

Thank you for your interest in my work, and sorry for the issue you hit. My guess is that the gradients are exploding. I could not reproduce the issue on my machine, so I'm not sure if the following helps: can you try setting the max_norm in gradient clipping (nn.utils.clip_grad_norm_ in train.py) to a lower value, say 8 or even lower? Also, did you use my extracted features, or did you extract them yourself?
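For reference, what max_norm controls: `torch.nn.utils.clip_grad_norm_` rescales all gradients in place whenever their global L2 norm exceeds max_norm. A minimal framework-free sketch of that behavior (gradients as a plain list of floats, purely illustrative):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale `grads` in place so their global L2 norm is at most `max_norm`.

    Mirrors the core logic of torch.nn.utils.clip_grad_norm_: compute the
    total norm, and if it exceeds max_norm, multiply every gradient by
    max_norm / total_norm. Returns the (pre-clipping) total norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)  # epsilon avoids division by zero
        grads[:] = [g * scale for g in grads]
    return total_norm

grads = [3.0, 4.0]                      # global L2 norm = 5.0
norm = clip_grad_norm(grads, max_norm=1.0)
# norm is the pre-clipping value 5.0; grads now have norm ~1.0
```

Lowering max_norm therefore caps how large a single update can be, which can keep an unstable run from producing NaNs, at the cost of slower progress if set too low.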

antoyang commented 4 years ago

I tried lowering the gradient clipping max_norm (down to 3), but it keeps happening. I do use your extracted visual features and processed the linguistic features as indicated.

thaolmk54 commented 4 years ago

Hi,

Thank you for reporting the issue. It looks like I uploaded the wrong files, which caused the problem. I've just re-uploaded new files. Can you please download them and try again?

Thank you!

antoyang commented 4 years ago

Thanks for the quick fix, it is now indeed working :)

yaoshentao commented 4 years ago

Hi, I have another question: what accuracy did you get on the TGIF-QA dataset when training on your machine? The accuracy I get from training on my machine is quite different from the paper.

thaolmk54 commented 4 years ago

Hi,

The accuracy is pretty much the same as reported in the paper. There are some small differences (some better, some worse) between my public code and my local code, but the gaps should be small.

I would recommend extracting the visual features yourself using the commands in the README to reproduce the performance reported in the paper. If you run into any problems, please feel free to open an issue on GitHub.

Thanks.

yaoshentao commented 4 years ago

Okay, Thanks.

thaolmk54 commented 4 years ago

Hi,

I downloaded the code along with the provided extracted features and re-ran all experiments on TGIF-QA. Here are the results I got: Action: 75.8; Transition: 82.1; Count: 3.83; FrameQA: 55.8.

Cheers.

yaoshentao commented 4 years ago

Hi,

Thank you for your answer. I tested three times and the results were different each time, so this is likely my mistake. I will re-download all the files and re-train.