madeleinegrunde / AGQA_baselines_code

MIT License
18 stars · 4 forks

Training time consumption for HME #5

Closed Tangolin closed 2 years ago

Tangolin commented 3 years ago

First off, thank you for sharing such a comprehensive repo; everything is really organised and I appreciate the effort! I understand that the long training time has been raised in a different issue. However, when I was experimenting with HME, the training time was extremely long: I am running it on a GTX Titan X GPU and it has been training for more than a week, yet it has not even completed the first epoch. I am wondering if something is wrong in my configuration.

On my end I have 1000 epochs, and each epoch contains 79632 batches. Can I check whether that matches what you got on your end as well?

madeleinegrunde commented 3 years ago

Hi, thank you for reaching out. We encountered the same issue and used the same parameters. We looked at the validation scores every 100 iterations. In the directory HME-VideoQA/gif-qa/saved_models/FrameQA_concat_fc_mrm2s/ there are multiple models saved using the format rnn-[ITER]-[LOSS]-[ACCURACY].pkl. Once the loss and accuracy on the validation set were no longer improving, we tested the model.
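As a minimal sketch of that selection step: since each checkpoint filename encodes its iteration, loss, and validation accuracy in the rnn-[ITER]-[LOSS]-[ACCURACY].pkl format, you can pick the best one programmatically. The helper below is hypothetical (not part of the repo) and assumes the filenames follow that exact pattern.

```python
import re

# Matches the rnn-[ITER]-[LOSS]-[ACCURACY].pkl naming scheme described above.
CKPT_RE = re.compile(r"rnn-(\d+)-([\d.]+)-([\d.]+)\.pkl$")

def best_checkpoint(filenames):
    """Return the checkpoint filename with the highest validation accuracy."""
    best, best_acc = None, -1.0
    for name in filenames:
        m = CKPT_RE.search(name)
        if not m:
            continue  # skip files that don't match the naming pattern
        acc = float(m.group(3))  # third field is validation accuracy
        if acc > best_acc:
            best, best_acc = name, acc
    return best

# Example with made-up filenames: the second checkpoint wins on accuracy.
names = ["rnn-1000-2.31-0.42.pkl", "rnn-2000-1.95-0.48.pkl", "rnn-3000-2.10-0.45.pkl"]
print(best_checkpoint(names))  # -> rnn-2000-1.95-0.48.pkl
```

You would feed this the output of `os.listdir` on the saved_models directory; it simply automates the "stop when validation accuracy stops improving" check described above.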

Tangolin commented 3 years ago

I see. May I know your training time per epoch? I have left the model running for almost a week now and it has yet to finish a single epoch, and I am not sure whether that is simply due to the huge number of batches.

madeleinegrunde commented 3 years ago

That sounds consistent with our training time. Our model peaked in validation accuracy and began decreasing before the end of the first epoch, so we took the model with the highest validation accuracy to minimize overfitting. Of your saved models so far, do you see a similar trend?

Tangolin commented 2 years ago

I see! I will go check it out. Thanks!