In addition, I wonder whether you split the dataset into binary and open-ended questions and train two models separately? Thanks!
Hi, thanks for reaching out. We used the parameters that are in this repo for each of the baselines. We did not split the dataset into binary and open-ended questions, but instead trained them all together.
Are you using the version of the dataset on the website with the description: "This version of the benchmark is showcased in the CVPR 2021 paper."? We have released an updated version of the dataset since then that may not exactly reflect the results in the CVPR paper.
All the hyperparameters should be identical to those in the original model papers, with the exception of HCRN's weight decay and dropout, which we added to reduce overfitting. Our hyperparameters were as follows:
- HCRN
- HME
- PSAC
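For illustration, here is a minimal PyTorch-style sketch of where those two additions (dropout and weight decay) plug in. The values below are placeholders rather than the exact settings listed above, and the classifier is just a stand-in for an HCRN-style prediction head:

```python
import torch
import torch.nn as nn

# Placeholder values -- substitute the dropout rate and weight decay
# from the hyperparameter settings above.
DROPOUT_P = 0.15
WEIGHT_DECAY = 1e-3
LEARNING_RATE = 1e-4

# Stand-in for an HCRN-style prediction head; dropout regularizes the
# fully connected layers.
classifier = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=DROPOUT_P),
    nn.Linear(512, 1000),  # 1000 = hypothetical answer-vocabulary size
)

# Weight decay is applied through the optimizer.
optimizer = torch.optim.Adam(
    classifier.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)
```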
@madeleinegrunde Thanks for your response; I will try to train the model again. In addition, I have other questions about the versions of AGQA. The AGQA website lists three versions of the dataset.
What is the difference between these versions? Would you please release the baseline results on the other versions of the dataset, including the unbalanced dataset (both the large and small versions)? Thanks in advance.
In addition, I wonder whether choose and duration comparison questions should be treated as binary problems. The answers to these questions are not yes / no or before / after, but rather a specific object or relationship that needs to be extracted from the question text. By contrast, the answers to the other binary questions are all yes / no or before / after.
Hi, thank you for your questions.
The differences between the benchmark versions are as follows:
We did not release baseline results on the version with programs and scene graphs because the answer distributions are the same. However, if you are getting differing results, I can look into that version of the benchmark. We will not be releasing baseline results on the unbalanced dataset, as we do not expect it to be used to train models; it is intended instead as a thorough database of questions. Additionally, the large unbalanced dataset would be unwieldy to train on because of its size.
As for the decision to count choose and duration comparison questions as binary problems, we debated what their categorization should be, but in the end decided to make them binary because the answer is often in the question. Therefore, it did not seem right to count them as open-answer questions when the question text offers guidance toward the answer. In our results, models did rely heavily on the question text in their answers, indicating that the presence of two answer options in the question does affect performance.
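To illustrate the "answer is in the question" point, here is a small, hypothetical check. The example questions and the substring test are purely illustrative and are not part of the actual AGQA annotation schema:

```python
# Hypothetical illustration: for choose / duration comparison questions,
# the candidate answers appear in the question text itself, unlike
# genuinely open-ended questions.
def answer_in_question(question: str, answer: str) -> bool:
    """Rough check: does the answer string literally appear in the question?"""
    return answer.lower() in question.lower()

examples = [
    # choose-style: both candidate answers appear in the question
    ("Did they hold a dish or a blanket before sitting down?", "dish"),
    # open-ended: the answer is not contained in the question text
    ("What did they hold before sitting down?", "dish"),
]

for question, answer in examples:
    print(answer_in_question(question, answer), "-", question)
```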
@madeleinegrunde Thanks for your responses; they resolve my confusion! Also, a kind reminder that the download URL for the AGQA Benchmark with programs and scene graphs on the AGQA website links to the old version on Google Drive.
Thank you for the reminder! I will fix that shortly.
Hi @madeleinegrunde, I have another question. Does the version of GloVe affect the results of the baselines? Which version do these baselines adopt?
We adopted the same GloVe version as specified in each model's original repo. We have not tested whether the version of GloVe affects the results of the baselines.
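For anyone checking sensitivity to the embedding version, one way to load a specific pretrained GloVe variant is through gensim's downloader. The variant name below is only an example; the version actually used should be taken from each baseline's original repo:

```python
import gensim.downloader as api

# Example GloVe variant -- swap in whichever version the original
# HCRN / HME / PSAC repos specify.
glove = api.load("glove-wiki-gigaword-300")  # returns a KeyedVectors object

vector = glove["video"]  # 300-dimensional embedding for the token "video"
print(vector.shape)      # (300,)
```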
@madeleinegrunde Thanks for your response; I have reproduced results similar to those in the paper. I find that a weight decay of 1e-3 is too large for HCRN, so we use 1e-5 instead.
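In optimizer terms, the change amounts to the following (a sketch assuming an Adam optimizer; the model and learning rate here are placeholders):

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the HCRN model.
model = nn.Linear(512, 1000)

# A weight decay of 1e-3 proved too large for HCRN in these runs;
# 1e-5 worked better. (The learning rate is just a placeholder.)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```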
Thanks for the discussion, and thanks @madeleinegrunde for creating such a dataset! I was wondering, for all of the models, how long (epochs or iterations) did they have to run before their highest validation score was reached?
We ran all of the models until they converged, then tested on the model with the highest validation score. I do not remember exactly how many iterations each ran, but HME does take a long time (#4 and #5 also ran into this problem).
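For reference, a minimal sketch of that model-selection strategy (train until the validation score stops improving, then keep the checkpoint with the best score). The training and evaluation functions here are placeholders for each baseline's own loops:

```python
import copy

def select_best_model(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Train until validation stops improving, then return the checkpoint
    with the highest validation score.

    train_one_epoch(model) and evaluate(model) -> float are placeholders
    for the baseline-specific training / validation loops.
    """
    best_score = float("-inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # treat the model as converged

    model.load_state_dict(best_state)
    return model, best_score
```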
Would you please provide the detailed hyperparameters (learning rate, batch size, momentum, and weight decay) for each baseline? Thanks in advance.