Cannot reproduce the results in the paper

windxrz commented 2 years ago

Would you please provide the detailed hyper parameters (learning rate, batch size, momentum, and weight decay) for each baselines? Thanks in advance.

windxrz commented 2 years ago

In addition, I wonder whether you split the dataset into binary and open-ended questions and train two models separately? Thanks!

madeleinegrunde commented 2 years ago

Hi, thanks for reaching out. We used the parameters that are in this repo for each of the baselines. We did not split the dataset into binary and open ended questions, but instead trained them all together.

Are you using the version of the dataset on the website with the description: "This version of the benchmark is showcased in the CVPR 2021 paper."? We have released an updated version of the dataset since then that may not exactly reflect the results in the CVPR paper.

All the hyperparameters should be identical to the original model papers, with the exception of HCRN weight decay and Dropout, which we added to reduce overfitting. Our hyperparameters were as follows:

HCRN

learning rate: 0.0001
batch size: 32
momentum: reduce lr by 0.5
weight decay: 1e-3 (added to reduce overfitting)
dropout: values in this repo's HCRN.py

HME

learning rate: 0.001
batch size: 32
momentum: 0.9
weight decay: None

PSAC

learning rate: 0.05
batch size: 32
momentum: 0.9
weight decay: None

windxrz commented 2 years ago

@madeleinegrunde Thanks for your response and I will try to train the model again. In addition, I have other questions about the version of AGQA. On the AGQA website, there are three versions, including the

AGQA Benchmark with programs & scene graphs
AGQA Benchmark
Data in google drive.

What is the difference between these versions? Would you please release the baseline results on other versions of the dataset? In addition, would you please release the baseline results on the unbalanced dataset (including both the large and small versions)? Thanks in advance.

windxrz commented 2 years ago

In addition, I wonder if choose and duration comparison questions should be treated as binary problems. The answers to these problems are not yes / no nor before / after but the specific object or relationship that needed to be extracted from the question texts. By contrast, the answers to other binary questions are all yes / no or before / after.

madeleinegrunde commented 2 years ago

Hi, thank you for your questions.

The differences between the benchmarks is as follows:

AGQA Benchmark with programs and scene graphs includes the additional program and scene graph data, but we needed to re-generate questions for this additional data. This is the most updated version of the dataset.
AGQA Benchmark is the version of the benchmark that has the baseline results published in CVPR 2021.
On google drive, the AGQA Benchmark folder holds the data in in version # 1 (with programs and scene graphs). The Previous versions folder holds a benchmark that we originally released with the programs and scene graphs, but it had some formatting bugs that we fixed in the current version (# 1).

We did not release the baseline results on the version with programs and scene graphs because the distributions of answers were the same. However, if you are getting differing results I can look into that version of the benchmark. We will not be releasing baseline results on the unbalanced dataset, as we do not expect it to be used to train models, but instead as a thorough database of questions. Additionally, the large unbalanced dataset would be unwieldy to train because it is so large.

As for the decision to count choose and duration comparison questions as binary problems, we debated what their categorization should be, but in the end decided to make them binary because the answer is often in the question. Therefore, it did not seem correct to count them as open answer questions when there is guidance in the answers. In our results, models did rely heavily on the question text in their answers, indicating that the presence of two answer options in the question does affect performance.

windxrz commented 2 years ago

@madeleinegrunde Thanks for your responses and they solve my confusion! And I kindly remind you that the download url of AGQA Benchmark with programs and scene graphs on the AGQA website is linked to the old version on google drive.

madeleinegrunde commented 2 years ago

Thank you for the reminder! I will fix that shortly.

windxrz commented 2 years ago

Hi @madeleinegrunde , I have another question. Does the version of GloVe affect the results of baselines? Which version does these baselines adopt?

madeleinegrunde commented 2 years ago

We adopted the same GloVe version as specified in each of the model's original repos. We have not tested if the version of GloVe affects the results of the baseline.

windxrz commented 2 years ago

@madeleinegrunde Thanks for your response and I have reproduced similar results with the paper. I find that a 1e-3 weight decay is too large for HCRN and we use a 1e-5 instead.

andlyu commented 2 years ago

Thanks for the discussion, and thanks @madeleinegrunde for creating such a dataset !! I was wondering for all of the models, how long (epochs or iterations) did they have to run before the their highest validation score was reached?

madeleinegrunde commented 2 years ago

We ran all of the models until they converged, then tested on the model with the highest validation score. I do not remember for exactly how many iterations each ran, but HME does take a long time (#4 and #5) also ran into this problem.

madeleinegrunde commented 2 years ago

We ran all of the models until they converged, then tested on the model with the highest validation score. I do not remember for exactly how many iterations each ran, but HME does take a long time (#4 and #5) also ran into this problem.

madeleinegrunde / AGQA_baselines_code

Cannot reproduce the results in the paper #8