gqa-ood / GQA-OOD

GQA-OOD is a new dataset and benchmark for the evaluation of VQA models in OOD (out of distribution) settings.

About the imbalanced groups discarding #2

Open PhoebusSi opened 2 years ago

PhoebusSi commented 2 years ago

In the paper, you say that "we keep groups with a normalized entropy smaller than a threshold empirically set to T =0.9." However, in your code, why is imbalanced_threshold set to "mean - std"?

gqa-ood commented 2 years ago

Dear @PhoebusSi,

Thank you for your interest in our benchmark.

The purpose of this threshold is to discard question groups which are not imbalanced enough. And you are right, the threshold used in the code is "mean - std". This should normally be very close to 0.9. As it has been set empirically, in the paper we only give its value T=0.9.
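For readers who want to see what a "mean - std" threshold looks like in practice, here is a minimal sketch. It is not the repository's code: the function name `mean_minus_std_threshold`, the example values, and the choice of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def mean_minus_std_threshold(entropies):
    """Return mean - std of the per-group normalized entropies."""
    # Assumption: population standard deviation (ddof = 0); the original
    # code may use a different convention.
    return statistics.fmean(entropies) - statistics.pstdev(entropies)


# Hypothetical normalized entropies for four question groups.
example_entropies = [1.0, 0.8, 0.9, 0.5]
threshold = mean_minus_std_threshold(example_entropies)
print(f"data-driven threshold: {threshold:.3f}")
```

A threshold computed this way moves with the entropy distribution of the dataset, so it lands near 0.9 only when most groups have a normalized entropy close to 1.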

BR, Corentin K.

PhoebusSi commented 2 years ago

Dear Corentin K, Thank you for your reply. I am still confused about the "empirical setting of T". When I apply this OOD dataset-building scheme to datasets for other tasks, should I keep the imbalanced threshold at T = 0.9 or calculate T as 'mean-std'? If the latter, why is 'mean-std' a suitable criterion for deciding whether a group is not imbalanced enough?

We would appreciate it if you could look into this issue. Thank you very much!

Phoebus Si.

gqa-ood commented 2 years ago

Hi Phoebus,

Sorry for my late answer...

I would say that the value of the threshold T depends on the dataset you are using. I recommend that you test several values and keep the one that works best for you.

As a reminder, the normalized entropy that we are using tells us how imbalanced the groups are. If the normalized entropy is close to 1, the group's answer distribution is close to uniform. Conversely, if the value is small, the group is highly imbalanced. Thus, the goal of the threshold T is to discard the groups which are not imbalanced enough: only groups with a normalized entropy smaller than T are kept. If T is too small, you will discard interesting imbalanced groups. On the contrary, if T is too high, you will keep groups with a near-uniform distribution, where it is not possible to do the head/tail split.
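To make the filtering rule concrete, here is a minimal sketch of a normalized entropy and the corresponding keep/discard test. It assumes the entropy is normalized by the log of the number of distinct answers in the group, and that a single-answer group gets the value 0; the function names `normalized_entropy` and `keep_group` and the toy answer lists are hypothetical and not taken from the repository.

```python
import math
from collections import Counter


def normalized_entropy(answers):
    """Shannon entropy of the answer distribution, scaled to [0, 1]."""
    counts = Counter(answers)
    num_answers = len(counts)
    if num_answers < 2:
        # Assumption: a group with a single answer is treated as
        # maximally imbalanced.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Assumption: normalize by log(K), with K the number of distinct answers.
    return entropy / math.log(num_answers)


def keep_group(answers, threshold=0.9):
    """Keep a group only if it is imbalanced enough (entropy below T)."""
    return normalized_entropy(answers) < threshold


# Hypothetical groups: one perfectly balanced, one strongly skewed.
balanced = ["yes", "no"] * 5
skewed = ["yes"] * 9 + ["no"]
print(keep_group(balanced), keep_group(skewed))
```

With this normalization, the balanced group has a normalized entropy of 1 and is discarded, while the skewed group falls well below 0.9 and is kept.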

The problem is that the question groups in GQA are very diverse. Some question groups admit only 2 answers (e.g. yes/no) while others contain more than 10 (e.g. 'what color' questions). Therefore, even with a normalized entropy, it is still hard to find a single good value for T.

As a consequence, we found that the best solution was an empirical search: we tried several values and found that T=0.9 worked best.
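Such a search can be sketched as a simple sweep over candidate thresholds. The snippet below is only an illustration: the group names, the entropy values, and the helper names `kept_groups` and `sweep_thresholds` are hypothetical, and the criterion for "best" (for example, inspecting which groups survive and whether a head/tail split still makes sense) is left to you.

```python
def kept_groups(entropies, threshold):
    """Names of the groups whose normalized entropy is below the threshold."""
    return sorted(name for name, h in entropies.items() if h < threshold)


def sweep_thresholds(entropies, candidates):
    """Map each candidate threshold T to the groups it would keep."""
    return {t: kept_groups(entropies, t) for t in candidates}


# Hypothetical per-group normalized entropies (not GQA statistics).
group_entropies = {"color": 0.47, "yes_no": 0.95, "material": 1.0, "shape": 0.85}

results = sweep_thresholds(group_entropies, [0.8, 0.9, 0.99])
for t, names in results.items():
    print(f"T={t}: kept {len(names)} group(s): {names}")
```

Inspecting the kept groups at each T shows the trade-off directly: lower values drop moderately imbalanced groups, higher values let nearly uniform ones through.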

TL;DR: "Should I continue to set the imbalanced threshold T = 0.9 or calculate T by 'mean-std'?" Forget 'mean-std'. Try various values of T (T=0.9 should be a good starting point) and keep the one that works best for your data. If you find a good rule for setting T, don't hesitate to share it!

I hope that I have answered your question :)

Corentin K.