Hi! Thanks for your interest. Yes, you are right that the size of the training set is the size of the correlated distribution ($M$) plus $N$ samples from the uncorrelated distribution.
> In particular, the six datasets contain different numbers of samples, but you appear to use the same $N$ for all of them. Am I correct?
Yes, you are right. They are of different sizes. The subsets of the original datasets are given in the supplementary material (Table 2). From there you can deduce the sizes of the biased / unbiased subsets.
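For concreteness, here is a minimal sketch of how such a training set could be assembled (the index pools and their sizes are hypothetical placeholders, not the paper's actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical index pools for the two distributions; in practice these come
# from the per-dataset splits described in Table 2 of the supplementary.
correlated_idx = np.arange(50_000)            # all samples from the correlated distribution
uncorrelated_idx = np.arange(50_000, 60_000)  # pool of uncorrelated samples

M = len(correlated_idx)   # every correlated sample is always included
N = 100                   # number of uncorrelated samples added on top

# Training set = all M correlated samples + N uncorrelated ones (M + N total).
train_idx = np.concatenate([
    correlated_idx,
    rng.choice(uncorrelated_idx, size=N, replace=False),
])
assert len(train_idx) == M + N
```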
> As a separate question, could you share some intuitions on why heuristic augmentations could lead to degraded performance?
My hypothesis is that the model wastes capacity and features learning invariance to these unnecessary augmentations, which then hurts performance later. Also, an augmentation could potentially change a label (e.g., color jitter would change a color label), which could also hurt performance.
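To illustrate the label-changing failure mode, here is a toy sketch using torchvision's ColorJitter on a task where the label is the image's color (the setup is hypothetical, not the paper's code):

```python
import torch
from torchvision.transforms import ColorJitter

# Toy sample: a solid red image whose label is its dominant color.
img = torch.zeros(3, 32, 32)
img[0] = 1.0          # pure red in RGB
label = "red"

# Strong hue jitter can rotate red toward green or blue, so the augmented
# pixels no longer match the "red" label: the (image, label) pair is corrupted.
jitter = ColorJitter(hue=0.5)  # hue shift drawn uniformly from [-0.5, 0.5]
aug = jitter(img)

dominant = ["red", "green", "blue"][aug.mean(dim=(1, 2)).argmax().item()]
print(label, "->", dominant)   # may print e.g. "red -> green"
```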
I hope those answers help!
I see. Thanks a lot!
Hi there, I would like to confirm the meaning of the $N$ axis in the figures below.
In Appendix D.2, you wrote:
Could you clarify the size of the "new dataset with all samples from the correlated distribution in the dataset"? Denoting the size of that dataset by $M$, does this mean the training set has a total of $M + N$ images?
It would be helpful to learn the (relative) cardinalities of $\mathbb{A}^{a}_c$ and $\mathbb{A}^{a}$, or in other words, the number of samples whose attributes belong to $\mathbb{A}^{a}$. Knowing the exact value of $N$ alone is not enough; we also need the size of $N$ relative to $M$ to understand how unbalanced the training set is. In particular, the six datasets contain different numbers of samples, but you appear to use the same $N$ for all of them. Am I correct?
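To make this concrete, here is the kind of back-of-the-envelope calculation I mean (the values of $M$ are purely illustrative):

```python
# Purely illustrative values of M; the actual sizes differ per dataset.
N = 100
for M in (50_000, 5_000):
    frac = N / (M + N)
    print(f"M = {M:>6}: uncorrelated fraction = {frac:.2%}")
# M =  50000: uncorrelated fraction = 0.20%
# M =   5000: uncorrelated fraction = 1.96%
```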
As a separate question, could you share some intuitions on why heuristic augmentations could lead to degraded performance? I am surprised that heuristic augmentations lower the test accuracy by up to ~30% under spurious correlation and up to ~15% under low-data drift. I am under the impression that while they may not help generalization when they don't approximate part of the true underlying generative model, they typically don't hurt generalization either.
Thanks in advance!