google-deepmind / distribution_shift_framework

This repository contains the code for the distribution shift framework presented in A Fine-Grained Analysis on Distribution Shift (Wiles et al., 2022).

Clarification on the meaning of Figure 3 and Figure 4 #3

Closed: nalzok closed this issue 2 years ago

nalzok commented 2 years ago

Hi there, I would like to confirm the meaning of the $N$ axis in the figures below.

[Screenshot of Figures 3 and 4 from the paper]

In Appendix D.2, you wrote:

Under spurious correlation, we correlate $y^l, y^a$. At test time, these attributes are uncorrelated. We vary the amount of correlation by creating a new dataset with all samples from the correlated distribution in the dataset and $N$ samples from the uncorrelated distribution; this forms the training set. We set $N \geq 1$ (as if $N = 0$, then the problem is ill-defined as to what is the correct label). The test set is composed of samples from the uncorrelated distribution and is disjoint from the training samples.

Could you clarify the size of the "new dataset with all samples from the correlated distribution in the dataset"? Denote the size of that dataset by $M$. Does this mean the training set contains a total of $M + N$ images?

Under low-data drift, we consider the set $\mathbb{A}^{a}$. For some subset $\mathbb{A}^{a}_c \subset \mathbb{A}^{a}$, we only see $N$ samples of those attributes. For all other values of $\mathbb{A}^{a}$ ($\mathbb{A}^{a} \setminus \mathbb{A}^{a}_c$), the model has access to all samples.

It would be helpful to know the (relative) cardinalities of $\mathbb{A}^{a}_c$ and $\mathbb{A}^{a}$, or in other words, the number of samples whose attributes belong to $\mathbb{A}^{a}$. Knowing the exact value of $N$ alone is not enough; we also need to know how large $N$ is relative to the rest of the training data to understand how unbalanced the training set is. In particular, the six datasets contain different numbers of samples, but you appear to use the same $N$ for all of them. Am I correct?
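To make sure I am reading this correctly, here is a minimal NumPy sketch of how I picture the low-data construction (the helper name and the toy sizes are mine, not from this repo):

```python
import numpy as np

def make_low_data_split(images, attrs, rare_attrs, n, seed=0):
    """Keep all samples for common attribute values, but only n per rare value.

    Here rare_attrs plays the role of the subset A^a_c from the appendix.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for a in np.unique(attrs):
        idx = np.flatnonzero(attrs == a)
        if a in rare_attrs:
            # Rare attribute: subsample down to at most n examples.
            idx = rng.choice(idx, size=min(n, len(idx)), replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return images[keep], attrs[keep]

# Toy usage: 3 attribute values with 100 samples each; attribute 2 is rare.
images = np.zeros((300, 8, 8))
attrs = np.repeat([0, 1, 2], 100)
x, a = make_low_data_split(images, attrs, rare_attrs={2}, n=5)
print({int(v): int((a == v).sum()) for v in np.unique(a)})  # {0: 100, 1: 100, 2: 5}
```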

As a separate question, could you share some intuition on why heuristic augmentations could lead to degraded performance? I am surprised that they lower the test accuracy by up to ~30% under spurious correlation and up to ~15% under low-data drift. I am under the impression that while they may not help generalization when they don't approximate part of the true underlying generative model, they typically don't hurt it either.

Thanks in advance!

oawiles commented 2 years ago

Hi! Thanks for your interest. Yes, you are right: the size of the training set is the size of the correlated distribution ($M$) plus the $N$ samples from the uncorrelated distribution.
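Schematically, the construction is something like the following minimal sketch (the function and variable names are made up here, not taken from the actual pipeline code):

```python
import numpy as np

def make_spurious_train_test(correlated, uncorrelated, n, seed=0):
    """Train on all M correlated samples plus n uncorrelated ones; test on the rest."""
    assert n >= 1, "with n = 0 the correct label would be ill-defined"
    rng = np.random.default_rng(seed)
    # Pick n uncorrelated samples to mix into the training set.
    extra = rng.choice(len(uncorrelated), size=n, replace=False)
    train = np.concatenate([correlated, uncorrelated[extra]])
    # The test set is the rest of the uncorrelated data, disjoint from training.
    mask = np.ones(len(uncorrelated), dtype=bool)
    mask[extra] = False
    test = uncorrelated[mask]
    return train, test

# Toy usage: M = 1000 correlated samples, n = 10 uncorrelated samples.
train, test = make_spurious_train_test(np.zeros((1000, 4)), np.ones((5000, 4)), n=10)
print(len(train), len(test))  # 1010 4990
```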

In particular, the six datasets contain different numbers of samples, but you appear to use the same $N$ for all of them. Am I correct?

Yes, you are right, they are of different sizes. The subsets of the original datasets are given in the supplementary material (Table 2). From there you can deduce the sizes of the biased / unbiased subsets.

As a separate question, could you share some intuitions on why heuristic augmentations could lead to degraded performance?

My hypothesis is that the model wastes capacity and features on learning these unnecessary augmentations, which then hurts performance later. Also, an augmentation could potentially change a label (e.g. color jitter would change a color label), which could hurt performance as well.
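As a toy illustration of that second point (an entirely hypothetical setup, not code from this repo): if the label is defined by the dominant color of the image, a single color-jitter draw can change which channel dominates, so the augmented image no longer matches its label.

```python
import numpy as np

# Hypothetical color-classification task: the label *is* the dominant channel
# (0 = red, 1 = green, 2 = blue). None of these names come from the framework.
def color_label(img):
    return int(np.argmax(img.mean(axis=(0, 1))))

red_img = np.zeros((8, 8, 3))
red_img[..., 0] = 0.5   # red channel dominates -> label 0
red_img[..., 1] = 0.4

# A crude stand-in for one color-jitter draw: rescale each channel.
jitter = np.array([0.3, 1.5, 1.0])
jittered = red_img * jitter

print(color_label(red_img))   # 0 (red)
print(color_label(jittered))  # 1 (green): the augmentation changed the label
```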

I hope those answers help!

nalzok commented 2 years ago

I see. Thanks a lot!