Among other concepts from reviewing existing work:
Straightforward short-term efforts that also don't appear to exist in open-source code:
Per discussion, for Dirichlet, tweak-one, and minority shift, the general pattern is:
Does the initial train/validation/test split need to be stratified, then? Not stratifying it (i.e. the split data may have a totally random proportion of the class values) seems to interfere with the second step.
I'm pretty sure I should be stratifying before doing the class manipulation, but just checking with @indialindsay or @tms-bananaquit.
The particular experiments we're looking at might be unrealistic in that sense. The training set is presumed to be uniformly distributed over the classes - since any split is (practically) coming from the same data, the distribution after stratification "should" also be uniform in each set. Then it gets perturbed (resampled) using one of the schemes above.
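For concreteness, a minimal sketch of the kind of two-stage stratified split described above, assuming scikit-learn's `train_test_split` is available (the 80/10/10 proportions and the function name are just placeholders, not a decision):

```python
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, val_frac=0.1, test_frac=0.1, seed=0):
    # First carve off the test set, stratifying on the labels.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=seed
    )
    # Then split the remainder into train/validation, again stratified.
    rel_val = val_frac / (1 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```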
It's a good point that the split data might have some randomly terrible empirical distribution, though. Practically I think we need to have enough samples in each (rule of thumb 10k) that they are approximately uniform, matching the original distribution. To be really picky, a hypothesis test with `scipy.stats.chisquare` could confirm this after splitting. Or just spit out a histogram.
That implies the dataset we're sampling from should at least be large enough to stratify into 10k/10k/80k or so without doing resampling on the front end, though.
Starting from legitimately uniform test data might be a problem in itself. We can just synthesize something for the purposes of testing, but past that we should probably go hunting for an appropriate set.
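A rough sketch of the chi-square confirmation mentioned above, using `scipy.stats.chisquare` to compare a split's class counts against the full data's proportions (the function name and the 0.05 threshold are just illustrative):

```python
import numpy as np
from scipy.stats import chisquare

def check_split_distribution(y_split, y_full, alpha=0.05):
    # Observed class counts in the split, in the same class order as the full data.
    classes, full_counts = np.unique(y_full, return_counts=True)
    split_counts = np.array([(y_split == c).sum() for c in classes])
    # Expected counts: the full-data proportions scaled to the split size.
    expected = full_counts / full_counts.sum() * split_counts.sum()
    stat, p_value = chisquare(f_obs=split_counts, f_exp=expected)
    # A large p-value means no evidence the split deviates from the original distribution.
    return p_value, p_value >= alpha
```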
Revisiting this, I’m recognizing that these methods were intended to introduce drift into data in the context of training a model. They introduce drift by adjusting the class proportions so that a certain class is less prevalent in the data used to train and validate the model, but then appears with greater proportion in the test set. The purpose seems to be identifying how a model is affected if a minority class switches to a majority class in incoming data; the class proportions are switched using these three methods.
For our purposes, we want to be able to inject synthetic drift into any dataset, regardless of whether it is used with a model or not. We could probably skip the first two steps (1. Train/test/validate split and 2. Sample from those to create smaller subsets). Rather, at a specified index, we could use these three methods to reduce (or increase) the prevalence of a user-specified label.
If we set it up this way, then a user could technically use this to change the prevalence of a class in just the test dataset and use it in the paper’s intended manner. Thoughts on this @Anmol-Srivastava @tms-bananaquit ?
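Roughly what this could look like, as a hedged sketch only; the function name, signature, and resampling strategy here are all hypothetical, not a settled API:

```python
import numpy as np

def drifted_indices(y, start_index, target_label, factor, seed=0):
    """Return row indices where, from `start_index` onward, the prevalence of
    `target_label` is scaled by `factor` via resampling (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    tail = np.asarray(y)[start_index:]
    target_idx = np.flatnonzero(tail == target_label)
    other_idx = np.flatnonzero(tail != target_label)
    # Keep the drifted segment the same length, but change how many rows carry the label.
    n_target = min(int(round(len(target_idx) * factor)), len(tail))
    new_target = rng.choice(target_idx, size=n_target, replace=True)
    new_other = rng.choice(other_idx, size=len(tail) - n_target, replace=True)
    tail_order = rng.permutation(np.concatenate([new_target, new_other]))
    # Indices into the original arrays: unchanged head, resampled tail.
    return np.concatenate([np.arange(start_index), start_index + tail_order])
```

The same returned indices could be applied to the features and labels together, so the approach works whether or not a model is involved.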
In the failing loudly setup for injecting drift, do you know if they stratify prior? And do they do the first two steps of dividing into train/test/validate splits and then sampling from those?
@indialindsay's approach makes a lot of sense to me, and I can switch to that kind of function, which is similar to those we've already written. I'll check about failingloudly's behavior re: stratification and see if that requires further thought.
Question about the Dirichlet method, given the basic algorithm:
This is the `numpy.random.dirichlet` utility. I believe `np.random.dirichlet(alpha=[class_1, ..., class_k])` provides the right shape, with `class_1`, etc., being the expected values for each label? Pinging @indialindsay for any thoughts.
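A quick way to sanity-check the call shape (this uses the newer `numpy.random.Generator` API; the legacy `np.random.dirichlet` takes the same `alpha` vector). The mean of many draws should land near `alpha / sum(alpha)`:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [0.1, 1.0, 10]                      # one concentration value per class
draws = rng.dirichlet(alpha, size=10_000)   # shape (10000, 3)
print(draws.mean(axis=0))                   # ~ alpha / sum(alpha) = [0.009, 0.090, 0.901]
```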
Yes, I believe alpha should be a vector. Looking over this paper that also implemented it, they set alpha = [0.1, 1.0, 10] and smaller alpha results in a larger shift.
Various papers differ in the value they set alpha to. I'm okay with letting the user specify their values. We could set the defaults to be [0.1, 1.0, 10]? Though if this is dependent on the number of classes in the data, that can be tricky for setting defaults.
It appears each value in that vector is the proportion of each class sampled. Since you can specify size K, I'm assuming this might get divided by K to make the proportion, matching equation #2? The np documentation is quite uninformative, so I'm not confident in this; I'll take a closer look at the source code.
In the meantime, it could be helpful to test this out on a dataset to see how it affects the proportions of each class?
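Something like the following toy experiment might do it; everything here (the balanced toy labels, the alpha values, the resampling strategy) is just illustrative. It also suggests each draw already sums to 1, so the drawn values can be used directly as class proportions, though that's worth confirming against the source:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 1_000)             # balanced toy labels, 3 classes

alpha = [0.1, 1.0, 10]
proportions = rng.dirichlet(alpha)          # one draw; components sum to 1
print("target proportions:", proportions)

# Resample each class (with replacement) to hit the drawn proportions.
new_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=int(p * len(y)), replace=True)
    for c, p in enumerate(proportions)
])
y_drifted = y[new_idx]
print("before:", np.bincount(y) / len(y))
print("after: ", np.bincount(y_drifted) / len(y_drifted))
```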