Among other concepts from reviewing existing work:
Straightforward short-term efforts that also don't appear to exist in open-source code:
Per discussion, for Dirichlet, tweak-one, and minority shift, the general pattern is:
Does the initial train/validation/test split need to be stratified, then? Not stratifying it (i.e. the split data may have a totally random proportion of the class values) seems to interfere with the second step.
I'm pretty sure I should be stratifying before doing the class manipulation, but just checking with @indialindsay or @tms-bananaquit.
The particular experiments we're looking at might be unrealistic in that sense. The training set is presumed to be uniformly distributed over the classes - since any split is (practically) coming from the same data, the distribution after stratification "should" also be uniform in each set. Then it gets perturbed (resampled) using one of the schemes above.
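For concreteness, a minimal sketch of the kind of two-stage stratified split described above, assuming scikit-learn's `train_test_split` is available (the 80/10/10 proportions and the function name are just placeholders, not a decision):

```python
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, val_frac=0.1, test_frac=0.1, seed=0):
    # First carve off the test set, stratifying on the labels.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=seed
    )
    # Then split the remainder into train/validation, again stratified.
    rel_val = val_frac / (1 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```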
It's a good point that the split data might have some randomly terrible empirical distribution, though. Practically I think we need to have enough samples in each (rule of thumb 10k) that they are approximately uniform, matching the original distribution. To be really picky, a hypothesis test with `scipy.stats.chisquare` could confirm this after splitting. Or just spit out a histogram.
That implies the dataset we're sampling from should at least be large enough to stratify into 10k/10k/80k or so without doing resampling on the front end, though.
Starting from legitimately uniform test data might be a problem in itself. We can just synthesize something for the purposes of testing, but past that we should probably go hunting for an appropriate set.
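A rough sketch of the chi-square confirmation mentioned above, using `scipy.stats.chisquare` to compare a split's class counts against the full data's proportions (the function name and the 0.05 threshold are just illustrative):

```python
import numpy as np
from scipy.stats import chisquare

def check_split_distribution(y_split, y_full, alpha=0.05):
    # Observed class counts in the split, in the same class order as the full data.
    classes, full_counts = np.unique(y_full, return_counts=True)
    split_counts = np.array([(y_split == c).sum() for c in classes])
    # Expected counts: the full-data proportions scaled to the split size.
    expected = full_counts / full_counts.sum() * split_counts.sum()
    stat, p_value = chisquare(f_obs=split_counts, f_exp=expected)
    # A large p-value means no evidence the split deviates from the original distribution.
    return p_value, p_value >= alpha
```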
Revisiting this, I’m recognizing that these methods were intended to introduce drift into data in the context of training a model. They introduce drift by adjusting the class proportions so that a certain class is less prevalent in the data used to train and validate the model, but then appears with greater proportion in the test set. The purpose seems to be identifying how a model is affected if a minority class switches to a majority class in incoming data; the class proportions are switched using these three methods.
For our purposes, we want to be able to inject synthetic drift into any dataset, regardless of whether it is used with a model or not. We could probably skip the first two steps (1. Train/test/validate split and 2. Sample from those to create smaller subsets). Rather, at a specified index, we could use these three methods to reduce (or increase) the prevalence of a user-specified label.
If we set it up this way, then a user could technically use this to change the prevalence of a class in just the test dataset and use it in the paper’s intended manner. Thoughts on this @Anmol-Srivastava @tms-bananaquit ?
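Roughly what this could look like, as a hedged sketch only; the function name, signature, and resampling strategy here are all hypothetical, not a settled API:

```python
import numpy as np

def drifted_indices(y, start_index, target_label, factor, seed=0):
    """Return row indices where, from `start_index` onward, the prevalence of
    `target_label` is scaled by `factor` via resampling (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    tail = np.asarray(y)[start_index:]
    target_idx = np.flatnonzero(tail == target_label)
    other_idx = np.flatnonzero(tail != target_label)
    # Keep the drifted segment the same length, but change how many rows carry the label.
    n_target = min(int(round(len(target_idx) * factor)), len(tail))
    new_target = rng.choice(target_idx, size=n_target, replace=True)
    new_other = rng.choice(other_idx, size=len(tail) - n_target, replace=True)
    tail_order = rng.permutation(np.concatenate([new_target, new_other]))
    # Indices into the original arrays: unchanged head, resampled tail.
    return np.concatenate([np.arange(start_index), start_index + tail_order])
```

The same returned indices could be applied to the features and labels together, so the approach works whether or not a model is involved.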
In the failing loudly setup for injecting drift, do you know if they stratify prior? And do they do the first two steps of dividing into train/test/validate splits and then sampling from those?
@indialindsay's approach makes a lot of sense to me, and I can switch to that kind of function, which is similar to those we've already written. I'll check about failingloudly's behavior re: stratification and see if that requires further thought.
Question about the Dirichlet method, given the basic algorithm:
This is the `numpy.random.dirichlet` utility. I believe `np.random.dirichlet(alpha=[class_1, ..., class_k])` provides the right shape, with `class_1`, etc., being the expected values for each label? Pinging @indialindsay for any thoughts.
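A quick way to sanity-check the call shape (this uses the newer `numpy.random.Generator` API; the legacy `np.random.dirichlet` takes the same `alpha` vector). The mean of many draws should land near `alpha / sum(alpha)`:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [0.1, 1.0, 10]                      # one concentration value per class
draws = rng.dirichlet(alpha, size=10_000)   # shape (10000, 3)
print(draws.mean(axis=0))                   # ~ alpha / sum(alpha) = [0.009, 0.090, 0.901]
```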
Yes, I believe alpha should be a vector. Looking over this paper that also implemented it, they set alpha = [0.1, 1.0, 10] and smaller alpha results in a larger shift.
Various papers differ in the value they set alpha to. I'm okay with letting the user specify their values. We could set the defaults to be [0.1, 1.0, 10]? Though if this is dependent on the number of classes in the data, that can be tricky for setting defaults.
It appears each value in that vector is the proportion of each class sampled. Since you can specify size K, I'm assuming this might get divided by K to make the proportion, matching equation #2? The np documentation is quite uninformative, so I'm not confident in this; I'll take a closer look at the source code.
In the meantime, it could be helpful to test this out on a dataset to see how it affects the proportions of each class?
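Something like the following toy experiment might do it; everything here (the balanced toy labels, the alpha values, the resampling strategy) is just illustrative. It also suggests each draw already sums to 1, so the drawn values can be used directly as class proportions, though that's worth confirming against the source:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 1_000)             # balanced toy labels, 3 classes

alpha = [0.1, 1.0, 10]
proportions = rng.dirichlet(alpha)          # one draw; components sum to 1
print("target proportions:", proportions)

# Resample each class (with replacement) to hit the drawn proportions.
new_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=int(p * len(y)), replace=True)
    for c, p in enumerate(proportions)
])
y_drifted = y[new_idx]
print("before:", np.bincount(y) / len(y))
print("after: ", np.bincount(y_drifted) / len(y_drifted))
```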