koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] CV solution for anomaly detection without outliers during training #307

Open janvdvegt opened 4 years ago

janvdvegt commented 4 years ago

With anomaly detection, if you have labeled outliers there are two types of models. The first type requires the outliers during training, although they are usually unlabeled; isolation forests fall into this category. A one-class SVM, on the other hand, specifically works better without the outliers in the training data. Properly evaluating such a model does require the outliers, though. The current sklearn setup does not allow for this case (I believe). It would be nice to have a way to do this easily.

One possible approach would be to use a different type of validation iterator, one that returns only negative (inlier) sample indices in the training fold but indices of both classes in the validation fold.

koaning commented 4 years ago

Just to confirm: the proposal is to create a new CV method that accepts an outlier detector as part of its initialisation?

I certainly see some merit to this idea. Got an example of what the API might look like?

janvdvegt commented 4 years ago

I think I mean something slightly different. There are outlier detection methods that only work when there are no outliers in the training data; in that sense, they are more like novelty detection. Of course, these outliers are very important for properly evaluating the hyperparameters and performance in general. So let's say we have X, which contains our features, and y, which contains whether each sample is considered an outlier or not. y is important for evaluation, but we don't use it during training. For these novelty algorithms, however, we want to throw out the positive samples in the training set, i.e. inside our CV loop.

Let's say we have the following dataset:

X   y (anomalous)
0   0
1   1
2   0
3   0
4   0
5   1
6   0

If the first four samples are in the training split, we want to remove the sample with X = 1 because we don't want outliers in our training set; but in the validation split we do not want to remove the sample with X = 5, because we need it for proper evaluation.

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.
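A minimal sketch of that filtering step on the toy data above, using a plain KFold (the names here are illustrative, not a proposed API):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(7).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 0])  # 1 marks a known outlier

for train_idx, test_idx in KFold(n_splits=2).split(X):
    # drop known outliers from the training fold only;
    # the validation fold keeps them for proper evaluation
    train_idx = train_idx[y[train_idx] == 0]
    print("train:", train_idx, "test:", test_idx)
```

On the second split this yields train [0, 2, 3] and test [4, 5, 6]: the outlier at X = 1 is removed from training, while the one at X = 5 stays in the validation fold, matching the example above.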

koaning commented 4 years ago

Just for confirmation: is this the situation?

X   y (anomalous)   y (to predict)  split
0   0               0               A
1   1               1000            A
2   0               2               A
3   0               3               B
4   0               4               B
5   1               10000           B
6   0               6               B
  1. We first generate a split; this gives us A and B.
  2. First A is the training set; do we keep the outlier? Do we remove outliers in B before it is passed to another pipeline?
  3. Then B is the training set; do we keep the outlier? Do we remove outliers in A before it is passed to another pipeline?

If you want to throw out novelties before passing the data to another pipeline ... you're gonna need an outlier detector first, no? When you say:

So let's say we have X which contains our features and we have y that contains whether they are considered to be an outlier or not.

If we have a label for being an outlier ... that's sometimes called classification. Do you have a use case in mind here? There might certainly be something interesting here, but this discussion feels just a tad theoretical. What problem will this solve in real life?

It deserves mentioning: our implementation of OutlierRemover seems relevant here.
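For reference, a hedged sketch of how scikit-lego's OutlierRemover is used; the keyword names below are assumptions from memory, so check the sklego docs for the exact signature:

```python
from sklearn.ensemble import IsolationForest
from sklego.meta import OutlierRemover

# OutlierRemover drops the rows its detector flags as outliers,
# but only during fit; at transform time all rows pass through.
# Parameter names here are assumed, not verified against the docs.
remover = OutlierRemover(outlier_detector=IsolationForest(), refit=True)
```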

MBrouns commented 4 years ago

I think it's the other way around @koaning. The train set should not contain the outliers, so if A is the training set in step 2, we remove the observation with X = 1.

There's not necessarily a link with other pipelines or models. The idea here, I think, is that if your outlier detector is a GMM, you don't want the known outliers in the training data, as they might skew your fit on 'normal' data. In validation you do want them, to evaluate your method.
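As an illustration of that point, a sketch using sklego's GMMOutlierDetector (class and parameter names as I recall them from the sklego docs, so treat them as assumptions): fit the density model on inliers only, then score everything.

```python
import numpy as np
from sklego.mixture import GMMOutlierDetector

rng = np.random.default_rng(0)
X_inliers = rng.normal(0, 1, size=(200, 2))
X_outliers = rng.uniform(-6, 6, size=(10, 2))

# fit on inliers only, so the known outliers cannot skew
# the estimate of what "normal" data looks like ...
detector = GMMOutlierDetector(n_components=1, threshold=0.99)
detector.fit(X_inliers)

# ... but evaluate on everything, known outliers included
preds = detector.predict(np.vstack([X_inliers, X_outliers]))
```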

MBrouns commented 4 years ago

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.

y is definitely available in the cv's split method; StratifiedKFold relies on this, for example.
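A quick check with the standard scikit-learn API, showing that split receives y and a custom splitter could use the outlier labels there:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 0, 0, 1])

# StratifiedKFold uses y to balance the folds; a custom splitter
# could use the same argument to filter outliers from training folds
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    print("train:", train_idx, "test:", test_idx)
```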

janvdvegt commented 4 years ago

Then it should not be too difficult. One issue with this approach, however, is that it seems like you would have to implement it for every different CV strategy. Is there a way around this? It might be possible to extend current CV strategies by inheritance and add an additional filter to the training fold, but this would require an additional pass over the data.
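One way around per-strategy inheritance would be composition: a wrapper that delegates splitting to any scikit-learn CV object and filters its training folds. A minimal sketch under that assumption (the class name and the y == 1 outlier encoding are illustrative, not a settled API):

```python
import numpy as np
from sklearn.model_selection import KFold

class OutlierFilteredCV:
    """Wraps any scikit-learn CV splitter and removes known outliers
    (y == 1) from the training folds, leaving validation folds intact."""

    def __init__(self, cv=None):
        self.cv = cv if cv is not None else KFold(n_splits=5)

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train_idx, test_idx in self.cv.split(X, y, groups):
            # keep only the inlier indices in the training fold
            yield train_idx[y[train_idx] == 0], test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)
```

Because scikit-learn's cross-validation utilities accept any object exposing split and get_n_splits, a wrapper like this should plug into cross_val_score and GridSearchCV without touching the individual CV strategies.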

FBruzzesi commented 10 months ago

If there is still interest in this feature, I would be happy to give it a try; it looks like a nice feature to have. However, I have a couple of questions:

koaning commented 10 months ago

Pun intended ... maybe this class name: WithoutlierCV?

It sure sounds better than WithoutOutlierCV, but then again the latter more literally explains what it does without trying to be clever, so that's probably better.