koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.28k stars · 118 forks

[FEATURE] Outlier or Other Filters for usage in Pipelines #575

Closed x-tabdeveloping closed 1 year ago

x-tabdeveloping commented 1 year ago

Motivating example

I keep finding myself in situations where I need to somehow filter the inputs to an estimator at the end of a pipeline, and this is especially true for unsupervised learning. Let's say I have a large dataset, and I would like to get a high-quality NMF document embedding model.

from typing import Iterable

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import MiniBatchNMF
from skpartial import make_partial_pipeline

# This is my pipeline
embedding_pipeline = make_partial_pipeline(
    CountVectorizer(),
    MiniBatchNMF()
)

# And I have a chunked corpus
corpus: Iterable[list[str]] = ...

# I have a training loop like this
for chunk in corpus:
    embedding_pipeline.partial_fit(chunk)

But the problem is that my dataset contains a lot of pornographic content, which might skew my results in unexpected ways (true story btw). Let's say that I have some way of reliably identifying documents that I do not want. For example, a classifier that returns -1 if we consider something garbage, and 1 if we consider it good (just like novelty detection in sklearn). I would like to make use of this model of mine that can classify things, but as of yet there is no pipeline component that can do this.
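For reference, the -1/1 convention mentioned here is the one sklearn's novelty detectors already follow. A minimal sketch with `LocalOutlierFactor` (the "garbage classifier" in the issue is hypothetical; any estimator with the same `predict` contract would do):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(100, 2))
# New data: five ordinary points plus one obvious outlier
X_new = np.vstack([rng.normal(0, 1, size=(5, 2)), [[10.0, 10.0]]])

# novelty=True enables predict() on unseen data: 1 = keep, -1 = garbage
detector = LocalOutlierFactor(novelty=True).fit(X_train)
labels = detector.predict(X_new)
```

Any model exposing this -1/1 `predict` interface could be dropped into the filter proposed below.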

Proposed Solution

An OutlierFilter or ModelBasedFilter or whatever metaestimator that can filter out outliers or unwanted content in a training pipeline.

Here's a short semantic example of how I imagine this would work:

import numpy as np
import scipy.sparse as spr
from sklearn.base import BaseEstimator, MetaEstimatorMixin, TransformerMixin

class OutlierFilter(TransformerMixin, BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimator: BaseEstimator, prefit: bool = False):
        self.estimator = estimator
        self.prefit = prefit

    def fit(self, X, y=None):
        if not self.prefit:
            self.estimator.fit(X, y)
        return self

    def transform(self, X):
        labels = self.estimator.predict(X)
        passes = labels == 1
        # Arrays and sparse matrices support boolean masking directly
        if isinstance(X, np.ndarray) or spr.issparse(X):
            return X[passes]
        else:
            return [elem for p, elem in zip(passes, X) if p]

Then one could simply include this in the pipeline outlined above.

# This is my pipeline
embedding_pipeline = make_partial_pipeline(
    OutlierFilter(PornClassifier()),
    CountVectorizer(),
    MiniBatchNMF()
)
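As a sanity check, here is the same sketch exercised end-to-end, with sklearn's `IsolationForest` (which follows the same -1/1 convention) standing in for the hypothetical `PornClassifier`; the class is repeated so the snippet runs on its own:

```python
import numpy as np
import scipy.sparse as spr
from sklearn.base import BaseEstimator, MetaEstimatorMixin, TransformerMixin
from sklearn.ensemble import IsolationForest


class OutlierFilter(TransformerMixin, BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimator, prefit=False):
        self.estimator = estimator
        self.prefit = prefit

    def fit(self, X, y=None):
        if not self.prefit:
            self.estimator.fit(X, y)
        return self

    def transform(self, X):
        passes = self.estimator.predict(X) == 1
        if isinstance(X, np.ndarray) or spr.issparse(X):
            return X[passes]
        return [elem for p, elem in zip(passes, X) if p]


rng = np.random.RandomState(0)
# 200 ordinary rows plus a small cluster of obvious outliers
X = np.vstack([rng.normal(0, 1, size=(200, 3)), rng.normal(8, 0.1, size=(5, 3))])

filt = OutlierFilter(IsolationForest(random_state=0))
X_clean = filt.fit(X).transform(X)
# X_clean has fewer rows than X: the flagged rows are gone
```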

If you deem this inappropriate for scikit-lego I will still implement it in a library of my own, I just thought it might make sense to contribute to something already dedicated to extending sklearn rather than create my own stuff. :))

FBruzzesi commented 1 year ago

Hi @x-tabdeveloping, I think it's a little bit hidden in the library but preprocessing.OutlierRemover may do the job 😊

x-tabdeveloping commented 1 year ago

Thank you for pointing this out, I don't know why I wasn't able to find this. One thing that bugs me though is that this will still not be enough for my use case as it can only deal with array input as per:

return X[predictions != -1]

Do you think it would be okay if I change the implementation so that it can also deal with lists and other iterables?

x-tabdeveloping commented 1 year ago

Another potential issue is that TrainOnlyTransformerMixin only accepts ndarrays and pandas data structures for hashing. What if I want to use an arbitrary iterable structure, AwkwardArray, etc.?

FBruzzesi commented 1 year ago

I don't know why I wasn't able to find this.

I can't find it in the docs either, yet I remember having used it in the past.

it can only deal with array input

Curious to know in which case converting the list/iterable into an array before handing it to the model is not an option.
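For what it's worth, converting a list of documents is a one-liner, after which boolean masking works as usual (a sketch; the predictions here are made up):

```python
import numpy as np

docs = ["a good doc", "some spam", "another good doc"]
predictions = np.array([1, -1, 1])  # pretend model output, -1 = remove

# dtype=object keeps the strings intact instead of building a char array
kept = np.asarray(docs, dtype=object)[predictions != -1]
print(list(kept))  # ['a good doc', 'another good doc']
```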

Do you think it would be okay if I change the implementation so that it can also deal with lists and other iterables?

I am not a maintainer, was just trying to be helpful 😂

koaning commented 1 year ago

I am not a maintainer, was just trying to be helpful 😂

@FBruzzesi and the help is welcome! Even better: if you're interested in helping out with the maintenance, do tap my shoulder about that at PyData Amsterdam.

Do you think it would be okay if I change the implementation so that it can also deal with lists and other iterables?

@x-tabdeveloping To my knowledge, scikit-learn does assume something "array-like" most of the time, but there are exceptions. The TfidfVectorizer is one of them: you can pass generators of text, which can be super useful for larger datasets. I think the CountVectorizer might also allow for that. The downside of going "outside" of the array-like assumption is that things tend to get a bunch slower.
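CountVectorizer does indeed accept any iterable of texts, including a generator, so the documents never need to sit in memory all at once:

```python
from sklearn.feature_extraction.text import CountVectorizer


def stream_docs():
    # Lazily yields documents, e.g. read from disk one at a time
    for doc in ("the cat sat", "the dog ran", "cats and dogs ran"):
        yield doc


vec = CountVectorizer()
X = vec.fit_transform(stream_docs())  # one row per streamed document
```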

That said, it seems you're referring to a classifier as opposed to an outlier detector, and it could be that this is something that our library does not do. However, the OutlierRemover is also a bit of an anti-pattern: most scikit-learn pipelines don't remove data, and you could wonder whether removing data from within a pipeline belongs there at all. It could also be seen as a filtering step that isn't part of the "ML-learning step".

x-tabdeveloping commented 1 year ago

Yeah, I was actually wondering whether we would consider this an anti-pattern or not, and I felt a bit ideologically unsure; that's why I opened an issue instead of just jumping on it right away. The reason I was thinking a lot about something like this is that I don't feel this process of filtering (and/or streaming) certain (potentially textual) data is quite streamlined yet, whereas the concept of a machine-learning Pipeline of different trainable (or non-trainable) components is emerging as something quite powerful, and I just couldn't help thinking whether a Pipeline could accommodate some of these steps as well.

A colleague and I are currently trying to streamline and sklearn-ify the way we train word and document embeddings at the Center for Humanities Computing, as I found myself reimplementing virtually the same stuff (iterable wranglers, cleaners, trash detectors, iterative training, etc.) whenever I was working on something similar. I guess some engineering and ideological work has to go into somehow fitting all of this into one bigger something :D

As for the iterable thingy, I think there is certainly a way in which one could implement behaviour that is compatible with arbitrary iterables and does not compromise performance.

passes = predictions != -1
try:
    res = X[passes]
except SomeError:
    # Or you could turn this into an array or a list, whatever.
    res = (elem for is_good, elem in zip(passes, X) if is_good)
return res

Of course then the hashing would still introduce horrible complications. Like one could turn it into a tuple and then hash that but then are the elements hashable? And that also introduces interesting dilemmas about forcing evaluation at that point of the pipeline, like is the iterable repeatable, and also how can one in good conscience evaluate something and then not let the rest of the pipeline use the evaluated version. Iterables are one huge rabbit hole-abyss that's for sure.
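One common workaround for the "is the iterable repeatable" question is to wrap a zero-argument factory so that every pass restarts the stream. A minimal sketch (not part of scikit-lego):

```python
class Reiterable:
    """Wraps a factory that builds a fresh iterator on every pass."""

    def __init__(self, factory):
        self.factory = factory

    def __iter__(self):
        # A new generator is created each time iteration starts
        return iter(self.factory())


docs = Reiterable(lambda: (f"doc {i}" for i in range(3)))
# Unlike a raw generator, this can safely be consumed more than once
print(list(docs))
print(list(docs))  # same result the second time around
```

This sidesteps forced evaluation, though it doesn't solve hashing: the stream would still have to be materialized (and its elements hashable) at that point.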

koaning commented 1 year ago

A colleague and I are currently trying to streamline and sklearn-ify the way we train word and document embeddings at the Center for Humanities Computing, as I found myself reimplementing virtually the same stuff (iterable wranglers, cleaners, trash detectors, iterative training, etc.) whenever I was working on something similar. I guess some engineering and ideological work has to go into somehow fitting all of this into one bigger something :D

I mean ... "we" over at calmcode labs have a tool for that called "embetter". Check it out, you might like it. It even explores the space of "finetuners".

koaning commented 1 year ago

For now though, I think I'll close this issue. It doesn't seem like there's an issue with this library and new features should only get added once we've figured out the maintainer situation.

This package has gotten popular, but both @MBrouns and myself aren't really users of the package anymore and have moved on to work on other problems. So we should figure out a transition to new maintainers first before new features get added.

x-tabdeveloping commented 1 year ago

I know about embetter and use it extensively in my work, I even recommend it in some of my packages' docs. We need Word2Vec and Doc2Vec though, so that's what we're working on rn.

koaning commented 1 year ago

Feel free to open up an issue for that on the embetter side. Adding gensim support should be relatively easy.
