online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Outlier detection API #99

Closed MaxHalford closed 5 years ago

MaxHalford commented 5 years ago

Now that @koaning has implemented EWVar for computing a variance that adapts to the data, we should be able to add a simple outlier detector. But first we have to decide on an interface contract, because I believe we want to handle outlier detection differently from other kinds of models.

I was thinking like this:

```python
import abc


class BaseOutlierDetector(abc.ABC):

    @abc.abstractmethod
    def fit_one(self, x):
        """Updates the model with a single observation."""

    @abc.abstractmethod
    def score_one(self, x):
        """Returns an outlier score for a single observation."""
```

We're going to assume that we won't be given labels: if observations were labeled as outliers or not, we could simply do binary classification. Instead, we'll frame outlier detection as an unsupervised problem. I don't want to put this in the Transformer box because the semantics are different. score_one should return a score whose magnitude depends on the model. The exact score doesn't really matter; what matters is the ordering of the observations. Rather than saying whether an observation is an outlier or not, we maintain a leaderboard of outliers. In my experience this is common practice in industry: when I worked at HelloFresh we kept a leaderboard of potential frauds and went through it in descending order of how strongly we believed each observation was an outlier.
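To make the contract concrete, here's a minimal sketch of a detector that fits it, using an exponentially weighted mean and variance in the spirit of EWVar. The class name and the exact update rule are illustrative, not river's actual implementation:

```python
import math


class EWZScoreDetector:
    """Scores each observation by its distance, in standard deviations,
    from an exponentially weighted running mean."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # forgetting factor: higher = adapts faster
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def fit_one(self, x):
        self.n += 1
        if self.n == 1:
            self.mean = x
            return self
        diff = x - self.mean
        # Standard exponentially weighted mean/variance updates
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff ** 2)
        return self

    def score_one(self, x):
        if self.var == 0:
            return 0.0
        return abs(x - self.mean) / math.sqrt(self.var)
```

Note that fit_one and score_one are deliberately separate: you can score an observation before deciding whether to learn from it.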

Feedback is more than welcome. I'll give you a cookie if you have a good idea.

koaning commented 5 years ago

It's worth mentioning that sklearn actually has a mixin for outliers. A main thing, IIRC, is that the output of .predict() is always 1 or -1.

I wonder whether it might be normal, or even required, to set a threshold value. Outlier detection is usually done by 1. calculating a likelihood metric and 2. checking whether it exceeds a certain threshold. A dynamic threshold shouldn't be needed when the likelihood itself keeps updating.
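For comparison, the sklearn-style behaviour can be sketched as a thin wrapper that maps any score_one function to a 1/-1 flag via a fixed threshold. ThresholdFlagger is a hypothetical name, not an existing class in either library:

```python
class ThresholdFlagger:
    """Maps an outlier score to an sklearn-style label:
    1 for inliers, -1 for outliers."""

    def __init__(self, score_one, threshold):
        self.score_one = score_one  # any callable returning an outlier score
        self.threshold = threshold

    def predict_one(self, x):
        return -1 if self.score_one(x) > self.threshold else 1
```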

MaxHalford commented 5 years ago

Yeah, indeed in sklearn it's either -1 or 1, which I find a bit restrictive. Using a threshold (manual or dynamic) is an idea, but in the end it doesn't really matter if what we're doing is ranking observations. This really depends on the use case and what the users expect. I could easily create a Leaderboard class which could be updated online and store the n observations that are most likely to be outliers, as determined by the score_one function.
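Such a Leaderboard could be a bounded min-heap that keeps only the n highest-scoring observations seen so far. This is just a sketch of the idea, not an actual class in the library:

```python
import heapq
import itertools


class Leaderboard:
    """Keeps the n observations with the highest outlier scores seen so far."""

    def __init__(self, n=10):
        self.n = n
        self.heap = []  # min-heap of (score, tie_breaker, x)
        self.counter = itertools.count()  # tie-breaker so x is never compared

    def update(self, x, score):
        entry = (score, next(self.counter), x)
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, entry)
        elif score > self.heap[0][0]:
            # New observation beats the weakest entry on the board
            heapq.heapreplace(self.heap, entry)
        return self

    def ranking(self):
        """Observations sorted from most to least anomalous."""
        return [(x, score) for score, _, x in sorted(self.heap, reverse=True)]
```

Each update is O(log n), so maintaining the board alongside the detector adds almost no cost to the stream.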

By the way, here is a paper which explains a bit of what Flink does for outlier detection.

MaxHalford commented 5 years ago

This paper seems like a fun one to implement. It also produces an arbitrary outlier score, which fits well with this idea of ranking outliers in a real-time leaderboard. It seems to take ideas from isolation forests, but is designed for streaming data. Maybe @raphaelsty wants to take a look at this when he's done with the LDA?

raphaelsty commented 5 years ago

@MaxHalford I will be very happy to work on this algorithm after completing the LDA.

MaxHalford commented 5 years ago

The anomaly module has just been added.