It's worth mentioning that sklearn actually has a mixin for outliers. The main thing, IIRC, is that the output of .predict() is always -1 or 1.
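For instance, here's a small sketch with IsolationForest, one of the sklearn estimators built on that mixin, contrasting the hard -1/1 labels with the continuous score one could rank on instead (the data and printed values are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.0], [0.1], [-0.2], [0.05], [10.0]])  # the last point is an obvious outlier

model = IsolationForest(random_state=42).fit(X)

# .predict() only says inlier (1) or outlier (-1)
print(model.predict(X))        # e.g. [ 1  1  1  1 -1]

# .score_samples() returns a continuous score (lower = more abnormal),
# which is more useful if we want to rank observations
print(model.score_samples(X))
```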
I wonder if it might be normal, or even required, to set a threshold value. Outlier detection is usually done by 1. calculating a likelihood metric and 2. checking whether it exceeds a certain threshold. A dynamic threshold shouldn't be needed when the likelihood itself is able to update.
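As a rough sketch of that two-step recipe, here's a toy detector where the likelihood metric is a running z-score (the class and the 3.0 threshold are purely illustrative, not a proposal for the library):

```python
class RunningZScore:
    """Scores each observation by its distance to the running mean, in running standard deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.sq_mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.sq_mean += (x ** 2 - self.sq_mean) / self.n
        return self

    def score(self, x):
        var = max(self.sq_mean - self.mean ** 2, 1e-12)
        return abs(x - self.mean) / var ** 0.5


threshold = 3.0  # fixed threshold, e.g. "3 sigmas"

detector = RunningZScore()
for x in [1.0, 1.2, 0.9, 1.1, 8.0]:
    # 1. compute the likelihood-style score, 2. compare it to the threshold
    is_outlier = detector.n > 0 and detector.score(x) > threshold
    detector.update(x)
    print(x, is_outlier)
```

Because the mean and variance keep updating, the same fixed threshold can keep making sense as the stream drifts.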
Yeah, indeed, in sklearn it's either -1 or 1, which I find a bit restrictive. Using a threshold (manual or dynamic) is an idea, but in the end it doesn't really matter if what we're doing is ranking observations. This really depends on the use case and what users expect. I could easily create a Leaderboard class which could be updated online and would store the n observations that are most likely to be outliers, as determined by the score_one function.
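To make that concrete, here's a bare-bones sketch of what such a class could look like (the name and API are just a suggestion; it keeps the n highest-scoring observations in a min-heap):

```python
import heapq


class Leaderboard:
    """Keeps the n observations with the highest anomaly scores seen so far."""

    def __init__(self, n):
        self.n = n
        self._heap = []      # min-heap of (score, counter, observation) tuples
        self._counter = 0    # tie-breaker so observations themselves never get compared

    def update(self, x, score):
        entry = (score, self._counter, x)
        self._counter += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
        return self

    def ranking(self):
        """Observations sorted from most to least likely to be an outlier."""
        return [(x, score) for score, _, x in sorted(self._heap, reverse=True)]
```

Each time score_one is called on a new observation, the observation and its score would be passed to update; ranking then gives the current top n at any point in the stream.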
By the way, here is a paper which explains a bit what Flink does for outlier detection.
This paper seems like a fun one to implement. It also produces an outlier score, which fits well with this idea of ranking outliers in a real-time leaderboard. It seems to take ideas from isolation forests, but is made for streaming data. Maybe @raphaelsty wants to take a look at this when he's done with the LDA?
@MaxHalford I will be very happy to work on this algorithm after completing the LDA.
The anomaly module has just been added.
Now that @koaning has implemented EWVar for computing a variance that adapts to the data, we should be able to add a simple outlier detector. But first we have to decide on an interface. Indeed, I believe that we want to handle outlier detection differently from other kinds of models. I was thinking along these lines:

- We're going to assume that we're not given labels. If observations are labeled as outliers or not, then we can simply do binary classification. We're going to frame outlier detection as an unsupervised problem. I don't want to put this in the Transformer box because the semantics are different.
- score_one should return a "score" whose magnitude depends on the model. The score itself doesn't really matter; what matters is the ordering of the observations. Indeed, we don't want to say whether an observation is an outlier or not, but rather maintain a leaderboard of outliers. In my experience this is common practice in industry. When I worked at HelloFresh we had a leaderboard of potential frauds and went through it in descending order, based on how strongly we believed each observation was an outlier.

Feedback is more than welcome. I'll give you a cookie if you have a good idea.