vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Distance Matrix support #24

TSFelg closed this issue 5 years ago

TSFelg commented 5 years ago

I'm currently using LOF for a Distance Matrix. Is it possible to also use a Distance Matrix for LoOP? Or are the points needed for the computation of the probabilities?

vc1492a commented 5 years ago

@TFelgueira thanks for opening this issue! The current PyNomaly implementation cannot take a distance matrix in place of the actual values used to compute that distance matrix. This could, however, be introduced as a new feature in the current NumPy implementation. I've also been planning to transition PyNomaly to a scikit-learn code base in the future, and I believe that implementation would more readily support the use of a distance matrix. Would you be able to share a little more information about your use case? This will help me determine an appropriate time to introduce this capability into PyNomaly. Thanks!
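On the question of whether the points themselves are needed: as LoOP is defined in the original paper (Kriegel et al., 2009), the score depends only on each point's distances to its k nearest neighbors and on which points those neighbors are. The sketch below is a minimal NumPy illustration of that definition, not PyNomaly's own code. The function name `loop_from_neighbors` and the toy data are made up for this example, and it assumes no duplicate points (which would give a zero neighborhood distance).

```python
import math
import numpy as np


def loop_from_neighbors(dist, idx, extent=3):
    """Local outlier probabilities from k-nearest-neighbor distances only.

    dist[i, j] is the distance from point i to its j-th nearest neighbor,
    idx[i, j] is the row index of that neighbor. Hypothetical helper.
    """
    dist = np.asarray(dist, dtype=float)
    idx = np.asarray(idx, dtype=int)
    # Standard distance of each point to its neighborhood, scaled by extent.
    pdist = extent * np.sqrt(np.mean(dist ** 2, axis=1))
    # PLOF: own probabilistic distance relative to the neighbors' average.
    plof = pdist / pdist[idx].mean(axis=1) - 1.0
    # Normalisation over the whole data set.
    nplof = extent * math.sqrt(np.mean(plof ** 2))
    if nplof == 0.0:
        return np.zeros(len(plof))
    erf = np.array([math.erf(v / (nplof * math.sqrt(2.0))) for v in plof])
    return np.maximum(0.0, erf)


# Toy data: five evenly spaced values and one far-away value.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 20.0])
full = np.abs(x[:, None] - x[None, :])   # full pairwise distance matrix

k = 2
masked = full.copy()
np.fill_diagonal(masked, np.inf)          # never pick a point as its own neighbor
idx = np.argsort(masked, axis=1)[:, :k]   # indices of the k nearest neighbors
dist = np.take_along_axis(full, idx, axis=1)

scores = loop_from_neighbors(dist, idx)
print(scores)
```

Only `dist` and `idx` enter the computation, so any metric that can fill a distance matrix can in principle be used.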

TSFelg commented 5 years ago

Of course, thank you for your interest!

I have many histograms which I want to perform clustering/outlier detection on. But histograms require specific statistical distances like Chi2 and Earth Mover's Distance, which are not available in most clustering/outlier detection tools. That is why I've been calculating the distance matrix myself and then using HDBSCAN and LOF, for example, to cluster the histograms and/or find outliers.

The fact that LoOP adds the probabilistic view to LOF is a big advantage for my use case, hence why it would be a great help to have it accept distance matrices :)
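For readers following along, the workflow described above (compute a custom distance matrix, then hand it to a neighbor-based method) might look roughly like the sketch below. It uses one common form of the chi-square distance, 0.5 * sum((a - b)^2 / (a + b)), on normalised histograms. The helper names `chi2_distance_matrix` and `k_nearest_from_matrix` and the toy histograms are assumptions for illustration, not code from this thread.

```python
import numpy as np


def chi2_distance_matrix(hists, eps=1e-12):
    """Pairwise chi-square distance between histograms (one per row)."""
    h = np.asarray(hists, dtype=float)
    h = h / h.sum(axis=1, keepdims=True)       # normalise each histogram to sum to 1
    diff = h[:, None, :] - h[None, :, :]
    total = h[:, None, :] + h[None, :, :]
    # eps avoids 0/0 for bins that are empty in both histograms.
    return 0.5 * np.sum(diff ** 2 / (total + eps), axis=2)


def k_nearest_from_matrix(full, k):
    """Return (distances, indices) of the k nearest neighbors of each row."""
    masked = np.array(full, dtype=float)
    np.fill_diagonal(masked, np.inf)            # exclude each point from its own neighbors
    idx = np.argsort(masked, axis=1)[:, :k]
    d = np.take_along_axis(np.asarray(full, dtype=float), idx, axis=1)
    return d, idx


# Toy histograms: three similar shapes and one reversed shape.
hists = [
    [10, 20, 30],
    [11, 19, 30],
    [9, 21, 30],
    [30, 20, 10],
]
D = chi2_distance_matrix(hists)
d, idx = k_nearest_from_matrix(D, k=2)
print(D.round(4))
print(idx)
```

The same two arrays, `d` and `idx`, are the kind of input a distance-matrix-based LoOP needs; Earth Mover's Distance or any other metric could replace the chi-square function.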

vc1492a commented 5 years ago

@TFelgueira thanks for the information. I've decided to include this feature in the next release, 0.2.6. Before that happens, you can check out this commit on the dev branch. It includes an implementation that allows you to provide a distance matrix and a neighbor index matrix (i.e. the unique IDs of the closest neighbors) when calculating the local outlier probability. A few things:

Providing a distance matrix is not yet implemented for the stream functionality, and I haven't written any unit tests for this new functionality yet. I hope to get to that soon and test it more thoroughly before merging with master. In the meantime, check out iris_dist_grid.py in the examples to see how to use your own distance matrix with PyNomaly.

vc1492a commented 5 years ago

@TFelgueira I have merged dev with master and released this feature as part of the 0.2.6 release. With the most recent version, you can now specify a distance matrix and a neighbor matrix and use those matrices to calculate the LoOP.
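A minimal sketch of what that usage might look like follows. The keyword names `distance_matrix`, `neighbor_matrix` and `n_neighbors` and the `local_outlier_probabilities` attribute are written from memory of the project README, so verify them against the installed version. The sketch also assumes both matrices have shape (n_samples, n_neighbors). The Euclidean distances are only a stand-in for a custom metric, and the import is guarded so that the matrix preparation still runs where PyNomaly is not installed.

```python
import numpy as np

try:
    from PyNomaly import loop  # assumed: version 0.2.6 or later
except ImportError:
    loop = None  # the matrix preparation below still runs without the package

# Toy data: a Gaussian blob plus one distant point.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(30, 2)), [[8.0, 8.0]]])

# Stand-in for any precomputed metric (chi-square, Earth Mover's Distance, ...).
full = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))

k = 5
masked = full.copy()
np.fill_diagonal(masked, np.inf)             # exclude each point from its own neighbors
idx = np.argsort(masked, axis=1)[:, :k]      # neighbor index matrix, shape (n, k)
d = np.take_along_axis(full, idx, axis=1)    # matching distances, shape (n, k)

if loop is not None:
    m = loop.LocalOutlierProbability(
        distance_matrix=d,
        neighbor_matrix=idx,
        n_neighbors=k,
    ).fit()
    scores = m.local_outlier_probabilities
    print(scores[-1])  # score of the distant point
```

Note that the matrices passed here hold only each point's k nearest neighbors rather than the full square distance matrix.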