vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

question again #25

Closed lisongs1995 closed 5 years ago

lisongs1995 commented 5 years ago

When I use loop to detect outliers, it often fails at the statement "return (probabilistic_distance / ev_prob_dist) - 1.", and I have to set n_neighbors to a lower value. But sometimes, even with 4 or 5, it still fails with "ZeroDivisionError: float division by zero". Thanks for answering!

vc1492a commented 5 years ago

@lisongs1995 thanks for opening this issue. This would be tough to diagnose without providing the dataset you are using - would you be able to share that dataset with me, and specifically which parameters you are using, that results in this issue?

lisongs1995 commented 5 years ago

Of course, sir, you are so nice. The dataset in use is the KDD Cup SMTP subset, which you can download through sklearn.

from PyNomaly import loop
from sklearn.datasets import fetch_kddcup99

data = fetch_kddcup99(subset='smtp', percent10=False)
train_data, target = data.data, data.target
model = loop.Loop(train_data, extent=3, n_neighbors=6)  # class is aliased Loop

lisongs1995 commented 5 years ago

I also suspect this may be caused by duplicate samples?

vc1492a commented 5 years ago

@lisongs1995 I've examined your code and have come to a few conclusions. First, I am not sure which version of PyNomaly you are using, but please make sure you are on the latest version, 0.2.5, if you are not already.

I verified the issue and believe I have identified a root cause: duplicate samples in the data. When calculating the distance between duplicate observations, we obtain a distance of zero throughout the neighborhood for some observations. This makes ev_prob_dist equal to 0, which results in a zero division error when calculating (probabilistic_distance / ev_prob_dist) - 1. The code below replicates your issue.

from PyNomaly import loop
from sklearn.datasets import fetch_kddcup99

data = fetch_kddcup99(subset='smtp', percent10=False)
train_data, target = data.data, data.target
m = loop.LocalOutlierProbability(train_data[0:1000].astype(float), extent=3, n_neighbors=6).fit() 

The following code shows that PyNomaly runs as intended without duplicate samples.

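import numpy as np

# axis=0 drops duplicate rows; without it, np.unique flattens the array into unique scalar values
m = loop.LocalOutlierProbability(np.unique(train_data[0:1000].astype(float), axis=0), extent=3, n_neighbors=6).fit()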
print(m.local_outlier_probabilities)
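
To make the mechanism concrete, here is a minimal NumPy-only sketch of how duplicates can drive the denominator to zero. It is not PyNomaly's internal code. The helper name `probabilistic_distances` and the toy data are hypothetical, and the formula (extent times the root mean squared distance to the nearest neighbors) is a simplified reading of the LoOP definition. The sketch only flags where the denominator would be zero and does not perform the division.

```python
import numpy as np

def probabilistic_distances(data, n_neighbors, extent):
    """Hypothetical helper: simplified LoOP-style probabilistic set distance."""
    # Pairwise Euclidean distances between all rows.
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each point from its own neighborhood.
    np.fill_diagonal(dists, np.inf)
    # Indices and distances of the n_neighbors closest points per row.
    neighbor_idx = np.argsort(dists, axis=1)[:, :n_neighbors]
    neighbor_dists = np.take_along_axis(dists, neighbor_idx, axis=1)
    # extent * root mean squared distance to the neighborhood.
    pdist = extent * np.sqrt((neighbor_dists ** 2).mean(axis=1))
    return pdist, neighbor_idx

# Toy data (an assumption for illustration): 8 copies of one point, 3 distinct points.
data = np.array([[1.0, 2.0]] * 8 + [[5.0, 5.0], [6.0, 7.0], [9.0, 1.0]])

pdist, neighbor_idx = probabilistic_distances(data, n_neighbors=6, extent=3)

# Expected value of the probabilistic distance over each point's neighborhood.
ev_prob_dist = pdist[neighbor_idx].mean(axis=1)

# Rows where (probabilistic_distance / ev_prob_dist) - 1 would divide by zero.
zero_denominator = ev_prob_dist == 0
print(zero_denominator)
```

In this toy case only the duplicated rows are flagged, because each copy has at least `n_neighbors` other copies at distance zero. The distinct points each have at least one neighbor with a nonzero probabilistic distance.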

I have labeled this as a bug to be addressed in the next version of PyNomaly. Hope to get to this in the next couple of weeks, thanks for bringing this to my attention!

vc1492a commented 5 years ago

@lisongs1995 I have updated the dev branch with a fix to this issue, specifically this commit. This fix will be merged into master as part of version 0.2.6 - in the meantime, feel free to pull the development branch and use it.

vc1492a commented 5 years ago

@lisongs1995 I have merged dev with master and released this bug fix as part of the 0.2.6 release. To fix the issue you're seeing, upgrade to the latest version of PyNomaly - things should work as intended.
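
For anyone who wants to check whether their data is affected, a quick way is to count duplicate rows before fitting. This is a minimal sketch using only NumPy. The function name `duplicate_row_summary` and the sample array are hypothetical and not part of PyNomaly.

```python
import numpy as np

def duplicate_row_summary(data):
    """Hypothetical helper: count total rows, unique rows, and the largest group of identical rows."""
    unique_rows, counts = np.unique(data, axis=0, return_counts=True)
    return {
        "n_rows": len(data),
        "n_unique": len(unique_rows),
        "max_copies": int(counts.max()),
    }

# Sample array (an assumption for illustration): one row repeated three times.
data = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [2.0, 3.0]])
summary = duplicate_row_summary(data)
print(summary)
```

As a rough guide, if some row appears more than `n_neighbors` times, every copy of that row can have a neighborhood made up entirely of zero-distance points, which is the situation described in this thread.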