Closed lisongs1995 closed 5 years ago
@lisongs1995 thanks for opening this issue. This would be tough to diagnose without providing the dataset you are using - would you be able to share that dataset with me, and specifically which parameters you are using, that results in this issue?
Of course, and thank you. The dataset in use is the KDD Cup 99 smtp subset, which you can download through sklearn:
from sklearn.datasets import fetch_kddcup99
data = fetch_kddcup99(subset='smtp', percent10=False)
train_data, target = data.data, data.target
model = loop.Loop(train_data, extent=3, n_neighbors=6)  # class is aliased Loop
I also suspect this may be caused by duplicate samples?
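One quick way to test the duplicate-sample suspicion is to count how often each row occurs. The snippet below is only a sketch using the standard library; `count_duplicate_rows` is a hypothetical helper name, not part of PyNomaly or sklearn, and it assumes each row can be converted to a tuple of hashable values.

```python
from collections import Counter

def count_duplicate_rows(rows):
    """Return {row_as_tuple: count} for every row that occurs more than once."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}

# Tiny illustrative input (not taken from the kddcup data).
sample = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
print(count_duplicate_rows(sample))
```

A non-empty result means at least one sample is repeated. If a sample is repeated more often than the chosen number of neighbors, every neighbor distance for it can be zero.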
@lisongs1995 I've examined your code and have come to a few conclusions. First, I am not sure which version of PyNomaly you are using, but please make sure you are using the latest version, 0.2.5, if you are not using it already.
I verified the issue and believe I have identified a root cause: duplicate samples in the data. When calculating the distances between observations, we obtain a distance of zero throughout the neighborhood for some observations, so ev_prob_dist is equal to 0, and this results in a zero division error when calculating (probabilistic_distance / ev_prob_dist) - 1. The code below replicates your issue.
from PyNomaly import loop
from sklearn.datasets import fetch_kddcup99
data = fetch_kddcup99(subset='smtp', percent10=False)
train_data, target = data.data, data.target
m = loop.LocalOutlierProbability(train_data[0:1000].astype(float), extent=3, n_neighbors=6).fit()
The following code shows that PyNomaly runs as intended without duplicate samples.
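import numpy as np
# axis=0 drops duplicate rows (samples); without it np.unique flattens the array
m = loop.LocalOutlierProbability(np.unique(train_data[0:1000].astype(float), axis=0), extent=3, n_neighbors=6).fit()
print(m.local_outlier_probabilities)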
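To illustrate the mechanism described above, here is a small pure-Python sketch. It is not PyNomaly's implementation. The helper names (`knn_indices`, `prob_distance`, `plof`) are hypothetical, and the distance formula is a simplified stand-in (extent times the root mean squared distance to the k nearest neighbors). The only part taken from this thread is the final expression, (probabilistic_distance / ev_prob_dist) - 1.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def prob_distance(points, i, k, extent):
    """Simplified stand-in: extent * sqrt(mean squared k-NN distance)."""
    sq = [math.dist(points[i], points[j]) ** 2 for j in knn_indices(points, i, k)]
    return extent * math.sqrt(sum(sq) / len(sq))

def plof(points, i, k=2, extent=3):
    """Ratio of a point's distance to the mean distance of its neighbors, minus 1."""
    probabilistic_distance = prob_distance(points, i, k, extent)
    neighbors = knn_indices(points, i, k)
    ev_prob_dist = sum(prob_distance(points, j, k, extent) for j in neighbors) / k
    return (probabilistic_distance / ev_prob_dist) - 1.0

unique_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
duplicated_points = [(0.0, 0.0)] * 3 + [(5.0, 5.0)]

print(plof(unique_points, 0))  # finite value
try:
    plof(duplicated_points, 0)
except ZeroDivisionError as exc:
    print("duplicates:", exc)
```

With three identical points and k=2, every neighbor of the first point lies at distance zero, and so do the neighbors of those neighbors. The denominator is therefore zero, matching the error reported in this issue.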
I have labeled this as a bug to be addressed in the next version of PyNomaly. Hope to get to this in the next couple of weeks, thanks for bringing this to my attention!
@lisongs1995 I have updated the dev branch with a fix to this issue, specifically this commit. This fix will be pulled into master in version 0.2.6 - in the meantime, feel free to pull the development branch and use it.
@lisongs1995 I have merged dev with master and released this bug fix as part of the 0.2.6 release. To fix the issue you're seeing, upgrade to the latest version of PyNomaly - things should work as intended.
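If it is unclear whether an environment already has the fixed release, a check along these lines may help. This is a sketch only: the helper names are hypothetical, it assumes the package is installed under the distribution name PyNomaly, and it assumes version strings are purely numeric and dot-separated, as 0.2.5 and 0.2.6 are in this thread.

```python
from importlib import metadata

FIXED_IN = (0, 2, 6)  # release that contains the fix, per this thread

def parse_version(text):
    """Turn '0.2.6' into (0, 2, 6); assumes numeric dotted versions only."""
    return tuple(int(part) for part in text.split("."))

def has_duplicate_fix(version_text):
    """True if the given version is 0.2.6 or later."""
    return parse_version(version_text) >= FIXED_IN

def installed_has_fix(dist_name="PyNomaly"):
    """Check the installed distribution; returns None if it is not installed."""
    try:
        return has_duplicate_fix(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None

print(installed_has_fix())
```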
When I use loop to detect outliers, it often fails at the statement "return (probabilistic_distance / ev_prob_dist) - 1.", and I have to set the number of neighbors to a lower value. But sometimes, even when I set it to 4 or 5, it still raises "ZeroDivisionError: float division by zero". Thanks for answering!