vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Passing cluster_labels broken #49

Open mdruiter opened 1 year ago

mdruiter commented 1 year ago

I think I have found a bug that occurs when passing some cluster_labels.

When I reverse the order of all input (both data and cluster_labels) and then reverse the result (local_outlier_probabilities), I would expect to get the same numbers back. That is what happens as long as all cluster_labels values are equal. As soon as there are two (clearly separated) clusters, however, the results change when the input is flipped! A further indication that something goes wrong (IMHO): the neighbor indices reported for the second cluster's points fall in the first cluster's index range!

A small reproduction example:

import matplotlib.pyplot as plt
import numpy as np
from PyNomaly import loop

np.random.seed(1)
n = 9
data = np.append(np.random.normal(2, 1, [n, 2]), np.random.normal(8, 1, [n, 2]), axis=0)
clus = np.append(np.ones(n),                     2 * np.ones(n)).tolist()  # 2 cluster numbers!
model = loop.LocalOutlierProbability(data, n_neighbors=5, cluster_labels=clus)
fit = model.fit()
res = fit.local_outlier_probabilities
print(res)
print(fit.neighbor_matrix)

data_flipped = np.flipud(data)
clus_flipped = np.flipud(clus).tolist()
model2 = loop.LocalOutlierProbability(data_flipped, n_neighbors=5, cluster_labels=clus_flipped)
fit2 = model2.fit()
res2 = np.flipud(fit2.local_outlier_probabilities)
print(res2)
print(np.flipud(fit2.neighbor_matrix))

s  = 1 + 100 * res.astype(float)
s2 = 1 + 100 * res2.astype(float)
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s,  marker='+')
plt.scatter(data[:, 0], data[:, 1], c=clus, s=s2, marker='x')
plt.show()
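The expectation above (reverse the input, reverse the output, get the same scores) can also be checked numerically rather than by eye. The following is a minimal sketch, not part of PyNomaly: `is_flip_invariant`, `centroid_distance` and `position_dependent` are hypothetical names, and the two scorers are stand-ins chosen only to show one passing and one failing case.

```python
import numpy as np

def is_flip_invariant(score_fn, data, labels, atol=1e-9):
    """Return True if reversing the rows of the input only reverses the scores."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    forward = np.asarray(score_fn(data, labels), dtype=float)
    backward = np.asarray(score_fn(data[::-1], labels[::-1]), dtype=float)
    return bool(np.allclose(forward, backward[::-1], atol=atol))

def centroid_distance(data, labels):
    # Stand-in scorer (assumption): distance of each point to its own cluster mean.
    # It does not depend on row order, so it should pass the check.
    scores = np.empty(len(data))
    for label in np.unique(labels):
        mask = labels == label
        scores[mask] = np.linalg.norm(data[mask] - data[mask].mean(axis=0), axis=1)
    return scores

def position_dependent(data, labels):
    # Deliberately broken stand-in: the score depends on the row position.
    return np.linalg.norm(data, axis=1) + np.arange(len(data))

rng = np.random.default_rng(1)
n = 9
data = np.vstack([rng.normal(2, 1, (n, 2)), rng.normal(8, 1, (n, 2))])
labels = np.repeat([1, 2], n)

ok = is_flip_invariant(centroid_distance, data, labels)
broken = is_flip_invariant(position_dependent, data, labels)
print(ok, broken)
```

To apply the same check to the reproduction above, `score_fn` could wrap the calls already used there, i.e. build `loop.LocalOutlierProbability(d, n_neighbors=5, cluster_labels=l.tolist())`, call `fit()` and return `local_outlier_probabilities`.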

mdruiter commented 1 year ago

The problem is in the 'definition' of neighbor_matrix: _compute_distance_and_neighbor_matrix returns indexes within the cluster, but _prob_distances_ev treats the numbers as being global.
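To illustrate the mismatch, here is a minimal NumPy-only sketch. It is not PyNomaly's actual code: `local_knn` and `per_cluster_neighbors` are hypothetical helpers, and brute-force Euclidean distances are an assumption made for brevity. It shows that neighbor indices computed inside a cluster are positions within that cluster, and that they have to be mapped back through the cluster's global row numbers before being used to index the full data set.

```python
import numpy as np

def local_knn(points, k):
    """Indices of the k nearest other rows, as positions within `points`."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def per_cluster_neighbors(data, labels, k, to_global=True):
    """Neighbor matrix computed separately for each cluster label."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    neighbors = np.empty((len(data), k), dtype=int)
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)  # global row numbers of this cluster
        local = local_knn(data[members], k)        # indices in 0..len(members)-1
        # Map local positions back to global row numbers unless asked not to.
        neighbors[members] = members[local] if to_global else local
    return neighbors

rng = np.random.default_rng(1)
n = 9
data = np.vstack([rng.normal(2, 1, (n, 2)), rng.normal(8, 1, (n, 2))])
labels = np.repeat([1, 2], n)

local_idx = per_cluster_neighbors(data, labels, k=5, to_global=False)
global_idx = per_cluster_neighbors(data, labels, k=5, to_global=True)

# Read as global row numbers, the second cluster's local indices (all < n)
# point at rows of the first cluster.
print(local_idx[n:].max() < n)                        # True
# After mapping, every neighbor carries the same label as its point.
print((labels[global_idx] == labels[:, None]).all())  # True
```

Whether a fix should convert the indices where they are produced or where they are consumed is a design choice this sketch does not settle; it only shows that the two sides must agree on one convention.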

vc1492a commented 1 year ago

Hey @mdruiter - thanks for noting the issue and where it is occurring.

Are you able to submit a fix in a pull request?