@jnpsk thank you for identifying and reporting this behavior!

While it's not a project issue as you stated, I think it would be best to align the behavior of using your own distance matrix with that of the original Local Outlier Probabilities method, for consistency. Perhaps this can be resolved by requesting one additional neighbor when using `sklearn.neighbors.NearestNeighbors` and truncating the distance and neighbor vectors for each point to its closest neighbors (excluding the point itself), as sketched below. This fix should be included in the next version pushed out.
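For reference, a minimal sketch of that adjustment (the function name `self_excluding_kneighbors` is hypothetical, not the shipped implementation): query one extra neighbor and drop the self column so the matrices match what `_compute_distance_and_neighbor_matrix` produces.

```python
from sklearn.neighbors import NearestNeighbors

def self_excluding_kneighbors(data, n_neighbors):
    # Hypothetical helper: when the query set equals the fit set, each
    # point's nearest neighbor is itself at distance zero, so request
    # one extra neighbor and discard that first column.
    neigh = NearestNeighbors(metric='euclidean')
    neigh.fit(data)
    d, idx = neigh.kneighbors(data, n_neighbors=n_neighbors + 1,
                              return_distance=True)
    # Drop the self-match so only true neighbors remain.
    return d[:, 1:], idx[:, 1:]
```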
@jnpsk can you please comment on the version of SciPy you have installed? After running the test case as-is with 120 observations, I receive a difference of 0.0 between the approach using the internal method `_compute_distance_and_neighbor_matrix` and the approach using `sklearn.neighbors.NearestNeighbors`. Could you please paste your reproducible example below?
Having defined the `X_n120()` function as in the test cases, I can run the following example on SciPy 1.9.1 without raising the assertion error.
```python
from sklearn.neighbors import NearestNeighbors
from sklearn.utils._testing import assert_almost_equal
from PyNomaly import loop

data = X_n120()

# generate distance and neighbor indices
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# fit LoOP using the data and using the distance matrix
clf1 = loop.LocalOutlierProbability(data, n_neighbors=10)
clf2 = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=10)
scores1 = clf1.fit().local_outlier_probabilities
scores2 = clf2.fit().local_outlier_probabilities

# compare the agreement between the results
assert_almost_equal(scores1, scores2, decimal=1)
```
What I mean is that the `d` matrix has zeros in the first column, whereas the internal method does not include the distance from a point to itself, resulting in a slight inconsistency. It is not a big deal on that dataset combined with the chosen `n_neighbors = 10`. But if the assert checks more decimal places, or a lower number of neighbors is used (e.g. 3), the two score vectors differ significantly.
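To make the offset concrete, here is a quick check (a sketch reusing `data`, `d`, and `idx` from the example above; it assumes no duplicate points in the data):

```python
import numpy as np

# Each point's nearest neighbor in the fit set is itself, so the first
# distance column is all zeros and the first index column is the point's
# own row index (barring ties from duplicate points).
assert np.allclose(d[:, 0], 0.0)
assert np.array_equal(idx[:, 0], np.arange(len(data)))
```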
Resolved this ticket by updating the documentation in the `readme.md` and including a test in `tests/test_loop.py` that not only illustrates the consistency but also shows how to achieve identical results, up to 10 decimal places, using both the internal method and `NearestNeighbors`. Covered in #72.
This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

There is an example in the README of using a distance matrix as input for LoOP. It shows how `sklearn.neighbors.NearestNeighbors` can be used to obtain a distance matrix together with an index matrix. It seems that this way the matrices also contain each point's distance to itself, resulting in a zero distance for the first nearest neighbor of every point. On the other hand, the internal method `_compute_distance_and_neighbor_matrix`, used when the `data` argument is specified, excludes the distances from a point to itself, and so gives different scores on the same data. I took a look at the test case, which allows a difference of 0.15 in the score vectors, and thus the difference between 0.45 and 0.6 is considered negligible. I think the output matrices of `sklearn.neighbors.NearestNeighbors` should be transformed first to be consistent with the internal algorithm, as sketched below.
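A hedged sketch of that transformation (assuming the same `X_n120()` data as the earlier example; the `decimal=10` tolerance matches the consistency described in the resolution above): request one extra neighbor, slice off the self column, and the two approaches should agree closely.

```python
from sklearn.neighbors import NearestNeighbors
from sklearn.utils._testing import assert_almost_equal
from PyNomaly import loop

data = X_n120()
n_neighbors = 10

# Request one extra neighbor, then drop the first column so the
# matrices exclude each point's zero distance to itself.
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=n_neighbors + 1,
                          return_distance=True)
d, idx = d[:, 1:], idx[:, 1:]

clf1 = loop.LocalOutlierProbability(data, n_neighbors=n_neighbors)
clf2 = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx,
                                    n_neighbors=n_neighbors)
scores1 = clf1.fit().local_outlier_probabilities
scores2 = clf2.fit().local_outlier_probabilities

# With the self column removed, the scores should now match to many
# more decimal places than the original decimal=1 check.
assert_almost_equal(scores1, scores2, decimal=10)
```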