vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Inconsistency in case of dataframe and distance matrix input #46

Open jnpsk opened 1 year ago

jnpsk commented 1 year ago

This is not a project issue, but a suggestion to put some kind of warning in the distance matrix example in the README.

The README includes an example of using a distance matrix as input for LoOP. It shows how sklearn.neighbors.NearestNeighbors can be used to obtain a distance matrix together with an index matrix. It seems that, used this way, the matrices also contain each point's distance to itself, so the first nearest neighbor of every point has zero distance. On the other hand, the internal method _compute_distance_and_neighbor_matrix, used when the data argument is specified, excludes each point's distance to itself, and so gives different scores on the same data. I took a look at the test case, which allows a difference of 0.15 in the score vectors, so a difference between 0.45 and 0.6 is considered negligible. I think the output matrices of sklearn.neighbors.NearestNeighbors should be transformed first to be consistent with the internal algorithm.
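A minimal sketch of the behavior described above, using only scikit-learn and NumPy (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
data = rng.rand(20, 2)

neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=5, return_distance=True)

# When querying with the training data itself, each point is returned as its
# own first nearest neighbor: the first distance column is all zeros and the
# first index column is just 0..n-1.
print(np.allclose(d[:, 0], 0.0))                         # True
print(np.array_equal(idx[:, 0], np.arange(len(data))))   # True
```

Passing d and idx to LoOP as-is therefore feeds it one zero distance per point, which the internal code path never produces.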

vc1492a commented 1 year ago

@jnpsk thank you for identifying and noting this behavior in this issue!

While it's not a project issue as you stated, I think it would be best to align the behavior of using your own distance matrix with that of the original Local Outlier Probabilities method, for consistency. Perhaps this can be resolved by adding one additional neighbor when using sklearn.neighbors.NearestNeighbors and truncating the distance vectors for each point to its closest neighbors (excluding itself).
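The truncation suggested above can be sketched as follows (random illustrative data; the variable names are not part of the library's API): request one extra neighbor, then drop the self column.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
data = rng.rand(30, 3)
k = 10

neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)

# Request k + 1 neighbors so that, after discarding each point's zero-distance
# match with itself (always column 0 for distinct points), k true neighbors remain.
d, idx = neigh.kneighbors(data, n_neighbors=k + 1, return_distance=True)
d, idx = d[:, 1:], idx[:, 1:]

print(d.shape)        # (30, 10)
print((d > 0).all())  # True: no zero self-distances remain
```

The sliced d and idx should then match the shape and content that the internal _compute_distance_and_neighbor_matrix method produces.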

This fix should be included in the next version pushed out.

vc1492a commented 1 year ago

@jnpsk can you please comment on the version of SciPy you have installed? After running the test case as is with 120 observations, I receive a difference of 0.0 between the approach using the internal method _compute_distance_and_neighbor_matrix and the approach using sklearn.neighbors.NearestNeighbors. Could you please paste a reproducible example below?

jnpsk commented 1 year ago

Having defined the X_n120() function as in the test cases, I can run the following example on SciPy 1.9.1 without the assert raising an error.

from sklearn.neighbors import NearestNeighbors
from sklearn.utils._testing import assert_almost_equal
from PyNomaly import loop

# X_n120() is defined as in the project's test cases
data = X_n120()

# generate distance and neighbor index matrices
neigh = NearestNeighbors(metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, n_neighbors=10, return_distance=True)

# fit LoOP using the raw data and using the distance matrix
clf1 = loop.LocalOutlierProbability(data, n_neighbors=10)
clf2 = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=10)

scores1 = clf1.fit().local_outlier_probabilities
scores2 = clf2.fit().local_outlier_probabilities

# compare the agreement between the results
assert_almost_equal(scores1, scores2, decimal=1)

What I mean is that the d matrix has zeros in its first column, whereas the internal method does not include each point's distance to itself, resulting in a slight inconsistency. It is not a big deal on that dataset combined with the chosen n_neighbors = 10. But if the assert checks more decimal places, or a lower number of neighbors is used (e.g. 3), the two score vectors differ significantly.