lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
897 stars 105 forks source link

Measuring minkowski with p= 0.3 #151

Closed ivan-marroquin closed 2 years ago

ivan-marroquin commented 3 years ago

Hi,

Thanks for such great package!

I am using version 0.5.2 with Python 3.6.5. I compared the results of using euclidean distance vs. minkowski distance with p= 0.3 in the code below: import numpy as np import pynndescent

data= np.random.randn(500,9)

euclidean= pynndescent.NNDescent(data, metric= 'euclidean', n_neighbors= 15, random_state= 1969, n_jobs= -1) euclidean.prepare() euc_neig, euc_dist= euclidean.query(data, k= 15)

minkowski= pynndescent.NNDescent(data, metric= 'minkowski', metric_kwds= {'p': 0.3}, n_neighbors= 15, random_state= 1969, n_jobs= -1) minkowski.prepare() min_neig, min_dist= minkowski.query(data, k= 15)

Then, I looked at the outputs for the item in the query: euc_neig[0,:] array([ 0, 150, 298, 308, 48, 498, 93, 277, 188, 184, 219, 146, 420, 309, 160]) euc_dist[0,:] array([0. , 2.3493888, 2.3878694, 2.5395243, 2.5565991, 2.5695498, 2.5878463, 2.6194322, 2.624215 , 2.6499615, 2.662645 , 2.6754963, 2.7257159, 2.7690816, 2.808471 ], dtype=float32)

min_neig[0,:] array([ 0, 107, 14, 287, 260, 348, 57, 464, 22, 319, 298, 3, 186, 15, 424]) min_dist[0,:] array([ 10.079369, 219.20055 , 223.33961 , 227.77594 , 230.01102 , 248.14621 , 260.6848 , 288.2422 , 292.64026 , 304.21317 , 306.87302 , 311.07306 , 322.3568 , 326.94174 , 331.8149 ], dtype=float32)

As expected, the self-measured distance (first item in the query) is 0 for euclidean. However, for minkowski, the self-measured distance is not 0.

My question is, why the minkowski distance does not return 0 for the first item?

Thanks, Ivan

ivan-marroquin commented 2 years ago

Hi,

I tried the following custom metric - which I think is a good approximation to minkowski with fractional p

@numba.njit( [ "f4(f4[::1],f4[::1])", numba.types.float32( numba.types.Array(numba.types.float32, 1, "C", readonly=True), numba.types.Array(numba.types.float32, 1, "C", readonly=True), ), ], fastmath=True, locals={ "result": numba.types.float32, "dist": numba.types.float32, "fraction": numba.types.float32, "isamp": numba.types.uint16, }, ) def fractional_dist(p_vec, q_vec): """ this function implements the fractional distance metric, where fraction can be 0.3 or 0.4

    based on research work by:
    Aggarwal, C. C., A. Hinneburg, and D. A. Keim, 2001, On the surprising
    behavior of distance metrics in high dimensional space: International
    Conference on Database Theory, 420 - 434
    """

    fraction= 0.3

    result= 0.0

    for isamp in range(0, p_vec.shape[0]):
        if (p_vec[isamp] > q_vec[isamp]):
            result += (p_vec[isamp] - q_vec[isamp]) ** fraction

        else:
            result += (q_vec[isamp] - p_vec[isamp]) ** fraction

    dist= result ** (1 / fraction)

    return dist

and then, I run these commands:

custom= pynndescent.NNDescent(data, metric= fractional_dist, n_neighbors= 15, random_state= 1969, n_jobs= -1) custom.prepare() cus_neig, cus_dist= custom.query(data, k= 15)

with the following results print(cus_neig[0,:]) [ 0 328 405 110 43 372 326 433 154 242 218 70 28 192 45]

print(cus_dist[0,:]) [ 0. 437.5364 537.1164 551.1845 608.3099 651.9673 684.3567 686.64856 686.86884 692.1493 695.0585 727.54346 748.20636 748.9065 762.17645]

The self-point has a distance of 0 which - I believe - is the desired result. However, the remaining distances do not match the ones obtained using minkowski p= 0.3

Any suggestions?

Kind regards, Ivan

jamestwebber commented 2 years ago

I couldn't reproduce this locally but I have a later version installed. Is this fixed in 0.5.5?

ivan-marroquin commented 2 years ago

Hi @jamestwebber

I installed pynndescent 0.5.5 and used the Python code used to report this incident (on Nov 11, 2021). Unfortunately, I still have the same issue. Did you use my Python code? or something else?

Kind regards,

Ivan

jamestwebber commented 2 years ago

I installed pynndescent 0.5.5 and used the Python code used to report this incident (on Nov 11, 2021). Unfortunately, I still have the same issue. Did you use my Python code? or something else?

I just ran your code again on my machine, and the self-distance for Minkowski is 0. That's weird! I'm using python 3.9.6, numpy 1.20.3 and numba 0.54.1. edit: same result for python 3.7.10 and numpy 1.19.5.

Does this happen on any random data, or a specific array you're generating? You didn't set a seed there so I assume it's happening a lot?

ivan-marroquin commented 2 years ago

Hi @jamestwebber

I have Python 3.6.5 with numpy 1.19.5 and pynndescent 0.5.5. The code posted in here is to demonstrate the issue: the self-distance is not 0. I also tried the code posted in here by setting a seed, and I still get the same issue.

Unfortunately, I have the same issue with my data sets: the self-distance is not 0.

Is it possible that there is something in Python 3.6.5 that causes this issue?

I will download Python 3.7.5, and give a try again. Let you know soon.

Many thanks for your help,

Ivan

jamestwebber commented 2 years ago

Hm I just made a new environment with python 3.6.15, numpy 1.19.5, and pynndescent 0.5.5, and still didn't see this issue. Very strange.

ivan-marroquin commented 2 years ago

Hi @jamestwebber

I did a fresh installation of Python 3.7.5 on my windows machine, then I made sure that I have pynndescent 0.5.5

Unfortunately, the issue still persists. May be something in my windows configuration or Python installation is causing this error.

Could you point me to the code that computes minkowski with p < 1? So, I can compare it to my approach to compute fractional minkowski (see my comments on Dec 9, 2021)

Ivan

jamestwebber commented 2 years ago

As far as I know it's this: https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/distances.py#L113-L129

jamestwebber commented 2 years ago

Unfortunately, the issue still persists. May be something in my windows configuration or Python installation is causing this error.

This does seem like a weird configuration issue. One thing to check is that you're using the right Python and importing everything from where you think, by checking sys.executable and checking pynndescent.__file__ and so forth. I have no idea what could cause this to work different on your machine alone...

ivan-marroquin commented 2 years ago

Hi @jamestwebber ,

Thanks for the link to how minkowski is computed in this package! My own code looks pretty close to the one used. So, I will use my code to bypass the issue that I reported.

I will close this issue as it seems there is something in my windows machine causing the reported issue.

Ivan