Closed ivan-marroquin closed 2 years ago
Hi,
I tried the following custom metric - which I think is a good approximation to minkowski with fractional p
@numba.njit( [ "f4(f4[::1],f4[::1])", numba.types.float32( numba.types.Array(numba.types.float32, 1, "C", readonly=True), numba.types.Array(numba.types.float32, 1, "C", readonly=True), ), ], fastmath=True, locals={ "result": numba.types.float32, "dist": numba.types.float32, "fraction": numba.types.float32, "isamp": numba.types.uint16, }, ) def fractional_dist(p_vec, q_vec): """ this function implements the fractional distance metric, where fraction can be 0.3 or 0.4
based on research work by:
Aggarwal, C. C., A. Hinneburg, and D. A. Keim, 2001, On the surprising
behavior of distance metrics in high dimensional space: International
Conference on Database Theory, 420 - 434
"""
fraction= 0.3
result= 0.0
for isamp in range(0, p_vec.shape[0]):
if (p_vec[isamp] > q_vec[isamp]):
result += (p_vec[isamp] - q_vec[isamp]) ** fraction
else:
result += (q_vec[isamp] - p_vec[isamp]) ** fraction
dist= result ** (1 / fraction)
return dist
and then, I run these commands:
custom= pynndescent.NNDescent(data, metric= fractional_dist, n_neighbors= 15, random_state= 1969, n_jobs= -1) custom.prepare() cus_neig, cus_dist= custom.query(data, k= 15)
with the following results print(cus_neig[0,:]) [ 0 328 405 110 43 372 326 433 154 242 218 70 28 192 45]
print(cus_dist[0,:]) [ 0. 437.5364 537.1164 551.1845 608.3099 651.9673 684.3567 686.64856 686.86884 692.1493 695.0585 727.54346 748.20636 748.9065 762.17645]
The self-point has a distance of 0 which - I believe - is the desired result. However, the remaining distances do not match the ones obtained using minkowski p= 0.3
Any suggestions?
Kind regards, Ivan
I couldn't reproduce this locally but I have a later version installed. Is this fixed in 0.5.5
?
Hi @jamestwebber
I installed pynndescent 0.5.5 and used the Python code used to report this incident (on Nov 11, 2021). Unfortunately, I still have the same issue. Did you use my Python code? or something else?
Kind regards,
Ivan
I installed pynndescent 0.5.5 and used the Python code used to report this incident (on Nov 11, 2021). Unfortunately, I still have the same issue. Did you use my Python code? or something else?
I just ran your code again on my machine, and the self-distance for Minkowski is 0. That's weird! I'm using python 3.9.6
, numpy 1.20.3
and numba 0.54.1
. edit: same result for python 3.7.10
and numpy 1.19.5
.
Does this happen on any random data, or a specific array you're generating? You didn't set a seed there so I assume it's happening a lot?
Hi @jamestwebber
I have Python 3.6.5 with numpy 1.19.5 and pynndescent 0.5.5. The code posted in here is to demonstrate the issue: the self-distance is not 0. I also tried the code posted in here by setting a seed, and I still get the same issue.
Unfortunately, I have the same issue with my data sets: the self-distance is not 0.
Is it possible that there is something in Python 3.6.5 that causes this issue?
I will download Python 3.7.5, and give a try again. Let you know soon.
Many thanks for your help,
Ivan
Hm I just made a new environment with python 3.6.15, numpy 1.19.5, and pynndescent 0.5.5, and still didn't see this issue. Very strange.
Hi @jamestwebber
I did a fresh installation of Python 3.7.5 on my windows machine, then I made sure that I have pynndescent 0.5.5
Unfortunately, the issue still persists. May be something in my windows configuration or Python installation is causing this error.
Could you point me to the code that computes minkowski with p < 1? So, I can compare it to my approach to compute fractional minkowski (see my comments on Dec 9, 2021)
Ivan
As far as I know it's this: https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/distances.py#L113-L129
Unfortunately, the issue still persists. May be something in my windows configuration or Python installation is causing this error.
This does seem like a weird configuration issue. One thing to check is that you're using the right Python and importing everything from where you think, by checking sys.executable
and checking pynndescent.__file__
and so forth. I have no idea what could cause this to work different on your machine alone...
Hi @jamestwebber ,
Thanks for the link to how minkowski is computed in this package! My own code looks pretty close to the one used. So, I will use my code to bypass the issue that I reported.
I will close this issue as it seems there is something in my windows machine causing the reported issue.
Ivan
Hi,
Thanks for such great package!
I am using version 0.5.2 with Python 3.6.5. I compared the results of using euclidean distance vs. minkowski distance with p= 0.3 in the code below: import numpy as np import pynndescent
data= np.random.randn(500,9)
euclidean= pynndescent.NNDescent(data, metric= 'euclidean', n_neighbors= 15, random_state= 1969, n_jobs= -1) euclidean.prepare() euc_neig, euc_dist= euclidean.query(data, k= 15)
minkowski= pynndescent.NNDescent(data, metric= 'minkowski', metric_kwds= {'p': 0.3}, n_neighbors= 15, random_state= 1969, n_jobs= -1) minkowski.prepare() min_neig, min_dist= minkowski.query(data, k= 15)
Then, I looked at the outputs for the item in the query: euc_neig[0,:] array([ 0, 150, 298, 308, 48, 498, 93, 277, 188, 184, 219, 146, 420, 309, 160]) euc_dist[0,:] array([0. , 2.3493888, 2.3878694, 2.5395243, 2.5565991, 2.5695498, 2.5878463, 2.6194322, 2.624215 , 2.6499615, 2.662645 , 2.6754963, 2.7257159, 2.7690816, 2.808471 ], dtype=float32)
min_neig[0,:] array([ 0, 107, 14, 287, 260, 348, 57, 464, 22, 319, 298, 3, 186, 15, 424]) min_dist[0,:] array([ 10.079369, 219.20055 , 223.33961 , 227.77594 , 230.01102 , 248.14621 , 260.6848 , 288.2422 , 292.64026 , 304.21317 , 306.87302 , 311.07306 , 322.3568 , 326.94174 , 331.8149 ], dtype=float32)
As expected, the self-measured distance (first item in the query) is 0 for euclidean. However, for minkowski, the self-measured distance is not 0.
My question is, why the minkowski distance does not return 0 for the first item?
Thanks, Ivan