storpipfugl / pykdtree

Fast kd-tree implementation in Python
GNU Lesser General Public License v3.0
215 stars 47 forks source link

How to limit number of dimensions to 512 . #47

Closed HangJie720 closed 4 years ago

HangJie720 commented 4 years ago

Due to our face feature embeddings' dimension is (1, 512), how can I revise source code to allow dimensions to 512?

Thank you very much.

djhoese commented 4 years ago

What error message are you getting?

I typically use pykdtree (from the pyresample package) and give it an array of shape (N, 3). Are you saying you have multiple points, each with 512 "coordinates"? So the actual data you give to the KDTree is (N, 512)? Or is it always (1, 512)?

HangJie720 commented 4 years ago

What error message are you getting?

I typically use pykdtree (from the pyresample package) and give it an array of shape (N, 3). Are you saying you have multiple points, each with 512 "coordinates"? So the actual data you give to the KDTree is (N, 512)? Or is it always (1, 512)?

query_pts shape is (1, 512), data_pts shape is (22000,512), my code is as follows:

tree = KDTree(data_pts) distance, idx = tree.query(query_pts, k=1, sqr_dists=True)

error is : 'Max 127 dimensions allowed'

djhoese commented 4 years ago

I can't say why it is limited to 127 (@storpipfugl would know), but the line that checks this is:

https://github.com/storpipfugl/pykdtree/blob/18d95a2e9aaad30b203dc8092ddd24c0bdd0d58a/pykdtree/kdtree.pyx#L118

Theoretically if you removed this line and the line below it (meaning no limit to the number) then pykdtree would let you do whatever you want. However, I have a strong feeling that there is a reason that number is set to 127. It is possible that the low-level code only allocates enough space for 127 dimensions and any more will cause a segmentation fault. Or the results will be inaccurate. I'm not sure.

HangJie720 commented 4 years ago

I can't say why it is limited to 127 (@storpipfugl would know), but the line that checks this is:

https://github.com/storpipfugl/pykdtree/blob/18d95a2e9aaad30b203dc8092ddd24c0bdd0d58a/pykdtree/kdtree.pyx#L118

Theoretically if you removed this line and the line below it (meaning no limit to the number) then pykdtree would let you do whatever you want. However, I have a strong feeling that there is a reason that number is set to 127. It is possible that the low-level code only allocates enough space for 127 dimensions and any more will cause a segmentation fault. Or the results will be inaccurate. I'm not sure.

although I removed this line, and reinstall, but the same error is appeared, so strange. File "pykdtree/kdtree.pyx", line 119, in pykdtree.kdtree.KDTree.__init__ ValueError: Max 127 dimensions allowed

HangJie720 commented 4 years ago

I can't say why it is limited to 127 (@storpipfugl would know), but the line that checks this is:

https://github.com/storpipfugl/pykdtree/blob/18d95a2e9aaad30b203dc8092ddd24c0bdd0d58a/pykdtree/kdtree.pyx#L118

Theoretically if you removed this line and the line below it (meaning no limit to the number) then pykdtree would let you do whatever you want. However, I have a strong feeling that there is a reason that number is set to 127. It is possible that the low-level code only allocates enough space for 127 dimensions and any more will cause a segmentation fault. Or the results will be inaccurate. I'm not sure.

I switch to v1.3.1 checkout, and then python3 setup.py install, new error is show as follows: raise ValueError('Data and query points must have same dimensions') the key question is my data_pts.shape is (20000,512), query_pts.shape is (1,512), This exception should not be thrown?

djhoese commented 4 years ago

You are likely not completely reinstalling the newly compiled version of the package.

If you check your KDTree objects .ndim after creating it it should be 512. If not, then you may have modified the code incorrectly.

Edit: The .pyx code is fairly close to python. You may be able to walk through it and see what is going on.

HangJie720 commented 4 years ago

you can run this program with random data.

import numpy as np from pykdtree.kdtree import KDTree

x = np.random.rand(512).reshape((1, 512)) y = np.random.rand(512*20000).reshape((20000, 512))

kdtree = KDTree(y) dist, idx = kdtree.query(x, sqr_dists=True)

error is show as follows: ValueError: Data and query points must have same dimensions

storpipfugl commented 4 years ago

The 127 dim limit is a hard limit as higher dimensional data will give integer overflows because int8_t is being used for dimension storage internally: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c.mako#L43 This is a deliberate choice (to save memory) as kd-trees are inefficient for dimensions above ~20 (brute force search will do the job just as well without the kd-tree memory overhead).

If you want to use kd-trees for NN search in a 512 dim space then dimensional reduction is needed first. Then a low dimensional kd-tree can be used to select candidate sets for neighbours and finally linear search can be performed on the (now much smaller) candidate sets using the corresponding 512 dimensional points.

mraspaud commented 4 years ago

I'm closing this as the original issue seems to have been answered.