storpipfugl / pykdtree

Fast kd-tree implementation in Python
GNU Lesser General Public License v3.0
218 stars 47 forks source link

pydktree breaks down for very large number of data points #38

Open alejandrobll opened 6 years ago

alejandrobll commented 6 years ago

I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:

from pykdtree.kdtree import KDTree
import numpy as np

pos = np.random.rand(int(5e8),3)
nb = 32
tree = KDTree(pos)
d, idx = tree.query(pos, k=nb)
h = d[:,nb-1]
print np.min(h)

The result of the previous code is that the minimum distance to the 32nd neighbouring particles to some particle is zero, which is incorrect and, indeed very unlikely. It turns out that zero is assigned to many more than just one particle. I fact, it is zero for a large fraction of the particles. Doing

import numpy as np

k, = np.where(h == 0)
print(len(k))

returns 365782272. I.e., it is 0 for ~73 % of the whole sample. This is clearly the wrong answer.

I discovered the problem when using py-sphviewer, which relies on pyktree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volumes is very large (several hundreds of millions), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.

Any idea on what might be causing this weird behaviour? I also checked with either single and double precision.

storpipfugl commented 6 years ago

It's a pointer arithmetics overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410

I'll look into giving pykdtree an overhaul to support contemporary data set sizes