manzilzaheer / CoverTree

Cover Tree implementation in C++ for k-Nearest Neighbours and range search
Apache License 2.0
90 stars 19 forks source link

Python API segfaults on small datasets #4

Open TapdancingRodent opened 7 years ago

TapdancingRodent commented 7 years ago

I'm getting a segfault when using the Python API (very grateful for that addition by the way =D), am I using it correctly? The API reference PDF doesn't list the constructor. I'm using gcc/g++7.1.0 and Python 2.7, see the gdb trace below.

>>> ct = CoverTree.from_matrix(document_embeddings) Faster Cover Tree with base 1.3 Max distance: 65.5565 Scale chosen: 18 100% [==================================================] [New Thread 0x7ffff13fe700 (LWP 11853)] [New Thread 0x7fffe8bfd700 (LWP 11854)] Duplicate entry!!!

Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffe8bfd700 (LWP 11854)] run (src=..., dst=...) at lib/Eigen/src/Core/Assign.h:410 410 dst.template copyPacket<Derived2, dstAlignment, srcAlignment>(index, src);

TapdancingRodent commented 7 years ago

OK, I found test.py and noting that it runs fine I've managed to narrow down that the segfault occurs only when the dataset is very small (<50,000 entries for me)

hoijui commented 7 years ago

hello! :-) can you maybe give the dataset, and complete reproducing instructions please? (i am just a by-passer, maybe the original dev(s) don't need that, but i do) .. alternatively, how to modify the test datasets to reproduce this would also work.

TapdancingRodent commented 7 years ago

Hi, I modified the test script to run on less data and managed to get a segfault in the training and a floating point error in testing, see the code dumps below.

test_populate.py

import numpy as np
import scipy as sc
from covertree import CoverTree
from sklearn.neighbors import NearestNeighbors

gt = time.time
np.random.seed(seed=3)

print 'Building cover tree'
x = np.random.rand(10000,128)
with open('train_data.bin', 'wb') as f:
    np.array(x.shape, dtype='int32').tofile(f)
    x.tofile(f)
print x[0,0], x[0,1], x[1,0]

mx = np.mean(x,0)
dists = np.array([np.sqrt(np.dot(xv-mx,xv-mx)) for xv in x])
idx = np.argsort(-dists)
xs = x[idx]
#print sc.spatial.distance.squareform(sc.spatial.distance.pdist(x, 'euclidean'))
t = gt()
ct = CoverTree.from_matrix(x)
b_t = gt() - t
#ct.display()
print "Building time:", b_t, "seconds"

print "Test covering: ", ct.test_covering()

print 'Generate random points'
y = np.random.rand(5000,128)
with open('test_data.bin', 'wb') as f:
    np.array(y.shape, dtype='int32').tofile(f)
    y.tofile(f)

print 'Test Nearest Neighbour: '
t = gt()
a = ct.NearestNeighbour(y)
b_t = gt() - t
print "Query time:", b_t, "seconds"
nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(xs)
distances, indices = nbrs.kneighbors(y)
b = np.squeeze(xs[indices])
if np.all(a==b):
    print "Test for Nearest Neighbour passed"
else:
    print "Test for Nearest Neighbour failed"

print 'Test k-Nearest Neighbours (k=2): '
t = gt()
a = ct.kNearestNeighbours(y,2)
b_t = gt() - t
print "Query time:", b_t, "seconds"
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute').fit(xs)
distances, indices = nbrs.kneighbors(y)
if np.all(a==xs[indices]):
    print "Test for k-Nearest Neighbours passed"
else:
    print "Test for k-Nearest Neighbours failed"

print 'Test delete: '
x2 = np.vstack((xs[:indices[0,0]], xs[indices[0,0]+1:]))
dels = ct.remove(xs[indices[0,0]])
print 'Point deleted: ', dels
a = ct.NearestNeighbour(y)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(x2)
distances, indices = nbrs.kneighbors(y)
b = np.squeeze(x2[indices])
if np.all(a==b):
    print "Test for delete passed"
else:
    print "Test for delete failed"

test_search.py

import numpy as np
import scipy as sc
from covertree import CoverTree
from sklearn.neighbors import NearestNeighbors

gt = time.time
np.random.seed(seed=3)

print 'Building cover tree'
x = np.random.rand(500000,128)
with open('train_data.bin', 'wb') as f:
    np.array(x.shape, dtype='int32').tofile(f)
    x.tofile(f)
print x[0,0], x[0,1], x[1,0]

mx = np.mean(x,0)
dists = np.array([np.sqrt(np.dot(xv-mx,xv-mx)) for xv in x])
idx = np.argsort(-dists)
xs = x[idx]
#print sc.spatial.distance.squareform(sc.spatial.distance.pdist(x, 'euclidean'))
t = gt()
ct = CoverTree.from_matrix(x)
b_t = gt() - t
#ct.display()
print "Building time:", b_t, "seconds"

print "Test covering: ", ct.test_covering()

print 'Generate random points'
y = np.random.rand(2,128)
with open('test_data.bin', 'wb') as f:
    np.array(y.shape, dtype='int32').tofile(f)
    y.tofile(f)

print 'Test Nearest Neighbour: '
t = gt()
a = ct.NearestNeighbour(y)
b_t = gt() - t
print "Query time:", b_t, "seconds"
nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(xs)
distances, indices = nbrs.kneighbors(y)
b = np.squeeze(xs[indices])
if np.all(a==b):
    print "Test for Nearest Neighbour passed"
else:
    print "Test for Nearest Neighbour failed"

print 'Test k-Nearest Neighbours (k=2): '
t = gt()
a = ct.kNearestNeighbours(y,2)
b_t = gt() - t
print "Query time:", b_t, "seconds"
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute').fit(xs)
distances, indices = nbrs.kneighbors(y)
if np.all(a==xs[indices]):
    print "Test for k-Nearest Neighbours passed"
else:
    print "Test for k-Nearest Neighbours failed"

print 'Test delete: '
x2 = np.vstack((xs[:indices[0,0]], xs[indices[0,0]+1:]))
dels = ct.remove(xs[indices[0,0]])
print 'Point deleted: ', dels
a = ct.NearestNeighbour(y)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(x2)
distances, indices = nbrs.kneighbors(y)
b = np.squeeze(x2[indices])
if np.all(a==b):
    print "Test for delete passed"
else:
    print "Test for delete failed"
hoijui commented 7 years ago

aehm... i am really no python dev, so.. could you please give me instruction how to run these? ;-) i put these files in my CoverTree/ folder, next to test.py, and run like this:

python test_populate.py

and then i get:

Traceback (most recent call last):
  File "test_populate.py", line 3, in <module>
    from covertree import CoverTree
  File "/home/hoijui/src/CoverTree/covertree/__init__.py", line 1, in <module>
    from covertree import CoverTree
  File "/home/hoijui/src/CoverTree/covertree/covertree.py", line 1, in <module>
    import covertreec
ImportError: No module named covertreec

i guess it is somehow related to setup.py, but i don't know how to use that either.

TapdancingRodent commented 7 years ago

Hey, you're right - you don't have the python bindings set up for your system, you just need to run the below:

python setup.py install

p.s. if you've got a default Python install, you'll also need to install the numpy, scipy and sklearn modules - I'd recommend using the pip Python package manager which is under python-pip if you're using aptitutde.

hoijui commented 7 years ago

thank you! :-) i was able to fix the SIGSEGV of test_populate.py, and the SIGFPE of test_search.py, but am stuck in a SIGSEGV caused later by test_search.py the fixes are here (together with some other, minor changes of mine): https://github.com/hoijui/CoverTree

how to debug this stuff: hints from here: http://www.boost.org/doc/libs/1_63_0/libs/python/doc/html/faq/how_do_i_debug_my_python_extensi.html

concretely i did:

ddd # this is a GUI frontend for gdb
# inside ddd ...
target exec python
run test_search.py

the problem i have with the remaining SIGSEGV, is that the backtrace shows nothing in this libraries code, which means we have to use valgrind to debug that one, which is kind of a pain .. not today anymore! ;-)

hoijui commented 7 years ago

.. i did so, and found the remaining bug (in many places in the code). so both your examples should now work; fixes are here: https://github.com/hoijui/CoverTree

TapdancingRodent commented 7 years ago

Hey, Great work and thanks for digging out the bug. I should be getting back into this code in a month or two when I'll switch over to your code base.

In the mean time I'll leave the issue open here in case Manzil continues developing this repo in the future.