Closed Petkomat closed 2 years ago
It is quite possible there is a bug here -- it is likely we need to rebuild the supporting data (including recompiling the query function) for querying. Thanks for spotting this. I'll see if I can figure something out soon.
The primary use case for the update so far had been to update the neighbor graph (which it seems to do), but querying that hasn't had as much (any?) work.
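To illustrate the failure mode in the abstract (this is a toy model of my own, not pynndescent's actual internals): an index whose `update()` refreshes the underlying data but forgets to invalidate the prepared query state will keep answering queries against a stale snapshot.

```python
# Toy illustration (NOT pynndescent code): update() refreshes the data
# but forgets to invalidate the prepared search state, so queries still
# see the old snapshot.
class ToyIndex:
    def __init__(self, data):
        self.data = list(data)
        self._prepared = None  # built lazily by prepare()

    def prepare(self):
        # "Compile" the query support: here, just a sorted snapshot.
        self._prepared = sorted(self.data)

    def update(self, fresh):
        self.data.extend(fresh)
        # BUG: self._prepared is not reset, so query() uses stale state.

    def query(self, x):
        if self._prepared is None:
            self.prepare()
        # Nearest value by absolute difference, over the prepared snapshot.
        return min(self._prepared, key=lambda v: abs(v - x))


idx = ToyIndex([0, 10, 20])
idx.prepare()
idx.update([5])
print(idx.query(5.1))  # prints 10, not 5: the prepared snapshot is stale
```

Calling `prepare()` again after the update makes the toy index find the fresh point, which is essentially what rebuilding the supporting data means here.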
Ok, then I will just try to find a workaround. It seems that simply constructing a new `NNDescent` object with `init_graph=...` should do: the init graph would contain the "correct" neighbours for the already known data and random ones for the fresh part. This works for me and is probably not that far from a "proper" solution (although, e.g., the sparse case is not covered).
```python
def update_graph(index: pynndescent.NNDescent, xs_fresh):
    # Keep the known-good neighbours for the existing points and seed the
    # fresh points with random neighbour candidates.
    ns, _ = index.neighbor_graph
    n, k = ns.shape
    n_fresh = xs_fresh.shape[0]
    ns = np.vstack((ns, np.random.randint(0, n + n_fresh, size=(n_fresh, k))))
    # Rebuild the index over old + new data, starting from the stitched graph.
    data = np.vstack((index._raw_data, xs_fresh))
    return pynndescent.NNDescent(data, metric=my_manhattan, n_neighbors=k, init_graph=ns)
```
```python
def test3_2():
    n_x = 100
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    n_y = 105
    ys = np.random.rand(n_y, 5)
    ys[:, 0] = np.arange(n_x, n_x + n_y, 1)
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=25)
    index.prepare()
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    index = update_graph(index, ys)
    ns_y, ds_y = index.query(ys, k=k)
    for n_y in ns_y:
        print(n_y.tolist())
    print(" ")
    for d_y in ds_y:
        print(d_y.tolist())
    z = index.neighbor_graph
    print(z)
```
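Since the first column of the test data is just `arange` and the remaining columns are random values in `(0, 1)`, the true Manhattan neighbours of instance `i` really are concentrated around index `i`, and for data this small that can be checked by brute force. A sketch of such a check (mine, not from the thread), using plain NumPy instead of the index:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 205
data = rng.random((n, 5))
data[:, 0] = np.arange(n)  # first column dominates the Manhattan distance

# Exact pairwise Manhattan distances, then the k nearest indices per row.
dists = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=-1)
k = 10
ns = np.argsort(dists, axis=1)[:, :k]

# Each point's nearest neighbour is itself, and the rest are nearby indices:
# a point at index offset o has distance in [o, o + 4), so the top-k can
# never come from far away.
assert (ns[:, 0] == np.arange(n)).all()
assert (np.abs(ns[150] - 150) <= k).all()
```

A correctly updated index should reproduce these neighbour sets for all 205 instances, not just the first 100.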
Yes, the `init_graph` option should do the job; it's not quite as good as what `update` should be, but hopefully it will provide a workaround until I can get it working properly. Your use case is definitely intended, it just wasn't properly debugged apparently.
So in good news this actually looks quite fixable. There are just some annoying kinks to work out to make sure it is reasonably smooth. Hopefully I'll have a fix in shortly.
I have tested the solution (mostly for its speed compared to the `init_graph` approach; indeed, the proper solution is considerably faster and seems to give the right neighbours). One update seems to work, even two in a row. Strangely, after the third update, things start to go wrong: a `SparseEfficiencyWarning` appears (also note that the query time in this round is considerably longer than in the previous ones, ca. 50 seconds), and the fourth update dies with an `IndexError`. Here are the code and its full output:
```python
import numpy as np
import pynndescent
import numba
import time

np.random.seed(1234)


@numba.njit()
def my_manhattan(x, y):
    return np.sum(np.abs(x - y))


def test3_2():
    n_x = 10**5
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    n_y = 10**5
    t0 = time.time()
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=25)
    t1 = time.time()
    print("init1", t1 - t0)
    index.prepare()
    t2 = time.time()
    print("prep1", t2 - t1)
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    print("quer1", time.time() - t2)
    for i in range(6):
        print("#################################")
        print(f"round {i + 1}")
        print("#################################")
        t3 = time.time()
        ys = np.random.rand(n_y, 5)
        ys[:, 0] = np.arange(n_x + i * n_y, n_x + (i + 1) * n_y, 1)
        # index = update_graph(index, ys)
        index.update(ys)
        t4 = time.time()
        print("updat", t4 - t3)
        index.prepare()
        t5 = time.time()
        print("prep2", t5 - t4)
        ns_y, ds_y = index.query(ys, k=k)
        d0 = ds_y[:, 0]
        if max(d0) > 10**-10:
            too_big = ds_y[:, 0] > 10**-10
            ns_y_filtered = ns_y[too_big]
            ds_y_filtered = ds_y[too_big]
            print("---> WRONG NS?")
            for row in ns_y_filtered:
                print(row)
            print("--------")
            for row in ds_y_filtered:
                print(row)
        t6 = time.time()
        print("quer2", t6 - t5)


test3_2()
```
```
C:\Users\...\pynnd\Scripts\python.exe C:/Users/..../fast_nn/pynnd_test.py
init1 34.970775842666626
prep1 26.96660089492798
quer1 0.8135221004486084
#################################
round 1
#################################
updat 24.724597454071045
prep2 0.0
quer2 0.7625434398651123
#################################
round 2
#################################
updat 24.66456961631775
prep2 0.0
quer2 0.858508825302124
#################################
round 3
#################################
C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
updat 16.782644987106323
prep2 0.0
quer2 49.37075424194336
#################################
round 4
#################################
C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
  File "C:/Users/.../fast_nn/pynnd_test.py", line 129, in <module>
    # test3()
  File "C:/Users/.../fast_nn/pynnd_test.py", line 118, in test3_2
    index.update(ys)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1744, in update
    self.prepare()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1587, in prepare
    self._init_search_graph()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1126, in _init_search_graph
    self._vertex_order = self._search_forest[0].indices
IndexError: list index out of range
```
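For what it's worth, the final `IndexError` is ordinary empty-list indexing: `self._search_forest[0]` can only fail this way if `_search_forest` has ended up empty by round 4. A minimal reproduction of just that failure mode (hypothetical state, not pynndescent code):

```python
# Hypothetical state: repeated updates leave the search forest empty,
# so indexing its first element fails exactly as in the traceback.
search_forest = []
try:
    search_forest[0]
except IndexError as exc:
    message = str(exc)
print(message)  # list index out of range
```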
I think I found the remaining bug. ~Will create~ Created a pull request
I wanted to do the following:

1. Create an `NNDescent` object from the first batch of data.
2. Extend it with a fresh batch of data and query the updated index.

I assumed that `NNDescent.update(fresh_data)` would do the trick. However, maybe I am using it in a wrong way, since it seems that the returned neighbours are only those from the first batch of the data. While inspecting `index = NNDescent(...)` after the update in debug mode, I discovered that

- `index._raw_data` and `index.neighbor_graph` (and `index._neighbor_graph`) seem to be updated properly (the indices of the neighbours of all 205 instances follow the basic intuition; see below),
- `index._vertex_order` was not updated properly (it still has length 100 and not 205).

The data (`xs` and `ys`) were generated so that each row looks like `[i, r, r, r, r]`, where `r` is an always different random value from `(0, 1)`. Thus, the nearest neighbours of the instance `i` (`0 <= i <= 204`) should typically be `i`, `i - 1`, `i + 1`, `i - 2`, `i + 2`, ... (possibly not in this exact order). However, the code below says that the neighbours of the instances `i >= 100` (those from `ys`) are `99`, `98`, `97`, ..., which suggests that the neighbours were found only among `xs`.

The actual output of the code below (indices/distances):
The code:
I installed pynndescent via pip.