lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

NNDescent.update(X) not working as expected #180

Closed Petkomat closed 2 years ago

Petkomat commented 2 years ago

I wanted to do the following:

  1. Create a NNDescent object from the first batch of data.
  2. Query some data.
  3. Wait for some time.
  4. Update the object with new data, but not start from scratch.
  5. Query some data.

I assumed that NNDescent.update(fresh data) would do the trick. However, maybe I am using it in the wrong way, since it seems that the returned neighbours come only from the first batch of the data. Inspecting index = NNDescent(...) in debug mode after the update pointed to the same conclusion.
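
In code, the intended pattern is roughly this (a minimal sketch of steps 1-5 above, not the exact script below):

import numpy as np
import pynndescent

first_batch = np.random.rand(100, 5)
fresh_batch = np.random.rand(105, 5)

index = pynndescent.NNDescent(first_batch)
ns1, ds1 = index.query(first_batch, k=10)  # step 2: works as expected

index.update(fresh_batch)                  # step 4: should extend the index in place
ns2, ds2 = index.query(fresh_batch, k=10)  # step 5: but neighbours still come only from first_batch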

The data (xs and ys) were generated so that they look like this:

xs:
0 r r r r
1 r r r r
2 r r r r
...
99 r r r r

ys:
100 r r r r
101 r r r r
...
204 r r r r

where each r is a different random value drawn from (0, 1). Since the first column dominates the Manhattan distance, the nearest neighbours of instance i (0 <= i <= 204) should typically be i, i - 1, i + 1, i - 2, i + 2, ... (possibly not in this exact order). However, the code below says that the neighbours of the instances with i >= 100 (those from ys) are 99, 98, 97, ..., which suggests that the neighbours were found only among xs.
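
For reference, this expectation can be verified by brute force, e.g. with scipy (a quick sketch; xs and ys as generated in the code below):

import numpy as np
from scipy.spatial.distance import cdist

data = np.vstack((xs, ys))                  # all 205 instances
d = cdist(data, data, metric="cityblock")   # exact Manhattan distances
exact_ns = np.argsort(d, axis=1)[:, :10]    # true 10 nearest neighbours of each instance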

The actual output of the code below (indices/distances):

[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
[98, 99, 97, 96, 95, 94, 93, 92, 91, 89]
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
... (some lines omitted)
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]

[2.1192119121551514, 2.8882203102111816, 4.930568218231201, 5.635573387145996, 6.850753307342529, 7.7378249168396, 8.205963134765625, 9.593037605285645, 10.798347473144531, 11.230871200561523]
[4.201259613037109, 4.454722881317139, 5.768087387084961, 6.7560529708862305, 7.452332973480225, 8.632887840270996, 9.428315162658691, 9.682640075683594, 11.393900871276855, 13.470145225524902]
[4.810489654541016, 5.09433650970459, 6.398769378662109, 7.656699180603027, 8.318954467773438, 9.736209869384766, 9.871195793151855, 11.387312889099121, 12.959441184997559, 13.922148704528809]
... (some lines omitted)
[105.62753295898438, 106.07768249511719, 107.53001403808594, 108.47374725341797, 109.60338592529297, 110.31566619873047, 110.80841064453125, 112.75614166259766, 114.3282699584961, 114.73919677734375]
[106.41008758544922, 106.78266906738281, 107.53002166748047, 108.84111022949219, 109.72077178955078, 111.25269317626953, 111.68487548828125, 113.38243103027344, 114.24507141113281, 115.30052947998047]

The code:

import numpy as np
import pynndescent
import numba

@numba.njit()
def my_manhattan(x, y):
    return np.sum(np.abs(x - y))

def test3():
    # first batch: 100 instances, row index stored in column 0
    n_x = 100
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    # second batch: 105 instances, column 0 continues the numbering
    n_y = 105
    ys = np.random.rand(n_y, 5)
    ys[:, 0] = np.arange(n_x, n_x + n_y, 1)
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=5)
    # index.prepare()
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    index.update(ys)
    # index._init_search_graph()
    ns_y, ds_y = index.query(ys, k=k)
    for row in ns_y:
        print(row.tolist())
    print(" ")
    for row in ds_y:
        print(row.tolist())

test3()

I installed pynndescent via pip.

lmcinnes commented 2 years ago

It is quite possible there is a bug here -- it is likely we need to rebuild the supporting data (including recompiling the query function) for querying. Thanks for spotting this. I'll see if I can figure something out soon.

The primary use case for update so far has been updating the neighbor graph itself (which it seems to do); querying the updated index hasn't had as much (any?) work.
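
For instance, the graph side can be sanity-checked along these lines (a rough sketch, reusing xs, ys, and my_manhattan from your script):

index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=5)
index.update(ys)
ns, ds = index.neighbor_graph
print(ns.shape)  # should be (205, 5) if the update extended the graph over both batches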

Petkomat commented 2 years ago

Ok, then I will just try to find a workaround. It seems that simply constructing a new NNDescent object with init_graph=... should do the trick: the init graph would contain the "correct" neighbours for the already known data and random ones for the fresh part.

Petkomat commented 2 years ago

This works for me and is probably not that far from a "proper" solution (though it doesn't cover, e.g., the sparse case).

def update_graph(index: pynndescent.NNDescent, xs_fresh):
    # neighbours of the points already in the index
    ns, _ = index.neighbor_graph
    n, k = ns.shape
    n_fresh = xs_fresh.shape[0]
    # random initial neighbours for the fresh points
    ns = np.vstack((ns, np.random.randint(0, n + n_fresh, size=(n_fresh, k))))
    data = np.vstack((index._raw_data, xs_fresh))
    # rebuild the index, seeding NN-descent with the combined graph
    return pynndescent.NNDescent(data, metric=my_manhattan, n_neighbors=k, init_graph=ns)

def test3_2():
    n_x = 100
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    n_y = 105
    ys = np.random.rand(n_y, 5)
    ys[:, 0] = np.arange(n_x, n_x + n_y, 1)
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=25)
    index.prepare()
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    index = update_graph(index, ys)
    ns_y, ds_y = index.query(ys, k=k)
    for row in ns_y:
        print(row.tolist())
    print(" ")
    for row in ds_y:
        print(row.tolist())
    z = index.neighbor_graph
    print(z)

lmcinnes commented 2 years ago

Yes, the init_graph option should do the job; it's not quite as good as what update should be, but hopefully it will provide a workaround until I can get update working properly. Your use case is definitely intended; it just apparently wasn't properly debugged.

lmcinnes commented 2 years ago

So in good news this actually looks quite fixable. There are just some annoying kinks to work out to make sure it is reasonably smooth. Hopefully I'll have a fix in shortly.

Petkomat commented 2 years ago

I have tested the solution (mostly for its speed compared to the init_graph approach; indeed, the proper solution is considerably faster and seems to give the right neighbours). One update seems to work, and even two in a row do. Strangely, after the third update, things start to go wrong.

Here are the code and its full output:

Code

import numpy as np
import pynndescent
import numba
import time

np.random.seed(1234)

@numba.njit()
def my_manhattan(x, y):
    return np.sum(np.abs(x - y))

def test3_2():
    n_x = 10**5
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    n_y = 10**5
    t0 = time.time()
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=25)
    t1 = time.time()
    print("init1", t1 - t0)
    index.prepare()
    t2 = time.time()
    print("prep1", t2 - t1)
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    print("quer1", time.time() - t2)
    for i in range(6):
        print("#################################")
        print(f"round {i + 1}")
        print("#################################")
        t3 = time.time()
        ys = np.random.rand(n_y, 5)
        ys[:, 0] = np.arange(n_x + i * n_y, n_x + (i + 1) * n_y, 1)
        # index = update_graph(index, ys)
        index.update(ys)
        t4 = time.time()
        print("updat", t4 - t3)
        index.prepare()
        t5 = time.time()
        print("prep2", t5 - t4)
        ns_y, ds_y = index.query(ys, k=k)
        # each freshly added point is now in the index,
        # so its nearest-neighbour distance should be ~0
        d0 = ds_y[:, 0]
        if max(d0) > 10**-10:
            too_big = ds_y[:, 0] > 10**-10
            ns_y_filtered = ns_y[too_big]
            ds_y_filtered = ds_y[too_big]
            print("---> WRONG NS?")
            for row in ns_y_filtered:
                print(row)
            print("--------")
            for row in ds_y_filtered:
                print(row)
        t6 = time.time()
        print("quer2", t6 - t5)

test3_2()

Output

C:\Users\...\pynnd\Scripts\python.exe C:/Users/..../fast_nn/pynnd_test.py
init1 34.970775842666626
prep1 26.96660089492798
quer1 0.8135221004486084
#################################
round 1
#################################
updat 24.724597454071045
prep2 0.0
quer2 0.7625434398651123
#################################
round 2
#################################
updat 24.66456961631775
prep2 0.0
quer2 0.858508825302124
#################################
round 3
#################################
C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
updat 16.782644987106323
prep2 0.0
quer2 49.37075424194336
#################################
round 4
#################################
C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
  File "C:/Users/.../fast_nn/pynnd_test.py", line 129, in <module>
    # test3()
  File "C:/Users/.../fast_nn/pynnd_test.py", line 118, in test3_2
    index.update(ys)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1744, in update
    self.prepare()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1587, in prepare
    self._init_search_graph()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1126, in _init_search_graph
    self._vertex_order = self._search_forest[0].indices
IndexError: list index out of range

Petkomat commented 2 years ago

I think I found the remaining bug. ~Will create~ Created a pull request.