lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

Cannot use random projection trees #178

Open Petkomat opened 2 years ago

Petkomat commented 2 years ago

Problem description:

My dataset cannot be represented as an N_INSTANCES x N_DIMENSIONS table of floats: the instances are rows from the main table in a relational database.

Thus, my custom distance measure takes the ids of two instances (i1 and i2 in the code below) as its arguments, and computes the distance between the corresponding instances by accessing data from many tables (imagine computing the distance between two movies, given the actors that appeared in the movies, the genres of the movies, etc.).

This is why random projection trees cannot help: randomly dividing instances according to their ids does not make sense.

This is why I use pynndescent.NNDescent(..., tree_init=False). However, when index.prepare() is called, pynndescent again tries to grow some trees, and the minimal working example (below) raises an error (no hyperplanes found).

An ugly solution:

It turns out that providing examples as [id, id, ..., id] instead of [id] resolves the issue (the more examples we have, the more copies of the id are necessary), but I still think that the behaviour of prepare() should be considered a bug.
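The workaround above can be sketched as follows. The number of copies needed (N_COPIES here) is an empirical guess, not something documented by pynndescent; the metric only ever reads the first entry, so the extra columns are pure padding:

```python
import numpy as np

N_INSTANCES = 100
N_COPIES = 8  # empirically, more copies are needed as the dataset grows

# Instead of a single-column array of ids ([[0], [1], ...]), repeat each id
# so every point becomes [id, id, ..., id]. The custom metric still only
# reads the first entry; the duplicate columns exist only so that the
# hyperplane machinery does not see 1-dimensional data.
xs = np.repeat(
    np.arange(N_INSTANCES, dtype=np.float32), N_COPIES
).reshape(N_INSTANCES, N_COPIES)
```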

A possible solution [needs to be implemented]:

Maybe we could start the search at a random node, at the closest of k random nodes, or at k random nodes.
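The second variant of that idea can be sketched in a few lines. Everything here is hypothetical (random_entry_point is not a pynndescent function, and the toy lookup-table distance stands in for the real id-based metric):

```python
def random_entry_point(query_id, candidate_ids, dist):
    # Hypothetical sketch: instead of descending an RP tree to pick a
    # starting node for the graph search, take a handful of random
    # candidates and start from the one closest to the query.
    return min(candidate_ids, key=lambda c: dist(query_id, c))

# toy id-based distance over a small lookup table of scalar values
DATA = {0: 0.10, 1: 0.90, 2: 0.15, 3: 0.50}
dist = lambda a, b: abs(DATA[a] - DATA[b])

# in practice the candidate ids would be sampled at random
start = random_entry_point(0, [1, 2, 3], dist)
```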

Traceback:

Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 31, in <module>
    index.prepare()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 1555, in prepare
    self._init_search_graph()
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in _init_search_graph
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\pynndescent_.py", line 963, in <listcomp>
    for tree in rp_forest
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1158, in convert_tree_format
    hyperplane_dim = dense_hyperplane_dim(tree.hyperplanes)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent\rp_trees.py", line 1140, in dense_hyperplane_dim
    raise ValueError("No hyperplanes of adequate size were found!")
ValueError: No hyperplanes of adequate size were found!

Minimal (not)working example:

import pynndescent
import numpy as np
import numba

np.random.seed(1234)

N_INSTANCES = 100
# the actual data live outside the index; the index only ever sees ids
DATA = np.random.random(N_INSTANCES).astype(np.float32)

@numba.njit(
    numba.types.float32(
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
    ),
    fastmath=True,
    locals={
        "i1": numba.types.int64,
        "i2": numba.types.int64,
    }
)
def my_metric(i1_array, i2_array):
    # each "point" is a length-1 array holding an instance id; look up
    # the corresponding values and compare them
    i1 = numba.int64(i1_array[0])
    i2 = numba.int64(i2_array[0])
    return np.abs(DATA[i1] - DATA[i2])

# each row is a single-element vector containing the instance's id
xs = np.array([[i for _ in range(1)] for i in range(N_INSTANCES)]).astype(np.float32)
index = pynndescent.NNDescent(xs, n_jobs=1, metric=my_metric, n_neighbors=1, tree_init=False)
index.prepare()
neighbors, distances = index.query(xs)
lmcinnes commented 2 years ago

I agree, that's a bug. I think there are some somewhat sensible ways I can fix that, but I'm not sure when I can get to it.

lmcinnes commented 2 years ago

I just pushed some fixes for this to the master branch. If you have the time to check it out and see if it resolves your problem I would greatly appreciate it.

Petkomat commented 2 years ago

I uninstalled the version that I got via pip and installed the version from this repository. Now, prepare() works; however, the code crashes at the line neighbors, distances = index.query(xs). Here is the full report (together with an irrelevant warning):

C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
  File "C:/Users/.../fast_nn/mwe.py", line 32, in <module>
    neighbors, distances = index.query(xs)
  File "C:\Users\...\pynnd\lib\site-packages\pynndescent-0.5.5-py3.7.egg\pynndescent\pynndescent_.py", line 1659, in query
    indices = self._vertex_order[indices]
AttributeError: 'NNDescent' object has no attribute '_vertex_order'
lmcinnes commented 2 years ago

Ah yes, I think I see what the problem there would be. I think I can fix that; hopefully I'll get to it in the next few days.

Petkomat commented 2 years ago

I have now looked into the code. If I understand correctly, the exception is raised because the code finds the hyperplanes, but their normal vectors have dimension 1 (which is the dimension of the dummy normal vectors [-1.0]).

Since the dimension of the normal vectors equals the dimension of the data, it seems that changing the signature of rp_trees.convert_tree_format(tree, data_size) to rp_trees.convert_tree_format(tree, data_size, data_dim) and setting hyperplane_dim to data_dim resolves the issue.
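A rough sketch of that idea, for illustration only (the real convert_tree_format and dense_hyperplane_dim live in pynndescent's rp_trees.py and look different; the standalone function below only demonstrates the proposed fallback to data_dim for degenerate trees):

```python
import numpy as np

def dense_hyperplane_dim(hyperplanes, data_dim):
    # Sketch of the proposed fallback: return the dimension of the first
    # non-dummy hyperplane, and fall back to the known data dimension when
    # the tree is degenerate (only dummy [-1.0] hyperplanes, i.e. a single
    # leaf node). The current library function raises ValueError instead.
    for hyperplane in hyperplanes:
        if hyperplane.shape[0] > 1:
            return hyperplane.shape[0]
    return data_dim

# a degenerate single-node tree: only the dummy hyperplane is present
dummy_tree = [np.array([-1.0], dtype=np.float32)]
```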

Of course, by doing so, we skip the "degenerate-tree" check in rp_trees.dense_hyperplane_dim(tree.hyperplanes), but this check should be equivalent to checking whether the number of examples in the data exceeds leaf_size (a condition present in both make_angular_tree and make_euclidean_tree), which could easily be tested if desired.

However, (why) do we even care whether the trees contain more than one node, i.e., whether at least one hyperplane was found?

I played around a bit (using hyperplane_dim = data_dim and leaf_size = 200 while N_INSTANCES = 100) and it seems that everything works normally (1D data and/or degenerate single-node trees).

EDIT: All the existing pytest tests also pass (but there is not yet a test that covers 1D data).

lmcinnes commented 2 years ago

Ah yes, I see. The 1D data is potentially going to be an issue. I'll add a test for 1D data and see if I can wring out the last issues.

lmcinnes commented 2 years ago

With regard to the changes you are proposing here: they seem to make sense, but I admit that (without line references) I don't entirely follow exactly what you are proposing. I would certainly welcome a PR with these changes if you can manage it.

Petkomat commented 2 years ago

By mistake, I have also included my implementation of NNDescent.update_with_changed_data, which provides an option for efficiently updating data that changes over time (as is the case in my use case). In that method, the updated rows (in the raw data) are updated in place (instead of being appended to the existing data). Distances that did not change are kept as they are (this is the role of the init_dist argument in the constructor). If any new data arrive, "the official" update is called.

(tested in test_update_with_changed_data)