Open Petkomat opened 2 years ago
I agree, that's a bug. I think there are some somewhat sensible ways I can fix that, but I'm not sure when I can get to it.
I just pushed some fixes for this to the master branch. If you have the time to check it out and see if it resolves your problem I would greatly appreciate it.
I pip-uninstalled the version that I got via pip and installed the version from this repository. Now, prepare
works, however, the code crashes in the line neighbors, distances = index.query(xs)
. Here is the full report (together with an irrelevant warning):
C:\Users\...\pynnd\lib\site-packages\scipy\sparse\_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
Traceback (most recent call last):
File "C:/Users/.../fast_nn/mwe.py", line 32, in <module>
neighbors, distances = index.query(xs)
File "C:\Users\...\pynnd\lib\site-packages\pynndescent-0.5.5-py3.7.egg\pynndescent\pynndescent_.py", line 1659, in query
indices = self._vertex_order[indices]
AttributeError: 'NNDescent' object has no attribute '_vertex_order'
Ah yes, I think I see what the problem there would. I think I can fix that; hopefully I'll get to it in the next few days.
Now, I looked into the code. If I understand correctly, the exception is raised because the code finds the hyperplanes, but their normal vectors are of dimension 1 (which is the dimension of dummy normal vectors [-1.0]
).
Since the dimension of the normal vectors equals the dimension of the data, it seems that changing the signature of rp_trees.convert_tree_format(tree, data_size)
to rp_trees.convert_tree_format(tree, data_size, data_dim)
and setting hyperplane_dim
to data_dim
resolves the issue.
Of course, by doing so, we skip the "degenerate-tree" check in rp_tree.dense_hyperplane_dim(tree.hyperplanes)
but it seems that this check should be equivalent to the condition number of examples in data > leaf_size
(present both in make_angular_tree
and make_euclidean_tree
), which can be easily tested if wanted.
However, (why) do we even care whether trees contain more than one node, i.e., at least one hyperplane was found?
I played around a bit (using hyperplane_dim = data_dim
and leaf_size = 200
while N_INTANCES = 100
) and it seems that everything works normally (1D data and/ or degenerate single-node trees).
EDIT: Also all the existing pytests pass (but there is no test yet that covers 1D data).
A yes, I see. The 1D data is going to potentially be an issue. I'll add a test for 1D data and see if I can wring out the last issues.
With regard to the changes you are proposing here. They seem to make sense, but I admit (without line references) I don't entirely follow exactly what you are proposing. I would certainly welcome a PR with these changes if you can manage it.
By mistake, I have also included my implementation of NNDescent.update_with_changed_data
, which gives an option for efficient updates of data that changes with time (as is the case in my usecase). In that method, the updated rows (in raw data) are updated (instead of appended to the existing data). Distances that did not change are kept as they are (this is the role of init_dist
argument in the constructor). If any new data come, "the official" update is called.
(tested in test_update_with_changed_data
)
Problem description:
My dataset cannot be represented as a
N_INSTANCES x N_DIMENSIONS
table of floats: the instances are rows from the main table in a relational database.Thus, my custom distance measure takes the ids of two instances (
i1
andi2
in the code below) as its arguments, and computes the distance between the corresponding instances by accessing data from many tables (imagine computing the distance between two movies, given the actors that appeared in the movies, the genres of the movies, etc.).This is why random projection trees cannot help: randomly dividing instances according to their ids does not make sense.
This is why I use
pynndescent.NNDescent(..., tree_init=False)
. However, whenindex.prepare()
is called, pynndescent again tries to grow some trees and the minimal working example (below) raises an error (no hyper planes found).An ugly solution:
It turns out that providing examples as
[id, id, ..., id]
instead of[id]
resolves the issue (the more examples we have, the more copies of the id are necessary), but I still think that the behaviour of theprepare()
should be considered a bug.A possible solution [needs to be implemented]:
Maybe we could start the search in a random node, or the closest among random
k
nodes, ork
random nodes.Traceback:
Minimal (not)working example: