danielpastor97 opened this issue 2 years ago (Open)
Assuming you are only ever querying each index once for a small number of points, and assuming the number of samples in each index is not that large, then you may, in fact, be better off with pure brute force. If you care about the nearest neighbors of points in `t[mi]`, or `s` is quite large, or you want to do multiple queries (possibly with new query sets) for each `t[mi]`, then pynndescent would make sense. But assuming `s` is small and `d` is in the low to mid tens of thousands, then you only need `s * d` total distance computations, which could be very tractable.
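For concreteness, a minimal sketch of the brute-force route described here, assuming `s` query points and `d` index points with `f` features each (the function name and shapes are illustrative, not from the thread):

```python
import numpy as np
from scipy.spatial.distance import cdist

def brute_force_knn(queries, points, k=5):
    """queries: (s, f) array; points: (d, f) array to search."""
    # All s * d pairwise distances in one call.
    dists = cdist(queries, points)                          # shape (s, d)
    # k smallest per row, unordered, via partial sort.
    idx = np.argpartition(dists, k, axis=1)[:, :k]
    # Order those k by actual distance.
    order = np.argsort(np.take_along_axis(dists, idx, axis=1), axis=1)
    idx = np.take_along_axis(idx, order, axis=1)
    return idx, np.take_along_axis(dists, idx, axis=1)
```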
I have time series data that I need to query, but queries have to be done separately for each time point. Here's my current approach using toy data:
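The original snippet is not reproduced here; the following is a minimal sketch of the pattern being described, assuming `m` measurements, each with its own index set `t[i]` of shape `(d, f)` and query set `q[i]` of shape `(s, f)` (all names, shapes, and values are illustrative):

```python
import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(0)
m, d, s, f = 100, 2000, 10, 3          # measurements, index size, queries, features
t = rng.normal(size=(m, d, f))         # one index set per measurement
q = rng.normal(size=(m, s, f))         # one query set per measurement

results = []
for i in range(m):
    index = NNDescent(t[i])            # building the index dominates the runtime
    index.prepare()
    neighbors, distances = index.query(q[i], k=5)
    results.append(neighbors)
```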
Setting up and preparing the index takes up the vast majority of the processing time. Is it possible to speed up this type of query? The key constraint is that queries have to be done for each `t(i)` and `q(i)` pair separately. In our experiments, we can have `m` in the range of `10**4` or more. `NNDescent` takes `n_jobs` as an argument, but I do not observe a performance difference with the above code. If I instead use `joblib` for this, I do get a noticeable performance improvement:
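A sketch of the `joblib` variant being described, reusing the `t` and `q` arrays from the sketch above (hypothetical, not the poster's exact code):

```python
from joblib import Parallel, delayed
from pynndescent import NNDescent

def query_one(points, queries, k=5):
    # Build, prepare, and query one measurement's index.
    index = NNDescent(points)
    index.prepare()
    return index.query(queries, k=k)

# One independent index build + query per measurement, spread across workers.
results = Parallel(n_jobs=-1)(
    delayed(query_one)(t[i], q[i]) for i in range(m)
)
```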
Because each of the `m` measurements is independent, this parallelization makes sense and is the obvious thing to do. But I am wondering if there is a smarter way to gain performance, perhaps in preparing the index, or perhaps in being smarter about how the index is generated. As I mentioned, `m` can be quite large for us, so, unfortunately, neither of these approaches is currently viable. Suggestions are very welcome.