lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

Is it possible to speed up index generation for time series data? #174

Open danielpastor97 opened 2 years ago

danielpastor97 commented 2 years ago

I have time series data that I need to query, but queries have to be done separately for each time point. Here's my current approach using toy data:

import numpy as np
import pynndescent

# m is the number of measurements recorded
# d is the number of data points recorded for each measurement
# s is the number of separate query points (e.g. from another experiment) to query against d for each measurement
m, d, s = 10, 10000, 10
t = np.random.random(m * d * 3).reshape(m, d, 3).astype('float32')
q = np.random.random(m * s * 3).reshape(m, s, 3).astype('float32')
for mi in range(m):
    xb, xq = t[mi, :], q[mi, :]
    index = pynndescent.NNDescent(xb) # n_jobs does not seem to make a difference.
    index.prepare()
    ii, dd = index.query(xq)
    # ... process ii and dd
# Takes around 23 seconds

Setting up and preparing the index takes the vast majority of the processing time. Is it possible to speed up this type of query? The key constraint is that queries have to be done for each t(i) and q(i) pair separately. In our experiments, m can be in the range of 10**4 or more. NNDescent takes n_jobs as an argument, but I do not observe a performance difference with the above code. If I instead use joblib, I do get a noticeable performance improvement:

def profile_pynnd_joblib(xb, xq):
    index = pynndescent.NNDescent(xb)
    index.prepare()
    ii, dd = index.query(xq)
    return None

# joblib parallelization:
from joblib import Parallel, delayed
Parallel(n_jobs=10)(delayed(profile_pynnd_joblib)(t[mi, :], q[mi, :]) for mi in range(m))
# Takes around 2.5 seconds

Because each measurement m is independent, this parallelization is the obvious approach. But I am wondering if there is a smarter way to gain performance: perhaps in preparing the index, or perhaps in how the index is generated in the first place? As I mentioned, m can be quite large for us, so unfortunately neither of these approaches is currently viable at that scale. Suggestions are very welcome.

lmcinnes commented 2 years ago

Assuming you are only ever querying each index once for a small number of points, and assuming the number of samples in each index is not that large, you may in fact be better off with pure brute force. If you care about the nearest neighbors of points within t[mi] itself, or s is quite large, or you want to do multiple queries (possibly with new query sets) against each t[mi], then pynndescent would make sense. But assuming s is small and d is in the low to mid tens of thousands, each measurement needs only s * d distance computations, which could be very tractable.
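
For reference, here is a minimal brute-force sketch in NumPy, assuming Euclidean distance and the same k=10 neighbors that index.query returns by default; brute_force_query is a hypothetical helper for illustration, not part of pynndescent:

import numpy as np

def brute_force_query(xb, xq, k=10):
    # Hypothetical helper: brute-force k-NN of xq (shape (s, 3)) against xb (shape (d, 3)).
    # Pairwise squared Euclidean distances, shape (s, d).
    dists = np.sum((xq[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    # Unsorted indices of the k smallest distances per query row,
    idx = np.argpartition(dists, k, axis=1)[:, :k]
    # then sort just those k candidates per row.
    rows = np.arange(xq.shape[0])[:, None]
    order = np.argsort(dists[rows, idx], axis=1)
    ii = idx[rows, order]
    dd = np.sqrt(dists[rows, ii])
    return ii, dd

# Drop-in replacement for the per-measurement index build in the loop above:
# ii, dd = brute_force_query(t[mi, :], q[mi, :], k=10)

With d around 10**4 and s around 10 this is on the order of a few hundred thousand distance computations per measurement, and it parallelizes across measurements with joblib in the same way as the snippet above.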