tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

Faster kNN search with constrained DTW #9

Open rtavenar opened 7 years ago

rtavenar commented 7 years ago

It would be nice to make kNN searches faster when Sakoe-Chiba constrained DTW is used, by relying on LB_Keogh-based pre-filtering.

This should be implemented in the kneighbors method of the KNeighborsTimeSeriesMixin class from the neighbors module.
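For illustration, a minimal sketch of the pre-filtering idea (a sketch only, not the actual kneighbors implementation), assuming a univariate query series q, a list of equal-length reference series T, a band radius r and k neighbors; lb_keogh and dtw are the existing tslearn.metrics functions, everything else is illustrative:

import numpy as np
from tslearn.metrics import dtw, lb_keogh

def knn_query_with_lb_keogh(q, T, r, k):
    best_d = np.full(k, np.inf)              # current k best DTW distances, sorted
    best_i = np.full(k, -1, dtype=np.int64)  # corresponding indices in T
    for i, ts in enumerate(T):
        # cheap lower bound first: if it already exceeds the current k-th best
        # distance, the exact DTW cannot improve the neighbor list
        if lb_keogh(q, ts, radius=r) >= best_d[-1]:
            continue
        d = dtw(q, ts, global_constraint="sakoe_chiba", sakoe_chiba_radius=r)
        if d < best_d[-1]:
            pos = np.searchsorted(best_d, d)
            best_d[pos + 1:] = best_d[pos:-1]
            best_i[pos + 1:] = best_i[pos:-1]
            best_d[pos] = d
            best_i[pos] = i
    return best_d, best_i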

unnamedplay-r commented 6 years ago

In addition to the constraint and the LB_Keogh pruning, I'd suggest adding the cascading pruning methods described in the UCR suite paper. It looks like they could be extended to kNN.
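For instance, a rough sketch of such a cascade (this is a guess at how UCR-style pruning could be organized for a single query/candidate pair, not the UCR suite itself; the O(1) endpoint bound and the function name are illustrative, only lb_keogh and dtw come from tslearn.metrics; univariate series assumed):

import numpy as np
from tslearn.metrics import dtw, lb_keogh

def cascaded_dtw(q, ts, r, best_so_far):
    # 1) O(1) bound using the boundary points: the first (and last) points of
    #    both series must be aligned, so their difference lower-bounds DTW
    lb_endpoints = max(abs(q[0] - ts[0]), abs(q[-1] - ts[-1]))
    if lb_endpoints >= best_so_far:
        return np.inf  # pruned without touching the whole series
    # 2) O(len) LB_Keogh bound, with the same radius as the Sakoe-Chiba band
    if lb_keogh(q, ts, radius=r) >= best_so_far:
        return np.inf  # pruned
    # 3) exact constrained DTW only for candidates surviving both bounds
    return dtw(q, ts, global_constraint="sakoe_chiba", sakoe_chiba_radius=r)

The UCR suite additionally uses early abandoning inside the DTW computation itself, which would require changes deeper in tslearn's dtw implementation.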

NimaSarajpoor commented 2 years ago

@rtavenar (@GillesVandewiele: Would you mind sharing your thoughts on this?)

Do you mind if I work on this? I saw issue #407, and I had an idea. Then, I searched for LB_Keogh and found this issue as well!

So, I think something like this can work:

import numpy as np

from tslearn.metrics import dtw
from tslearn.metrics import lb_keogh

# T is assumed to be a dataset of n equal-length time series and r the
# Sakoe-Chiba band radius (also used as the LB_Keogh envelope radius)

# calculate the LB_Keogh lower bound (LB) for every pair of series
LB = np.zeros((n, n), dtype=np.float64)
for i in range(n - 1):
  for j in range(i + 1, n):
    d = lb_keogh(T[i], T[j], radius=r)
    LB[i, j] = d
    LB[j, i] = d  # redundant: only the upper triangle is read below

kmax = 20  # maximum number of neighbors

# D[i] holds the (sorted) distances to the current kmax nearest neighbors of
# series i, and I[i] the corresponding indices
D = np.full((n, kmax), np.inf, dtype=np.float64)
I = np.full((n, kmax), -1, dtype=np.int64)

for i in range(n - 1):
  for j in range(i + 1, n):
    # compute the exact constrained DTW only if the lower bound does not
    # already exceed the current kmax-th best distance
    if LB[i, j] < D[i, -1]:
      d = dtw(T[i], T[j], global_constraint="sakoe_chiba", sakoe_chiba_radius=r)
      if d < D[i, -1]:
        pos = np.searchsorted(D[i], d, side='right')

        # shift the larger entries one slot to the right and insert d
        D[i, pos + 1:] = D[i, pos:-1]
        D[i, pos] = d

        I[i, pos + 1:] = I[i, pos:-1]
        I[i, pos] = j

# fill the distance matrix (it is exact at the indices stored in `I`)
Dist_matrix = np.full((n, n), np.inf, dtype=np.float64)
for i in range(n):
  for j, nn_i in enumerate(I[i]):
    if nn_i < 0:  # fewer than kmax neighbors were found for series i
      break
    Dist_matrix[i, nn_i] = D[i, j]

np.fill_diagonal(Dist_matrix, 0)

We can use Dist_matrix for kNN (with any k <= kmax).
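For instance (a usage sketch, assuming Dist_matrix was built as above), the k nearest neighbors of each series can be read off by sorting the rows:

import numpy as np

k = 5  # any k <= kmax
masked = Dist_matrix.copy()
np.fill_diagonal(masked, np.inf)  # exclude each series from its own neighbors
nn_indices = np.argsort(masked, axis=1)[:, :k]                 # indices of the k NNs
nn_distances = np.take_along_axis(masked, nn_indices, axis=1)  # their DTW distances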

This is not bad. It shows some improvement. I think the paper mentioned by @unnamedplay-r might be useful as well.

rtavenar commented 2 years ago

@NimaSarajpoor that would be great!

If you start working on this, something that would probably be useful is some kind of benchmark highlighting the improvement in running time when pruning is used, as well as tests ensuring that the results are the same as those obtained when DTW with the Sakoe-Chiba band is used without pruning.
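For instance, a minimal sketch of such a benchmark/test; random_walks and cdist_dtw are existing tslearn utilities, while knn_with_lb_pruning is a hypothetical name for the pruned routine sketched above:

import time

import numpy as np
from tslearn.generators import random_walks
from tslearn.metrics import cdist_dtw

X = random_walks(n_ts=200, sz=150, d=1, random_state=0)
r, k = 10, 5

# baseline: full pairwise constrained DTW, then the k smallest per row
start = time.perf_counter()
full = cdist_dtw(X, global_constraint="sakoe_chiba", sakoe_chiba_radius=r)
np.fill_diagonal(full, np.inf)
baseline_nn = np.argsort(full, axis=1)[:, :k]
baseline_time = time.perf_counter() - start

# pruned search (hypothetical function implementing the approach above,
# expected to return an (n, k) array of neighbor indices)
start = time.perf_counter()
pruned_nn = knn_with_lb_pruning(X, radius=r, k=k)
pruned_time = time.perf_counter() - start

print(f"full DTW: {baseline_time:.2f}s, pruned: {pruned_time:.2f}s")
# neighbor indices should match (up to ties in the distances)
assert np.array_equal(baseline_nn, pruned_nn)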

NimaSarajpoor commented 2 years ago

@rtavenar Got it! Sure, I will keep you posted once I reach a good stage.

NimaSarajpoor commented 2 years ago

We may take advantage of an Upper Bound (UB, i.e. the Euclidean distance, which upper-bounds DTW for equal-length series) as follows:

I think step 2, applied to all observations, has O(N^2 log N) time complexity. Maybe we should first check whether it is worth it (and this may depend on kmax).
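The steps are not spelled out above, so the following is only a guess at one way the UB could be plugged into the earlier snippet (T, n, r and kmax as defined there): use the kmax-th smallest Euclidean distance per series as an initial pruning threshold, before any DTW is computed.

import numpy as np

# Euclidean distance between equal-length series upper-bounds (constrained)
# DTW, since the diagonal alignment is one admissible warping path
flat = np.asarray(T).reshape(n, -1)
UB = np.sqrt(((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(UB, np.inf)

# the true kmax-th nearest-neighbor DTW distance of series i cannot exceed
# the kmax-th smallest upper bound, so the latter is a safe initial threshold
bsf = np.sort(UB, axis=1)[:, kmax - 1]

# in the main loop above, the pruning test would then become
#     if LB[i, j] < min(D[i, -1], bsf[i]):
# so that DTW computations can be skipped even before kmax exact distances
# have been collected for series i.

If step 2 refers to sorting the upper bounds of each series, that is indeed O(N log N) per series, i.e. O(N^2 log N) overall (np.partition would bring it down to O(N) per series, since only the kmax-th smallest value is needed).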