wannesm / dtaidistance

Time series distances: Dynamic Time Warping (fast DTW implementation in C)

'SSMatch' object has no attribute 'segment' #173

Closed tommedema closed 2 years ago

tommedema commented 2 years ago

I'm trying to find the best 3 subseries matches:

import numpy as np
import matplotlib.pyplot as plt
from dtaidistance.subsequence.dtw import subsequence_alignment
from dtaidistance import dtw_visualisation as dtwvis
from dtaidistance.subsequence.dtw import subsequence_search

data = np.cumsum(np.random.uniform(-0.5, 0.5, 1000000))
query = np.cumsum(np.random.uniform(-0.5, 0.5, 100))

series = data

k = 3
s = []
w = 22
ws = int(np.floor(w/2))
wn = int(np.floor((len(series) - (w - ws)) / ws))
si, ei = 0, w
for i in range(wn):
    s.append(series[si:ei])
    si += ws
    ei += ws

sa = subsequence_search(query, s)

kmatches = list(sa.kbest_matches(k=k))
segments = [m.segment for m in kmatches]

print(segments)

But I am unable to retrieve the segments:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/lm/xhqw06kd341ck445l0rpz_cr0000gn/T/ipykernel_34893/2165800808.py in <module>
     25 
     26 kmatches = list(sa.kbest_matches(k=k))
---> 27 segments = [m.segment for m in kmatches]
     28 
     29 print(segments)

/var/folders/lm/xhqw06kd341ck445l0rpz_cr0000gn/T/ipykernel_34893/2165800808.py in <listcomp>(.0)
     25 
     26 kmatches = list(sa.kbest_matches(k=k))
---> 27 segments = [m.segment for m in kmatches]
     28 
     29 print(segments)

AttributeError: 'SSMatch' object has no attribute 'segment'

When I run pip show dtaidistance it says I have version 2.3.9 installed.

wannesm commented 2 years ago

It seems your intention is better served by subsequence_alignment. subsequence_search is for when you already have a list of segments, and its SSMatch objects only return the index of the matching segment (as idx). With alignment you search for the query in a single series and get SAMatch objects, which do have a segment property.
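
For reference, a minimal sketch of the alignment-based approach (the toy data is a placeholder; subsequence_alignment, kbest_matches, segment, and value are the calls discussed in this thread):

import numpy as np
from dtaidistance.subsequence.dtw import subsequence_alignment

# Toy data, smaller than the original example so the sketch runs quickly.
series = np.cumsum(np.random.uniform(-0.5, 0.5, 10000))
query = np.cumsum(np.random.uniform(-0.5, 0.5, 100))

# Align the query against the full series; no manual windowing is needed.
sa = subsequence_alignment(query, series)

# Each match is an SAMatch with a segment (start and end index) and a value.
for match in sa.kbest_matches(k=3):
    print(match.segment, match.value)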

wannesm commented 2 years ago

As a side note, the cumsum will result in a continuously increasing function. Since DTW is sensitive to absolute values, it will prefer subsequences at the start. You might need to detrend first, or split the series into segments yourself (as in the docs) and normalise each segment.
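
If you go the splitting route, a rough sketch of per-segment z-normalisation could look like this (the window length and the znorm helper are illustrative assumptions, not part of the library):

import numpy as np

series = np.cumsum(np.random.uniform(-0.5, 0.5, 10000))  # toy series as above, but shorter

def znorm(seg):
    # Z-normalise one segment; fall back to mean-centring for a constant segment.
    std = seg.std()
    return (seg - seg.mean()) / std if std > 0 else seg - seg.mean()

w = 100  # illustrative window length
segments = [znorm(series[i:i + w]) for i in range(0, len(series) - w + 1, w)]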

tommedema commented 2 years ago

@wannesm Perfect, thank you. The results I get with subsequence_alignment seem promising. However, I face a few challenges related to windowing, early abandoning, and data normalization.

I went ahead and moved the code to a standalone collaborative notebook:

https://colab.research.google.com/drive/1kI2Gf6hGIcZFGxiFS9amMYyhMaW7YIPc?usp=sharing

Windowing

It's unclear how I can now apply a window and enable pruning or early abandoning as documented for the dtw distance function here.

Is it possible to apply a window and max_dist or use_pruning parameters when using subsequence_alignment? I've looked through the documentation but couldn't find such options (I did notice a reference to align_fast but am unsure how it can be invoked).

Early abandon by distance threshold

I am now able to retrieve the segments, and from the documentation for SAMatch I inferred that the value property is the DTW distance (values like [0.005617328125283836, 0.00566894337741499, 0.005688609134028068]). What units are these distances in? I'd like to be able to compare the distances across different time series.

My goal is to set a threshold distance and then find all matches with a distance less than the threshold (rather than an arbitrary best 3 matches). I know that I can loop over the generator from kbest_matches, like so:

k = 100
overlap = False

segments = []
for match in sa.kbest_matches(k = k, overlap = overlap):
    segments.append(match.segment)
    print(match.segment, flush = True)
    print(match.value, flush = True)

I can then break out once the threshold has been exceeded. However, it seems most of the computation is already done earlier, in the subsequence_alignment invocation, so breaking here doesn't actually speed things up. Is there a way to pass this threshold to subsequence_alignment so it can abandon early and save computation?

Normalization

While the graphs returned without normalization seem ok, I wanted to follow your advice regarding normalization (I assume this is what you meant by detrending). I've tried both z-score normalization and differencing, but both seem to produce odd results:

Without normalization (looks good) [ notebook ]:

[screenshot]

Though it's suspicious that the matches are always of shorter length than the query. Could this be because the data is not normalized?

With z-score normalization [ notebook ]:

[screenshot]

With differencing, where I am using the returned segment indices to graph matches in the original time series [ notebook ]:

[screenshot]
wannesm commented 2 years ago

Interesting questions, thanks for the detailed examples.

First, I probably jumped too quickly to normalizing because of the cumsum, and missed that the example samples both positive and negative values, which means that simple detrending is probably not an option. Still, the plot in the Colab shows quite different absolute values for the series and the query (not necessarily for all runs of random.uniform), especially towards the end. For example, for me the query values lie in [-2, 3] and the series in [-10, 450], and your last plot with s1, s2, s3 seems to indicate that for you the series lies in approximately [-10, 200].

For your questions:

Windowing: This is not supported for alignment. The idea of a window, which limits how much expansion or shifting is allowed, cannot readily be applied to alignment, because that shifting is exactly what we use to search over the entire series. But since the query is small, the gains would probably be limited anyway.

Early abandoning: Using max_dist is also not supported for alignment. The max_dist feature for DTW makes use of properties that do not hold for alignment. There is probably some version of this that would be possible, but it requires a new algorithm for how max_dist prunes the cumulative cost matrix (which is currently pruned based on the values for psi). The question is whether it would save enough to make it worth the effort (but an interesting idea :-) ).

The resulting values: These are the distance measures returned by DTW but normalized, thus divided by the length of the query (in this case 100). This is for when you want to compare results for multiple queries.
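
As a small worked example of that relationship (plain arithmetic, not a library call):

query_length = 100
normalised_value = 0.005617328125283836   # match.value as printed above
raw_dtw_distance = normalised_value * query_length   # roughly 0.56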

Early abandoning outside of kbest_matches: You can also get all the best values directly from the attribute sa.paths; the last row represents all possible paths. What kbest_matches does additionally is take overlap into account, otherwise it would return the same path multiple times with subtle variations. So calling kbest_matches with a high value for k (or infinity) and breaking out of the loop is probably the fastest approach when you want to avoid overlap. The method is iterative and spends no effort on future matches, even if k is set higher than the point where you break.
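
So a sketch of that pattern could look like the following, building on the sa object from the subsequence_alignment call above (the threshold is an arbitrary assumption):

threshold = 0.01      # assumed cut-off on the normalised distance (match.value)
segments = []
# A high k effectively means "all matches"; kbest_matches is iterative,
# so nothing beyond the break point is computed.
for match in sa.kbest_matches(k=1000000, overlap=False):
    if match.value > threshold:
        break  # the values printed above increase, so later matches only get worse
    segments.append(match.segment)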

Normalization: You would need to z-normalize per pre-defined window over your series. This of course somewhat defeats the advantage of using the alignment method, where no window needs to be given, so differencing could be the better option. The results you obtain seem to make sense (I reran your notebook). It is hard to see on the graph you show because of the widely different absolute numbers, which flatten all the curves. If you change plt.figure to fig, ax = plt.subplots(nrows=k+1, ncols=1) and the plot commands to ax[0].plot(query, label = "query") and ax[num].plot(series[match.segment[0]:..., they will look more as expected. Note that if you match to a negative part of the function, it will be the inverse pattern. If this is not the intended behavior you need to do dquery = differencing(query+np.min(query), smooth=0.1).
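
A sketch combining the differencing and subplot suggestions (assuming differencing is imported from dtaidistance.preprocessing as in the library docs; the toy data, k, and the inclusive end index are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from dtaidistance.preprocessing import differencing
from dtaidistance.subsequence.dtw import subsequence_alignment

series = np.cumsum(np.random.uniform(-0.5, 0.5, 10000))  # toy data, shorter than the notebook's
query = np.cumsum(np.random.uniform(-0.5, 0.5, 100))

k = 3
dquery = differencing(query, smooth=0.1)
dseries = differencing(series, smooth=0.1)
sa = subsequence_alignment(dquery, dseries)

fig, ax = plt.subplots(nrows=k + 1, ncols=1)
ax[0].plot(query, label="query")
for num, match in enumerate(sa.kbest_matches(k=k), start=1):
    s, e = match.segment
    ax[num].plot(series[s:e + 1])  # plot the match in the original, undifferenced series
plt.show()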

tommedema commented 2 years ago

Thank you @wannesm, super helpful!

You mentioned that the returned DTW distance is normalized by dividing it by the query length. Is it safe to assume that all DTW distances in the library are normalized like this? E.g. if I invoke dtw.distance_fast several times but with different query lengths, can I compare these distances directly, or should I first divide them by the query length?

wannesm commented 2 years ago

The normalisation is only done for this alignment method. We followed the convention used in the literature for searches with a query. For normal DTW there is no explicit concept of a query: both series are equivalent, and the result is the same independent of the order of the arguments.
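
To make the difference concrete, a sketch using dtw.distance_fast (the data is hypothetical; dividing by the query length yourself is one option for comparing across query lengths, not something the library does for you):

import numpy as np
from dtaidistance import dtw

query_a = np.random.random(100)    # hypothetical query of length 100
query_b = np.random.random(250)    # hypothetical query of length 250
segment = np.random.random(500)    # hypothetical series segment

d1 = dtw.distance_fast(query_a, segment)
d2 = dtw.distance_fast(query_b, segment)

# The plain DTW distances are not normalised; divide by the query length
# yourself if you want to compare results across different query lengths.
print(d1 / len(query_a), d2 / len(query_b))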

It's a good point that this is confusing. I will also add a distance property to SAMatch, next to value, to avoid confusion.