tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.91k stars 340 forks

Scalable matrix profile #260

Closed rtavenar closed 4 years ago

rtavenar commented 4 years ago

Is your feature request related to a problem? Please describe. tslearn has a matrix profile module that relies on a naive implementation. Based on a discussion with @seanlaw in #126 we could maybe consider having STUMPY as an optional dependency for this matrixprofile module in order to benefit from their scalable implementations.

Describe the solution you'd like That would require improving the existing MatrixProfile class by allowing users to pick an implementation (using parameters passed at __init__ time); the _transform(...) method should then call the correct function.

One additional thing to check is how STUMPY deals with:

- variable-length time series
- multi-dimensional time series

I will probably not have time to work on this. If anyone is interested in giving a hand, feel free to say so.

seanlaw commented 4 years ago

@rtavenar Can you clarify what is meant by "variable-length time series"? Are you referring to streaming data?

Regarding "multi-dimensional time series", STUMPY has support for this in the `stumpy.mstump` function (and, with Dask distributed server support, via `stumpy.mstumped`; note that functions ending in "ed" refer to Dask distributed support), but note that with an increase in the number of dimensions, the computational cost will be higher.

rtavenar commented 4 years ago

@rtavenar Can you clarify what is meant by "variable-length time series"? Are you referring to streaming data?

What I mean is that in tslearn, we do not consider single time series but sets of time series. The question is whether STUMPY implementation would make it easy to deal with sets of time series in which all time series do not have the same number of timestamps.

seanlaw commented 4 years ago

What I mean is that in tslearn, we do not consider single time series but sets of time series. The question is whether STUMPY implementation would make it easy to deal with sets of time series in which all time series do not have the same number of timestamps.

I'm not sure if this helps but, technically, STUMPY does not actually take "time" into account (i.e., there is no concept of time). In the dominant case, it just takes a NumPy array (or a vector in the 1-D case) of floating-point numbers, T, along with a window size, m, and computes the matrix profile and matrix profile indices. So, you can analyze a 1-D NumPy array of any length. For a set of time series, one would either compute the matrix profile for each time series independently OR concatenate the set into one long time series and then separate each time series with a single np.nan value. This latter "trick" allows you to discover motifs across multiple time series while making sure that time series boundaries (i.e., the end of T1 and the beginning of T2) are not considered relevant.
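For illustration, here is a minimal numpy-only sketch of the concatenation "trick" described above; the actual matrix profile call (e.g., stumpy.stump) is left as a comment since STUMPY may not be installed:

```python
import numpy as np

# Two independent time series of different lengths
T1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
T2 = np.array([10.0, 11.0, 12.0])

# Concatenate with a single np.nan separator so that any subsequence
# spanning the T1/T2 boundary contains a NaN and is therefore ignored
# by the matrix profile computation.
T = np.concatenate([T1, [np.nan], T2])

print(T.shape)               # (9,)
print(np.isnan(T[len(T1)]))  # True: the separator sits right after T1

# With STUMPY installed, one would then compute a single matrix
# profile across both series (m is the subsequence length):
# mp = stumpy.stump(T, m=3)
```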

In the multi-dimensional case, it only makes sense that the data is "aligned" and have the same number of data points. Otherwise, the matrix profile analysis would make little sense but maybe I haven't thought about it enough.

Ultimately, it is up to the user to pre-process their data (for, say, missing values) and to ensure that a multi-dimensional time series already has its data points aligned by "time".

If that isn't clear then please feel free to ask more questions.

tcrasset commented 4 years ago

Ultimately, it is up to the user to pre-process their data (for, say, missing values) and to ensure that a multi-dimensional time series already has its data points aligned by "time".

I agree with this; I don't think it is the library's job to check for NaN values.

Based on a discussion with @seanlaw in #126 we could maybe consider having STUMPY as an optional dependency for this matrixprofile module in order to benefit from their scalable implementations.

In my opinion, this is the best way; there's no need for another matrix profile implementation at the moment, considering that STUMPY specializes in this task. Moreover, this could free up time for implementing algorithms that build on top of the matrix profile. As long as tslearn doesn't have algorithms that use the matrix profile, making STUMPY an optional dependency is the best bet.

I am happy to give this a try; it will help me get familiar with tslearn!

seanlaw commented 4 years ago

I agree with this, I don't think it is the library's job to check for NaN values.

For the record, STUMPY does currently "support" data that contains NaN/inf in the sense that it will not error out (even though mean and stddev computations would typically have problems with those values) and gracefully sets the distance to inf for any subsequence that contains a NaN/inf.
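The convention can be sketched in plain numpy: flag every sliding window that contains a non-finite value (these are the entries a library following this behavior would set to inf instead of raising):

```python
import numpy as np

T = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
m = 3  # subsequence length

# Boolean mask: does the subsequence starting at i contain a NaN/inf?
bad = np.array([
    not np.all(np.isfinite(T[i:i + m]))
    for i in range(len(T) - m + 1)
])

# A matrix profile following this convention would set those entries
# to np.inf rather than erroring out.
print(bad)  # [ True  True  True False]
```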

I am happy to give this a try; it will help me get familiar with tslearn!

@tcrasset Let me know if you have any questions about STUMPY or need any help there!

seanlaw commented 4 years ago

Btw, I noticed in the matrix profile user guide that there is a concept of a "segment". Does this mean that you will always have a segment that you are trying to search for amongst a set of time series? If that's the case, then this might be an easier problem to solve than when you don't know whether or not a pattern (or "repeating segment") even exists.

tcrasset commented 4 years ago

@seanlaw Thanks, will do! I'm fairly new to all this, should I just ping you here if I have a question?

seanlaw commented 4 years ago

@tcrasset Yep, just @ me; it looks like I'm already subscribed to this issue. I may not be familiar with tslearn's inner workings, but I can assist with the STUMPY bits.

rtavenar commented 4 years ago

@tcrasset and I'll be there for the more tslearn-related stuff ;)

rtavenar commented 4 years ago

Btw, I noticed in the matrix profile user guide that there is a concept of a "segment". Does this mean that you will always have a segment that you are trying to search for amongst a set of time series? If that's the case, then this might be an easier problem to solve than when you don't know whether or not a pattern (or "repeating segment") even exists.

No, what we do is the standard MP stuff described in your tutorial. I think @GillesVandewiele reported a segment for the purpose of illustration.

GillesVandewiele commented 4 years ago

Indeed, this "segment" notation is used to draw an analogy with our shapelet learning module, as the distance calculation between shapelets and time series is very similar to how the matrix profile is constructed. But now that you point this out, I acknowledge that this analogy could cause more confusion than be helpful, so I might scrap that...

tcrasset commented 4 years ago

@rtavenar At the moment, in matrix_profile.py you process all the dimensions at once, by creating a self.subsequence_length x d sliding window, if I understand correctly. Is that what you wish to do going forward? Should we only implement the multi-dimensional matrix profile, leaving it to the user to input either one dimension at a time or all of them?
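As a sketch of what such a subsequence_length x d sliding window looks like (assuming numpy >= 1.20 for sliding_window_view; this is an illustration, not tslearn's actual code):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_timestamps, d = 8, 2
X = np.arange(n_timestamps * d, dtype=float).reshape(n_timestamps, d)

m = 3  # i.e., self.subsequence_length
# One window per starting timestamp, each of shape (m, d):
windows = sliding_window_view(X, window_shape=(m, d)).reshape(-1, m, d)

print(windows.shape)  # (6, 3, 2), i.e., (n_timestamps - m + 1, m, d)
```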

I compared the output of STUMPY matrix profile and the tslearn one with the Trace data set, and they do not match in some places. I'll investigate this further.

Moreover, tslearn does not have the notion of a matrix profile index, but I guess that would not be too hard to add. Do you want to completely scrap your implementation, or leave it be and add an option to use STUMPY for the matrix profile computation?

I'll try to have a [WIP] pull-request by this evening.

seanlaw commented 4 years ago

@tcrasset Note that a) the multi-dimensional matrix profile output (and its calculation) for, say, d time series is fundamentally different from b) computing d individual matrix profiles. That is, a multi-dimensional matrix profile (computed via stumpy.mstump or stumpy.mstumped) is not equivalent to stacking 1-D matrix profiles on top of each other!

And then, depending on what insights you are after, one might also c) concatenate the d time series (separated by an np.nan) and compute a single 1-D matrix profile. Unlike computing the matrix profiles individually, this concatenation (separated by np.nan) allows any given subsequence to find its nearest neighbor across all d time series. Whereas, in b), any given subsequence can only find a nearest neighbor within its own time series (i.e., subsequence T1[i : i+m] will only match another subsequence T1[j : j+m] within T1).

For the purpose of speed and compatibility with stumpy.stump (or stumpy.stumped) and stumpy.gpu_stump, I'm guessing that you are actually mostly interested in scenarios b) and c).

Moreover, tslearn does not have the notion of matrix profile index but I guess that would not be too hard to add.

In case it matters, I actually think that there is a ton of potential for incorporating matrix profile indices for building ML models. For instance, if there's a roughly periodic nature to your time series data then, in addition to the matrix profile values, you can use information (for a given subsequence) like "how many time units away is my nearest neighbor?" or "how far away is my left/right nearest neighbor?". This is just the start.

And for completeness:

  1. stumpy.stump, stumpy.stumped, and stumpy.gpu_stump are all for 1D time series and should all generate the exact same output: an np.ndarray where the first column is the matrix profile values, the second column is the corresponding matrix profile indices (i.e., where is your nearest neighbor located), the third column is the corresponding "left" matrix profile indices (i.e., limited to relative index locations that are on your "left", where is your nearest left neighbor), and the fourth column is the corresponding "right" matrix profile indices (i.e., limited to relative index locations that are on your "right", where is your nearest right neighbor)
  2. You may be confused about the difference between, say, stumpy.stump and stumpy.stumped (or stumpy.mstump and stumpy.mstumped). Any time you see a top-level function ending in "ed", this implies that the function supports matrix profile computations across a distributed cluster of servers via Dask distributed. So, in all cases, since STUMPY is Numba JIT-compiled, it will always try to use all of the CPUs on a single server and compute matrix profiles in parallel. The Dask distributed versions simply take these computations, divide them up in an embarrassingly parallel fashion, and spread them across multiple servers (while also using all CPUs on each server).
  3. Here is a Google Colab notebook demonstrating how to use stumpy.gpu_stump. Aside from the function name, the function arguments are (purposely) nearly identical to stumpy.stump.
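To make point 1 concrete, here is a brute-force, numpy-only sketch that produces the same four-column layout (value, nearest-neighbor index, left index, right index). This is an illustration of the output format, not STUMPY's actual implementation, and it assumes no subsequence is constant (so z-normalization is well defined):

```python
import numpy as np

def naive_matrix_profile(T, m):
    """Brute-force sketch of the four-column matrix profile layout:
    [value, index, left index, right index]. Illustrative only."""
    n = len(T) - m + 1
    # z-normalize every subsequence (assumes non-constant subsequences)
    S = np.array([T[i:i + m] for i in range(n)])
    S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    out = np.empty((n, 4))
    excl = int(np.ceil(m / 4))  # trivial-match exclusion zone
    for i in range(n):
        d = np.linalg.norm(S - S[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf  # exclude trivial matches
        left = d[:i] if i > 0 else np.array([np.inf])
        right = d[i + 1:] if i < n - 1 else np.array([np.inf])
        out[i] = [d.min(), d.argmin(),
                  left.argmin() if np.isfinite(left.min()) else -1,
                  (i + 1 + right.argmin()) if np.isfinite(right.min()) else -1]
    return out

# The motif [0, 1, 3] occurs at positions 0 and 3, so their matrix
# profile values are ~0 and their indices point at each other:
mp = naive_matrix_profile(np.array([0., 1., 3., 0., 1., 3., 2., 5.]), 3)
print(mp[0, :2])  # value ~0.0, nearest-neighbor index 3
```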
rtavenar commented 4 years ago

@seanlaw: Concerning the difference between a) and b), when I read https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstump I tend to understand that stumpy.mstump stacks 1-D matrix profiles. Is it the doc that is misleading, or am I missing something?

Then, in tslearn, we assume that the dimensions are (n_series, n_timestamps, n_features) so I would assume that local distances on which MP rely are computed between multidimensional subseries of shape (m, n_features).

Finally, if I understand, it would make sense to implement both intra-series and across-series MP. We could make that tunable as an __init__ parameter I guess.
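A minimal sketch of what such a tunable __init__ could look like; the parameter names `implementation` and `scope` are illustrative assumptions, not tslearn's actual API:

```python
class MatrixProfile:
    """Hypothetical sketch of the tunable API discussed above.
    The parameter names are illustrative, not tslearn's actual ones."""

    def __init__(self, subsequence_length=4, implementation="numpy",
                 scope="intra"):
        if implementation not in ("numpy", "stumpy"):
            raise ValueError("implementation must be 'numpy' or 'stumpy'")
        if scope not in ("intra", "across"):
            raise ValueError("scope must be 'intra' or 'across'")
        self.subsequence_length = subsequence_length
        self.implementation = implementation
        self.scope = scope

    def _transform(self, X):
        if self.implementation == "stumpy":
            # import stumpy here and dispatch to stumpy.stump /
            # stumpy.gpu_stump depending on further options
            raise NotImplementedError
        # ... otherwise fall back to the existing naive implementation ...
        raise NotImplementedError
```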

rtavenar commented 4 years ago

@tcrasset maybe you can have a look at #261 to see if the fixes there are likely to make things better.

seanlaw commented 4 years ago

Concerning the difference between a) and b), when I read https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstump I tend to understand that stumpy.mstump stacks 1d matrix profiles. Is it the doc that is misleading, or am I missing something.

@rtavenar I understand the possibility for confusion. mstump follows the definition of a multi-dimensional matrix profile according to "Definition 12" of the original multi-dimensional matrix profile paper. It is certainly worth the read but from what little I can tell, you are looking for the 1D variants.

Technically, each "row" of the multi-dimensional matrix profile corresponds to the number of dimensions used to generate the "best" (lowest averaged value) matrix profile. So, for the first row/dimension, at each i (along the time series T), the algorithm sorts the matrix profile values across all dimensions and chooses the lowest value and sets that as the matrix profile value for the first dimension. For the second row/dimension, at each i (along the time series T), the algorithm sorts the matrix profile values across all dimensions, chooses the lowest two values, and sets the average of these two values as the matrix profile value for the second dimension. And so on and so forth. Essentially, if the total number of time series is equal to 10, then, say, (at each i) the second row of the matrix profile will go through all 10 dimensions and extract the 2 (out of 10) dimensions that produce the lowest average matrix profile value. However, it doesn't tell you which 2 dimensions produced that value. Additionally, for a different subsequence, j = i + 20 (i.e., a nearby subsequence), the 2 dimensions that produce the lowest average matrix profile value can be different from the 2 dimensions for subsequence i.
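The "sort across dimensions, average the k lowest" aggregation can be sketched in a few lines of numpy. The per-dimension values below are made-up inputs for illustration, standing in for the per-dimension matrix profile values at each position:

```python
import numpy as np

# Hypothetical per-dimension matrix profile values at each position i:
# rows = dimensions, columns = positions along T.
D = np.array([[3.0, 0.5, 2.0],
              [1.0, 4.0, 0.2],
              [2.0, 1.0, 5.0]])

# Sort across dimensions at each position; the k-th row of the
# multi-dimensional matrix profile averages the k lowest values.
D_sorted = np.sort(D, axis=0)
multi_dim_mp = np.cumsum(D_sorted, axis=0) / np.arange(1, D.shape[0] + 1)[:, None]

print(multi_dim_mp[0])  # lowest value per position: [1.  0.5 0.2]
print(multi_dim_mp[1])  # average of two lowest:     [1.5 0.75 1.1]
```

Note that, exactly as described above, the result records only the aggregated values, not which dimensions produced them.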

We are currently working on a Tutorial for Multi-dimensional Matrix Profiles but it's a lot trickier than you might expect.

Thus, the multi-dimensional matrix profile is not a stacking of 1D matrix profiles. This is the mSTOMP/mSTAMP algorithm and this is what a multi-dimensional matrix profile is defined as. Perhaps, I can add a note in the docstring (and in the upcoming tutorial) to explicitly state this?

Then, in tslearn, we assume that the dimensions are (n_series, n_timestamps, n_features) so I would assume that local distances on which MP rely are computed between multidimensional subseries of shape (m, n_features).

Maybe you can clarify, but in (m, n_features), is m referring to the mth time series amongst the set of n_series? And is n_features just the raw floating point data for the mth time series?

STUMPY will never perform pairwise subsequence distance calculations across multiple time series. Pairwise distance calculations are always performed within the same 1D time series.

I don't think I fully understand what you mean when you say "I would assume that local distances on which MP rely are computed between multidimensional subseries". Would you mind providing an example?

Finally, if I understand, it would make sense to implement both intra-series and across-series MP. We could make that tunable as an __init__ parameter I guess.

For "across-series", you'll have to concatenate the time series (separated by np.nan) before calling stumpy.stump. Even in the stumpy.mstump case, it doesn't search for a subsequence across all time series.

tcrasset commented 4 years ago

@seanlaw I think you meant to tag @rtavenar

Maybe you can clarify but in (m, n_features) is m referring to the mth time series amongst the set of n_series? And is n_features just the raw floating point data for the mth time series?

(n_series, n_timestamps, n_features) means we have n_series time series of n_features dimensions, each of them having n_timestamps time steps. A single multi-dimensional time series has shape (n_timestamps, n_features) if you will. The sliding window is on a per series basis (e.g. series 0 out of n_series), and in a given series, the sliding window has a subsequence length m and spans across n_features dimensions. How exactly the computation goes from there, I am not sure, but it is some kind of pairwise distance computation.

seanlaw commented 4 years ago

I think you meant to tag @rtavenar

Thank you!

(n_series, n_timestamps, n_features) means we have n_series time series of n_features dimensions, each of them having n_timestamps time steps. A single multi-dimensional time series has shape (n_timestamps, n_features) if you will.

Okay, just so I have this right. If my "set" contains only 2 time series and they each contain 5 timestamps and are 3 dimensional:

```python
T0 = np.array([[0.00, 0.01, 0.02, 0.03, 0.04],
               [0.10, 0.11, 0.12, 0.13, 0.14],
               [0.20, 0.21, 0.22, 0.23, 0.24]])

T1 = np.array([[1.00, 1.01, 1.02, 1.03, 1.04],
               [1.10, 1.11, 1.12, 1.13, 1.14],
               [1.20, 1.21, 1.22, 1.23, 1.24]])
```

And so the set of T0 and T1 would produce (n_series, n_timestamps, n_features) = (2, 5, 3). Is my understanding correct?

If so, I believe that a multi-dimensional time series in STUMPY has shape (n_features, n_timestamps)
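If that is right, converting from the tslearn layout to the per-series shape would just be a transpose of the last two axes; a small sketch under that assumption:

```python
import numpy as np

# tslearn layout: (n_series, n_timestamps, n_features)
X = np.zeros((2, 5, 3))

# Per series, the (n_features, n_timestamps) layout described above
# is obtained by transposing the last two axes:
series_0 = X[0].T                       # shape (3, 5)
# or for the whole set at once:
X_swapped = np.transpose(X, (0, 2, 1))  # shape (2, 3, 5)

print(series_0.shape, X_swapped.shape)  # (3, 5) (2, 3, 5)
```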

The sliding window is on a per series basis (e.g. series 0 out of n_series), and in a given series, the sliding window has a subsequence length m and spans across n_features dimensions. How exactly the computation goes from there, I am not sure, but it is some kind of pairwise distance computation.

This makes sense. In STUMPY, for a given series, pairwise distance calculations never span across multiple dimensions/features (i.e., say, subsequences in feature_1 are only compared to other subsequences in feature_1).