Closed: rtavenar closed this issue 4 years ago
@rtavenar Can you clarify what is meant by "variable-length time series"? Are you referring to streaming data?
Regarding "multi-dimensional time series", STUMPY has support for this in the `stumpy.mstump` function (and with Dask distributed support via `stumpy.mstumped` - note that functions ending with "ed" refer to Dask distributed support), but note that, as the number of dimensions increases, the computational cost will be higher.
> @rtavenar Can you clarify what is meant by "variable-length time series"? Are you referring to streaming data?
What I mean is that in `tslearn`, we do not consider single time series but sets of time series. The question is whether the STUMPY implementation would make it easy to deal with sets of time series in which all time series do not have the same number of timestamps.
> What I mean is that in `tslearn`, we do not consider single time series but sets of time series. The question is whether the STUMPY implementation would make it easy to deal with sets of time series in which all time series do not have the same number of timestamps.
I'm not sure if this helps but, technically, STUMPY does not actually take "time" into account (i.e., there is no concept of time). In the dominant case, it just takes a NumPy array (or a vector in the 1-D case) of floating point numbers, `T`, along with a window size, `m`, and computes the matrix profile and matrix profile indices. So, you can analyze a 1D NumPy array of any length. For a set of time series, one would either compute the matrix profile for each time series independently OR one would concatenate the set into one long time series and then separate each time series with a single `np.nan` value. This latter "trick" allows you to discover motifs across multiple time series while making sure that time series boundaries (i.e., the end of `T1` and the beginning of `T2`) are ignored and not considered "relevant".
In the multi-dimensional case, it only makes sense that the data are "aligned" and have the same number of data points. Otherwise, the matrix profile analysis would make little sense, but maybe I haven't thought about it enough.
Ultimately, it is up to the user to pre-process their data for, say, missing data and to ensure that a multi-dimensional time series already has its data points aligned by "time".
If that isn't clear then please feel free to ask more questions.
> Ultimately, it is up to the user to pre-process their data for, say, missing data and to ensure that a multi-dimensional time series already has its data points aligned by "time".
I agree with this; I don't think it is the library's job to check for NaN values.
Based on a discussion with @seanlaw in #126 we could maybe consider having STUMPY as an optional dependency for this matrixprofile module in order to benefit from their scalable implementations.
In my opinion, this is the best way, there's no need for another matrix profile implementation at the moment, considering that STUMPY is specialized in this task. Moreover, this could free time for implementing algorithms that build on top of the matrix profile. As long as tslearn doesn't have algorithms to use the matrix profile with, using STUMPY as an optional dependency is the best bet in my opinion.
I am happy to give this a try, this will help me get familiar with tslearn!
> I agree with this; I don't think it is the library's job to check for NaN values.
For the record, STUMPY does currently "support" data that contains NaN/inf in the sense that it will not error out (since mean and stddev computations would typically have problems with those values) and gracefully sets distances to `inf` for any subsequence that contains a NaN/inf.
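As a naive sketch of that behavior (not STUMPY's actual implementation, which handles this internally): any length-`m` window containing a NaN or inf can be flagged, and the corresponding matrix profile values set to `inf`:

```python
import numpy as np

def non_finite_windows(T, m):
    """Boolean mask: True for each length-m subsequence containing NaN/inf."""
    bad = ~np.isfinite(T)
    # A window starting at i is "bad" if any of T[i : i + m] is non-finite
    return np.lib.stride_tricks.sliding_window_view(bad, m).any(axis=1)

T = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
mask = non_finite_windows(T, m=3)

# A matrix profile for T would then set mp[mask] = np.inf
print(mask.tolist())  # [True, True, True, False]
```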
> I am happy to give this a try, this will help me get familiar with tslearn!
@tcrasset Let me know if you have any questions about STUMPY or need any help there!
Btw, I noticed in the matrix profile user guide that there is a concept of a "segment". Does this mean that you will always have a segment that you are trying to search for amongst a set of time series? If that's the case, then this might be an easier problem to solve than when you don't know whether or not a pattern (or "repeating segment") even exists.
@seanlaw Thanks, will do! I'm fairly new to all this, should I just ping you here if I have a question?
@tcrasset Yep, just @ me and it looks like I'm already subscribed to this issue. I may not be familiar with tslearn inner workings but I can assist with the STUMPY bits
@tcrasset and I'll be there for the more `tslearn`-related stuff ;)
> Btw, I noticed in the matrix profile user guide that there is a concept of a "segment". Does this mean that you will always have a segment that you are trying to search for amongst a set of time series? If that's the case, then this might be an easier problem to solve than when you don't know whether or not a pattern (or "repeating segment") even exists.
No, what we do is the standard MP stuff described in your tutorial. I think @GillesVandewiele reported a segment for the purpose of illustration.
Indeed, this "segment" notation is used to draw an analogy with our shapelet learning module as the distance calculation between shapelets and time series is very similar to how the matrix profile is constructed. But by pointing this out, I now acknowledge that this analogy could cause more confusion than being helpful so I might scratch that...
@rtavenar At the moment, in `matrix_profile.py` you process all the dimensions at once, by creating a `self.subsequence_length` x `d` sliding window, if I understand correctly. Is that what you wish to do going forward? Should we only implement the multi-dimensional matrix profile? And then it is the user's choice to input either 1 dimension at a time or all of them?
I compared the output of STUMPY matrix profile and the tslearn one with the Trace data set, and they do not match in some places. I'll investigate this further.
Moreover, tslearn does not have the notion of matrix profile index but I guess that would not be too hard to add. Do you want to completely scrap your implementation, or leave it be and add an option to use stumpy for the matrix profile computation?
I'll try to have a [WIP] pull-request by this evening.
@tcrasset Note that a) the multi-dimensional matrix profile output (and its calculation) for, say, `d` time series is fundamentally different from b) computing `d` individual matrix profiles. That is, a multi-dimensional matrix profile (computed via `stumpy.mstump` or `stumpy.mstumped`) is not equivalent to stacking 1-D matrix profiles on top of each other!
And then, depending on what insights you are after, one might also c) concatenate the `d` time series (separated by an `np.nan`) and compute a single 1D matrix profile. Unlike computing the matrix profiles individually, this concatenation (separated by `np.nan`) allows any given subsequence to find its nearest neighbor across all `d` time series. Whereas, in b), any given subsequence can only find a nearest neighbor within its own time series (i.e., subsequence `T1[i : i+m]` will only match another subsequence `T1[j : j+m]` within `T1`).
For the purpose of speed and compatibility with `stumpy.stump` (or `stumpy.stumped`) and `stumpy.gpu_stump`, I'm guessing that you are actually mostly interested in scenarios b) and c).
> Moreover, tslearn does not have the notion of a matrix profile index but I guess that would not be too hard to add.
In case it matters, I actually think that there is a ton of potential for incorporating matrix profile indices for building ML models. For instance, if there's a roughly periodic nature to your time series data then, in addition to the matrix profile values, you can use information (for a given subsequence) like "how many time units away is my nearest neighbor?" or "how far away is my left/right nearest neighbor?". This is just the start.
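To make that concrete, here is a rough O(n²m) sketch (not STUMPY's actual algorithm, which is far faster) showing how the matrix profile indices yield such a feature; the window size and example series are arbitrary:

```python
import numpy as np

def naive_matrix_profile(T, m):
    """Naive z-normalized matrix profile and indices, for illustration only."""
    n = len(T) - m + 1
    excl = int(np.ceil(m / 4))  # trivial-match exclusion zone around each i
    subs = np.lib.stride_tricks.sliding_window_view(T, m)
    # z-normalize every subsequence before comparing them
    z = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    mp = np.full(n, np.inf)
    mpi = np.full(n, -1)
    for i in range(n):
        d = np.linalg.norm(z - z[i], axis=1)
        d[max(0, i - excl): i + excl + 1] = np.inf  # ignore trivial self-matches
        mpi[i] = np.argmin(d)
        mp[i] = d[mpi[i]]
    return mp, mpi

T = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.01 * np.random.rand(200)
mp, mpi = naive_matrix_profile(T, m=25)

# "How many time units away is my nearest neighbor?" as a per-subsequence feature
nn_offset = np.abs(np.arange(len(mpi)) - mpi)
```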
And for completeness:

- `stumpy.stump`, `stumpy.stumped`, and `stumpy.gpu_stump` are all for 1D time series and should all generate the exact same output: an `np.ndarray` where the first column is the matrix profile values, the second column is the corresponding matrix profile indices (i.e., where your nearest neighbor is located), the third column is the corresponding "left" matrix profile indices (i.e., limited to relative index locations that are on your "left", where your nearest left neighbor is), and the fourth column is the corresponding "right" matrix profile indices (i.e., limited to relative index locations that are on your "right", where your nearest right neighbor is).
- `stumpy.stump` and `stumpy.stumped` (or `stumpy.mstump` and `stumpy.mstumped`): any time you see a top-level function ending in "ed", this implies that the function supports matrix profile computations across a distributed cluster of servers via Dask `distributed`. So, in all cases, since STUMPY is `numba` JIT compiled, it will always try to use all of the CPUs on a single server and compute matrix profiles in parallel. The Dask distributed versions simply take these computations, divide them up in an embarrassingly parallel fashion, and spread them across multiple servers (while also using all CPUs on each server).
- `stumpy.gpu_stump`: aside from the function name, the function arguments are (purposely) nearly identical to `stumpy.stump`.

@seanlaw: Concerning the difference between a) and b), when I read https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstump I tend to understand that `stumpy.mstump` stacks 1d matrix profiles. Is it the doc that is misleading, or am I missing something?
Then, in `tslearn`, we assume that the dimensions are `(n_series, n_timestamps, n_features)`, so I would assume that the local distances on which the MP relies are computed between multi-dimensional subseries of shape `(m, n_features)`.
Finally, if I understand correctly, it would make sense to implement both intra-series and across-series MP. We could make that tunable as an `__init__` parameter, I guess.
@tcrasset maybe you can have a look at #261 to see if the fixes there are likely to make things better.
> Concerning the difference between a) and b), when I read https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstump I tend to understand that `stumpy.mstump` stacks 1d matrix profiles. Is it the doc that is misleading, or am I missing something?
@rtavenar I understand the possibility for confusion. `mstump` follows the definition of a multi-dimensional matrix profile according to "Definition 12" of the original multi-dimensional matrix profile paper. It is certainly worth the read but, from what little I can tell, you are looking for the 1D variants.
Technically, each "row" of the multi-dimensional matrix profile corresponds to the number of dimensions used to generate the "best" (lowest averaged value) matrix profile. So, for the first row/dimension, at each `i` (along the time series `T`), the algorithm sorts the matrix profile values across all dimensions, chooses the lowest value, and sets that as the matrix profile value for the first dimension. For the second row/dimension, at each `i` (along the time series `T`), the algorithm sorts the matrix profile values across all dimensions, chooses the lowest two values, and sets the average of these two values as the matrix profile value for the second dimension. And so on and so forth. Essentially, if the total number of time series is equal to 10, then (at each `i`) the second row of the matrix profile will go through all 10 dimensions and extract the 2 (out of 10) dimensions that produce the lowest average matrix profile value. However, it doesn't tell you which 2 dimensions produced that value. Additionally, for a different subsequence, `j = i + 20` (i.e., a nearby subsequence), the 2 dimensions that produce the lowest average matrix profile value can be different from the 2 dimensions for subsequence `i`.
We are currently working on a Tutorial for Multi-dimensional Matrix Profiles but it's a lot trickier than you might expect.
Thus, the multi-dimensional matrix profile is not a stacking of 1D matrix profiles. This is the `mSTOMP`/`mSTAMP` algorithm and this is what a multi-dimensional matrix profile is defined as. Perhaps I can add a note in the docstring (and in the upcoming tutorial) to explicitly state this?
> Then, in tslearn, we assume that the dimensions are `(n_series, n_timestamps, n_features)` so I would assume that local distances on which MP rely are computed between multi-dimensional subseries of shape `(m, n_features)`.
Maybe you can clarify but, in `(m, n_features)`, is `m` referring to the m-th time series amongst the set of `n_series`? And is `n_features` just the raw floating point data for the m-th time series?
STUMPY will never perform pairwise subsequence distance calculations across multiple time series. Pairwise distance calculations are always performed within the same 1D time series.
I don't think I fully understand what you mean when you say "I would assume that local distances on which MP rely are computed between multi-dimensional subseries". Would you mind providing an example?
> Finally, if I understand, it would make sense to implement both intra-series and across-series MP. We could make that tunable as an `__init__` parameter I guess.
For "across-series", you'll have to concatenate the time series (separated by `np.nan`) before calling `stumpy.stump`. Even in the `stumpy.mstump` case, it doesn't search for a subsequence across all time series.
@seanlaw I think you meant to tag @rtavenar
> Maybe you can clarify but in `(m, n_features)` is `m` referring to the m-th time series amongst the set of `n_series`? And is `n_features` just the raw floating point data for the m-th time series?
`(n_series, n_timestamps, n_features)` means we have `n_series` time series of `n_features` dimensions, each of them having `n_timestamps` time steps. A single multi-dimensional time series has shape `(n_timestamps, n_features)`, if you will.
The sliding window is on a per-series basis (e.g. series 0 out of `n_series`), and in a given series, the sliding window has a subsequence length `m` and spans across `n_features` dimensions.
How exactly the computation goes from there, I am not sure, but it is some kind of pairwise distance computation.
> I think you meant to tag @rtavenar
Thank you!
> `(n_series, n_timestamps, n_features)` means we have `n_series` time series of `n_features` dimensions, each of them having `n_timestamps` time steps. A single multi-dimensional time series has shape `(n_timestamps, n_features)` if you will.
Okay, just so I have this right. If my "set" contains only 2 time series and they each contain 5 timestamps and are 3 dimensional:
```python
T0 = np.array([[0.00, 0.01, 0.02, 0.03, 0.04],
               [0.10, 0.11, 0.12, 0.13, 0.14],
               [0.20, 0.21, 0.22, 0.23, 0.24]])

T1 = np.array([[1.00, 1.01, 1.02, 1.03, 1.04],
               [1.10, 1.11, 1.12, 1.13, 1.14],
               [1.20, 1.21, 1.22, 1.23, 1.24]])
```
And so the set of `T0` and `T1` would produce `(n_series, n_timestamps, n_features) = (2, 5, 3)`. Is my understanding correct?
If so, I believe that a multi-dimensional time series in STUMPY has shape `(n_features, n_timestamps)`.
> The sliding window is on a per-series basis (e.g. series 0 out of `n_series`), and in a given series, the sliding window has a subsequence length `m` and spans across `n_features` dimensions. How exactly the computation goes from there, I am not sure, but it is some kind of pairwise distance computation.
This makes sense. In STUMPY, for a given series, pairwise distance calculations never span across multiple dimensions/features (i.e., subsequences in `feature_1` are only compared to other subsequences in `feature_1`).
**Is your feature request related to a problem? Please describe.**
`tslearn` has a matrix profile module that relies on a naive implementation. Based on a discussion with @seanlaw in #126 we could maybe consider having STUMPY as an optional dependency for this matrixprofile module in order to benefit from their scalable implementations.
**Describe the solution you'd like**
That would require improving on the existing `MatrixProfile` class by allowing the user to pick an implementation (using parameters passed at `__init__` time); the `_transform(...)` method should call the correct function. One additional thing to check is how stumpy deals with:
- variable-length time series
- multi-dimensional time series
I will probably not have time to work on it. If anyone is interested to give a hand on this, feel free to tell.
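A hypothetical sketch of what that could look like; the `implementation` parameter name and the naive fallback are assumptions for illustration, not tslearn's actual API, and the STUMPY call is commented out so the snippet stays self-contained:

```python
import numpy as np

class MatrixProfile:
    """Sketch of the proposed class; `implementation` is a hypothetical parameter."""

    def __init__(self, subsequence_length, implementation="numpy"):
        self.subsequence_length = subsequence_length
        self.implementation = implementation

    def _transform(self, T):
        m = self.subsequence_length
        if self.implementation == "stumpy":
            # With the optional dependency installed, the first column of
            # stumpy.stump's output is the matrix profile:
            #   import stumpy
            #   return stumpy.stump(T, m)[:, 0].astype(float)
            raise NotImplementedError("requires the optional stumpy dependency")
        # Naive fallback: z-normalized Euclidean distances, O(n^2 m)
        subs = np.lib.stride_tricks.sliding_window_view(np.asarray(T, float), m)
        z = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
        excl = int(np.ceil(m / 4))
        mp = np.empty(len(z))
        for i in range(len(z)):
            d = np.linalg.norm(z - z[i], axis=1)
            d[max(0, i - excl): i + excl + 1] = np.inf  # no trivial matches
            mp[i] = d.min()
        return mp

mp = MatrixProfile(subsequence_length=10)._transform(np.random.rand(50))
print(mp.shape)  # (41,)
```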