giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

giotto-tda and STUMPY #143

Open ulupo opened 4 years ago

ulupo commented 4 years ago

Description

The STUMPY library contains high-performance (using Numba and/or Dask) code for calculating the so-called "matrix profile" of a time series. There is some overlap in both methods and intent between the calculation of the matrix profile and the pipeline going from a time series to a persistent homology computation via the Takens embedding. It might be worth investigating how much this can be exploited.

seanlaw commented 3 years ago

Hi, @ulupo! I am a bit late to the game here but, as the creator of STUMPY, is there anything I can do to help or any matrix-profile-related questions I can answer?

ulupo commented 3 years ago

Hey Sean! Great to see you here! As you might have guessed, this went on the backburner somewhat. Still, it would be interesting to pursue this further. I am guessing the connection I am hinting at is clear to you? Essentially, my (very basic) understanding of STOMP is that one creates d-dimensional vectors of the form X = (x_t, x_{t+1}, ..., x_{t+d-1}) for each time t, and, given a time window on the data and one such vector taken as reference, looks for minima of the Euclidean (or other d-dimensional) distance with all other vectors in the window. These minima correspond to the motif given by X repeating itself elsewhere in the window.

Is this understanding correct, to first order? If so, then the connection with what's done in giotto-tda would be that in both cases we are performing time-delay embeddings to create high-dimensional vectors, and figuring out information about the time series by looking at the resulting point clouds. I'd say that STUMPY focuses very successfully on what we would describe as the "0-dimensional homology". In giotto-tda, we make sliding windows over the original time series, in each of these apply a time-delay embedding to obtain a point cloud, and for each point cloud in the resulting (time-indexed) collection of point clouds we compute a topological profile which includes information in 0 dimensions (similar to returning the full dendrogram of single linkage clustering), but also information in higher dimensions. For instance, 1-dimensional homology captures the presence of loops, and this indicates periodicity in the initial time series.
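In code, the pipeline I am describing looks roughly as follows. This is a minimal sketch using giotto-tda's SingleTakensEmbedding, SlidingWindow and VietorisRipsPersistence (class names as in recent versions of the library); all parameter values are arbitrary illustrations, not tuned choices:

import numpy as np
from gtda.time_series import SingleTakensEmbedding, SlidingWindow
from gtda.homology import VietorisRipsPersistence

# Toy signal: a noisy sine wave (illustrative only)
rng = np.random.default_rng(0)
t = np.linspace(0, 20 * np.pi, 1000)
y = np.sin(t) + 0.1 * rng.normal(size=t.shape)

# Time-delay embedding: each time index becomes a vector in R^3
embedder = SingleTakensEmbedding(parameters_type="fixed", time_delay=10, dimension=3)
y_embedded = embedder.fit_transform(y)

# Sliding windows over the embedded vectors: one point cloud per window
windows = SlidingWindow(size=100, stride=50)
point_clouds = windows.fit_transform(y_embedded)

# One persistence diagram per point cloud, in homology dimensions 0 and 1
VR = VietorisRipsPersistence(homology_dimensions=[0, 1])
diagrams = VR.fit_transform(point_clouds)
print(diagrams.shape)  # (n_windows, n_features, 3)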

Some vague ideas that could form the basis for a discussion:

  1. Can STUMPY benefit from the heuristics we use for automatically selecting an embedding dimension (d) and a stride between consecutive coordinates in the embedded vectors? These are described here (API ref) and are based on ideas from the literature on dynamical systems.
  2. Would it make sense to perform, say, hierarchical clustering on the point clouds resulting from time-delay embeddings, to identify clusters of motifs in the data? I imagine this is something you understand quite well.

seanlaw commented 3 years ago

@ulupo Thank you for the description! First, you'll have to excuse my ignorance as it pertains to TDA as I have limited exposure to this area (but I am very interested in learning) and, more importantly, I am even less familiar with the terminology.

Essentially, my (very basic) understanding of STOMP is that one creates d-dimensional vectors of the form X = (x_t, x_{t+1}, ..., x_{t+d-1}) for each time t, and, given a time window on the data and one such vector taken as reference, looks for minima of the Euclidean (or other d-dimensional) distance with all other vectors in the window. These minima correspond to the motif given by X repeating itself elsewhere in the window. Is this understanding correct, to first order?

Yes, if I understand you correctly, then this is consistent with my understanding of STOMP. In the time series literature, X is often referred to as a d-dimensional "subsequence" of time series, T, which has length n (i.e., d < n). In STOMP (and other variants), we are basically performing a brute-force scan and computing all pairwise distances between all subsequences and then recording only the minimum distance (i.e., for each subsequence we only record the distance to its one nearest neighbor).

Does this match with your mental model? Hopefully, I'm not misinterpreting your description.

If so, then the connection with what's done in giotto-tda would be that in both cases we are performing time-delay embeddings to create high-dimensional vectors, and figuring out information about the time series by looking at the resulting point clouds.

I am unclear and curious about this point. Would you mind elaborating on this with more detail? Above, you had used d-dimensional to refer to the length (or number of contiguous data points) of subsequence, X. When you say "create high-dimensional vectors", are you referring to computing pairwise distances using many different values of d or are you referring to something else?

I'd say that STUMPY focuses very successfully on what we would describe as the "0-dimensional homology". In giotto-tda, we make sliding windows over the original time series, in each of these apply a time-delay embedding to obtain a point cloud, and for each point cloud in the resulting (time-indexed) collection of point clouds we compute a topological profile which includes information in 0 dimensions (similar to returning the full dendrogram of single linkage clustering), but also information in higher dimensions.

It sounds like each window (or subsequence, X) will be surrounded by a "point cloud", and this cloud represents the distances between all other subsequences in the time series relative to X. Is that correct? I am curious whether this means that the point clouds for each X must be fully obtained before the topological profile can be computed? If so, for a time series with, say, 1-100 million data points, would obtaining all the point clouds be a performance bottleneck?

For instance, 1-dimensional homology captures the presence of loops, and this indicates periodicity in the initial time series.

This is very interesting!

Can STUMPY benefit from the heuristics we use for automatically selecting an embedding dimension (d) and a stride between consecutive coordinates in the embedded vectors? These are described here (API ref) and are based on ideas from the literature on dynamical systems.

Yes! The choice of d is a major issue and often requires some level of domain expertise. Otherwise, there is some research on computing something called a pan-matrix-profile, which claims to be "better" than choosing some d, computing a matrix profile, doubling the size of d, computing a new matrix profile, and repeating this until d exceeds the length of your time series. I think that STUMPY users could benefit from using some heuristic for automatically choosing a few reasonably good values of d. Thank you, I will take a look at the API ref!
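For reference, that heuristic search can be run directly in giotto-tda. A minimal sketch, assuming the takens_embedding_optimal_parameters helper available in recent versions (it picks the time delay via mutual information and the embedding dimension via false nearest neighbors, searching up to the given maxima):

import numpy as np
from gtda.time_series import takens_embedding_optimal_parameters

# Toy periodic signal (illustrative only)
t = np.linspace(0, 20 * np.pi, 2000)
y = np.sin(t)

# Search for a good (time_delay, dimension) pair up to the given maxima
time_delay, dimension = takens_embedding_optimal_parameters(
    y, max_time_delay=30, max_dimension=10
)
print(time_delay, dimension)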

Would it make sense to perform, say, hierarchical clustering on the point clouds resulting from time-delay embeddings, to identify clusters of motifs in the data? I imagine this is something you understand quite well.

Hmmm, I don't know enough to have any opinion. It is an interesting idea. :)

Btw, this is a wonderful conversation and it is really challenging my brain! Thank you for taking the time to discuss, engage, and teach me something new!

ulupo commented 3 years ago

@seanlaw thanks for reacting so quickly and for showing so much interest! Let me try to answer your questions:

I am unclear and curious about this point. Would you mind elaborating on this with more detail? Above, you had used d-dimensional to refer to the length (or number of contiguous data points) of subsequence, X. When you say "create high-dimensional vectors", are you referring to computing pairwise distances using many different values of d or are you referring to something else?

When I say "high-dimensional vectors" I am just referring to the subsequences X, which are ordered d-tuples of real numbers and hence can be thought of geometrically as position vectors in a d-dimensional Euclidean space. Let me know if this is still unclear.

It sounds like each window (or subsequence, X) will be surrounded by a "point cloud", and this cloud represents the distances between all other subsequences in the time series relative to X. Is that correct?

Let me see if I can say it in a different way: for each (integer) time t, let X_t denote the subsequence (x_t, x_{t+1}, ..., x_{t+d-1}) of the time series T. By the above, I am regarding X_t as a vector in d-dimensional Euclidean space. More generally, we could also have a "stride" s and something like X_{t,s} = (x_t, x_{t+s}, ..., x_{t+(d-1)s}), but in the interest of notational clarity let me not do that here. Let also N > d be a certain (integer) time duration which will define sliding windows over T. For each t we can consider the window of size N starting at t. In this window, we can find N - d + 1 subsequences, namely X_t, X_{t+1}, ..., X_{t+N-d}. Since we are seeing these as vectors in Euclidean space, we have defined a correspondence between sliding windows of size N and d-dimensional point clouds of size N - d + 1.

Topological methods, in their most common "global" form, can extract scalar features from entire point clouds. It is important to note that these features are not point-wise. There are "degree 0" features, which tell you about multi-scale connectivity and are essentially the values of the merge events in the full dendrogram of single-linkage hierarchical clustering. There are also "degree 1" features, which sound something like: "there is a loop in your point cloud which appears at distance scale r_1 and is filled in (so there is no longer a hole inside it) at distance scale r_2". As I was hinting at, degree 1 features are interesting because of periodicity, but degree 0 features are where the link with STOMP is stronger.

Re performance: computing sliding windows is cheap, since each window is just an array view on the original data. Computing persistent homology is the bottleneck, but not so much if each sliding window does not contain too many subsequences. The backends are in C++ and they are hugely efficient considering how sophisticated the calculations are; still, there is a fundamental combinatorial explosion when computing in degree 1 and above. Perhaps it is important to note that these "persistent homology" algorithms are difficult to parallelize fully and, as of yet, are only partially parallelizable on, say, a GPU.
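To make the degree-0 claim concrete: the death times in the degree-0 persistence diagram of a point cloud coincide with the merge heights of its single-linkage dendrogram, up to the convention used for the single never-dying component. A minimal sketch checking this numerically, assuming giotto-tda's VietorisRipsPersistence and SciPy's linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage
from gtda.homology import VietorisRipsPersistence

# A small random point cloud standing in for one window's delay vectors
rng = np.random.default_rng(42)
cloud = rng.normal(size=(30, 3))

# Degree-0 persistence: all births are 0, deaths are merge scales
VR = VietorisRipsPersistence(homology_dimensions=[0])
diagram = VR.fit_transform(cloud[None, :, :])[0]
h0_deaths = np.sort(diagram[diagram[:, 2] == 0, 1])

# Single-linkage merge heights (third column of the linkage matrix)
merge_heights = np.sort(linkage(cloud, method="single")[:, 2])

# Compare the overlapping entries (conventions for the infinite
# component may make the feature counts differ by one)
k = min(len(h0_deaths), len(merge_heights))
print(np.allclose(h0_deaths[:k], merge_heights[:k]))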

Yes! The choice of d is a major issue and often requires some level of domain expertise. Otherwise, there is some research on computing something called a pan-matrix-profile

Thanks for the link!

Btw, this is a wonderful conversation and it is really challenging my brain! Thank you for taking the time to discuss, engage, and teach me something new!

Happy to keep it going!

seanlaw commented 3 years ago

Thanks, @ulupo. I will take some time to digest your writing and respond with my questions.

seanlaw commented 3 years ago

I just realized that I had missed your notation where, if I understand correctly, you've designated lowercase x_t as an individual point and uppercase X_t as a vector of points. I think that I have most of the picture but, to help me better understand, I think it would be useful to start with a real (simple) illustrative example. Let's say we have a time series, T:

T = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
n = len(T)  # n = 13
m = 4  # Sliding window size

In matrix profile terminology, a "subsequence" is simply any shorter part of the full time series with length m. If we choose m = 4, then [0, 1, 3, 2] and [14, 15, 1, 2] are both valid subsequences of length m = 4. In fact, all of the (sliding window) subsequences of length m = 4 are:

[[ 0,  1,  3,  2],
 [ 1,  3,  2,  9],
 [ 3,  2,  9,  1],
 [ 2,  9,  1, 14],
 [ 9,  1, 14, 15],
 [ 1, 14, 15,  1],
 [14, 15,  1,  2],
 [15,  1,  2,  2],
 [ 1,  2,  2, 10],
 [ 2,  2, 10,  7]]
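As an aside, with NumPy >= 1.20 the same matrix of subsequences can be built as a zero-copy view of T; a small sketch:

import numpy as np

T = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7])
m = 4

# Each row is one length-m sliding-window subsequence; no data is copied
subsequences = np.lib.stride_tricks.sliding_window_view(T, m)
print(subsequences)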

And, in the most un-optimized/inefficient way (i.e., this approach doesn't scale but serves to explain what STUMPY is doing), we can compute the matrix profile in a few lines of code:

import math

matrix_profile = [math.inf] * (n - m + 1)

for i in range(n - m + 1):
    for j in range(n - m + 1):

        # Skip the trivial self-match at i == j (its distance is zero and
        # would swamp everything; real implementations exclude a whole
        # "exclusion zone" of neighboring indices around i)
        if i == j:
            continue

        # Compute Euclidean distance between T[i : i + m] and T[j : j + m]
        distance = 0
        for k in range(m):
            distance += (T[i + k] - T[j + k])**2
        distance = math.sqrt(distance)

        # Update the matrix profile with the nearest-neighbor distance
        if distance < matrix_profile[i]:
            matrix_profile[i] = distance

print(matrix_profile)
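For comparison, the optimized computation in STUMPY itself is a one-liner. Note that, unlike the toy code above, stumpy.stump computes z-normalized Euclidean distances (more on this further down), so its output will differ from the toy version:

import numpy as np
import stumpy

T = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=np.float64)
m = 4

# First column: the matrix profile (nearest-neighbor distances);
# second column: the index of each subsequence's nearest neighbor
mp = stumpy.stump(T, m)
print(mp[:, 0])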

So, the biggest (unresolved) problem that we have is choosing the "best" subsequence window size m. If you choose an m that is too small then you might miss a pattern (if one exists). It is also possible that different patterns exist at different values of m.

ulupo commented 3 years ago

Thanks for the nice explanation @seanlaw! I think it confirms my previous understanding. The geometric interpretation that I have been selling above comes from thinking of the sequences T[i : i + m] as vectors in an m-dimensional space. Then you are effectively finding the nearest neighbor to each such vector and recording its distance to the latter. Clusters of similar vectors (and, hence, of similar patterns in the original time series) could be found by clustering the resulting cloud of vectors. This is basically my second bullet point from a couple of messages back. I say "hierarchical clustering" there because it is philosophically closest to what TDA would do with these clouds of vectors, but in fact any clustering algorithm would do.
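A minimal sketch of that clustering idea (the distance threshold below is arbitrary and purely illustrative):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

T = np.array([0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7], dtype=float)
m = 4

# View each length-m subsequence of T as a point in R^m
vectors = np.lib.stride_tricks.sliding_window_view(T, m)

# Single-linkage clustering of the subsequence vectors; subsequences in
# the same cluster correspond to similar patterns (motifs) in T
Z = linkage(vectors, method="single")
labels = fcluster(Z, t=3.0, criterion="distance")  # arbitrary threshold
print(labels)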

As for the choice of m (and of a stride parameter s so that each vector would be T[i : i + m : s]), yes I would be interested to see if our heuristics can be any good in your typical workflows and analytics.

seanlaw commented 3 years ago

Thanks for the nice explanation @seanlaw! I think it confirms my previous understanding. The geometric interpretation that I have been selling above comes from thinking of the sequences T[i : i + m] as vectors in an m-dimensional space. Then you are effectively finding the nearest neighbor to each such vector and recording its distance to the latter.

Great! Glad that we are on the same page.

Clusters of similar vectors (and, hence, of similar patterns in the original time series) could be found by clustering the resulting cloud of vectors. This is basically my second bullet point from a couple of messages back. I say "hierarchical clustering" there because it is philosophically closest to what TDA would do with these clouds of vectors, but in fact any clustering algorithm would do.

I believe that the original authors think of it the same way as well (see slides 35-40)! In the matrix profile case, it's essentially a brute-force search for each nearest neighbor (i.e., it is exact and not approximate). As you can imagine, even for a single value of m (the embedding dimension), for a sufficiently long time series (i.e., 10,000,000 - 100,000,000+ data points) it can take two weeks to compute the matrix profile. Considering that TDA is performing some form of clustering, I presume that some pairwise distances are being computed, and so I am really curious where the speed gains come from (maybe there are some approximation techniques?) and what time series length TDA can handle.

I would like to focus a bit of time to better understand the resources that you had provided above (re: "mutual information", "false nearest neighbors", and "takens embedding"). Would you mind giving me some directions as to:

  1. How best to start diving into these resources (i.e., the order in which I should read them)
  2. Which giotto-tda examples/tutorials I can start looking at so that I can apply them to some time series with "known values of m"

I am thinking that I could test things out on a few known time series, share the analysis results with you, and, if things look positive/interesting, we might be able to co-author a short manuscript (author order is unimportant to me, so I am open to whatever works for you) and potentially submit it somewhere? Absolutely no pressure though, as I'm sure you're busy. I just really appreciate your expertise/perspectives and your idea of applying a new approach to tackling this problem! I usually prefer transparency for OSS development, but let me know if you are interested and if it makes any sense to create a private repo to track the progress/results. Again, just talking out loud and absolutely no pressure!

As for the choice of m (and of a stride parameter s so that each vector would be T[i : i + m : s]), yes I would be interested to see if our heuristics can be any good in your typical workflows and analytics.

This is so interesting. The concept of using a "stride" is the same as "downsampling" the time series for us. In fact, this is something that I recommend people do if their time series is long and they need to "try out" different m, but I can appreciate having a heuristic way to select both the "time delay" and the "embedding dimension".
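A quick numerical check of that equivalence (strided subsequences of T are exactly contiguous subsequences of a downsampled copy of T); the values here are arbitrary:

import numpy as np

T = np.arange(20)
m, s, i = 4, 2, 3

# A strided subsequence of T ...
strided = T[i : i + m * s : s]

# ... equals a contiguous length-m subsequence of the downsampled series
downsampled = T[i % s :: s]
print(strided, downsampled[i // s : i // s + m])  # identical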

seanlaw commented 3 years ago

Oh, one additional note. In matrix profiles, we typically compute "z-normalized Euclidean distances" rather than straight Euclidean distances: for each pair of subsequences, we first z-normalize each subsequence (i.e., from each value within a subsequence we subtract the subsequence mean and divide by the subsequence stddev) and only then compute the Euclidean distance. Typically, this z-normalization is performed at run-time, since it is usually impossible to pre-compute all of the z-normalized subsequences. I thought I'd mention this in case it is an issue for the TDA algorithms. As I am going over the "mutual information" part, it is unclear to me how to do this for z-normalized subsequences.
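In code, that distance is simply the following (a minimal sketch that ignores the edge case of a constant subsequence, whose stddev is zero):

import numpy as np

def z_normalized_distance(a, b):
    # z-normalize each subsequence, then take the Euclidean distance;
    # a constant subsequence (stddev == 0) would need special handling
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.linalg.norm(a - b)

# Identical shapes at different offsets and scales have distance ~0
print(z_normalized_distance(np.array([1.0, 2.0, 3.0]),
                            np.array([10.0, 20.0, 30.0])))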

EntAnalyst commented 4 months ago

This is a brilliant conversation; let me know if there's been any more progress :)