matrix-profile-foundation / matrixprofile

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
https://matrixprofile.org
Apache License 2.0
362 stars 62 forks source link

Understanding/Use Case: Clustering of repeated patterns #34

Open NimaSarajpoor opened 4 years ago

NimaSarajpoor commented 4 years ago

Hi,

I recently found sth strange while using a matrix profile.

I used the data (1d-array: 8760 hourly data between 0 and 1, so not normalized) saved into this CSV file: https://drive.google.com/file/d/1dZgaEZjD2HsnXu_wWaGKPCPRbDJ1_dty/view?usp=sharing

And here is my simple code:

`

mydata: read from csv file (with length: 8760)

H=24 profile = mp.compute(mydata, windows=[H]) ' image

' ind_min = np.argmin(profile['mp']) ind_min_pair = profile['pi'][ind_min]

print('min of dist [best match of all]', profile['mp'][ind_min]) plt.plot(mydata[ind_min:ind_min+H]) plt.plot(mydata[ind_min_pair : ind_min_pair+H]) plt.show() `

So, the goal was to find the closest pair of sequences (with length H=24). And here is the result:

image

It is noteworthy to mention that I also checked out the plot for the first few indices (the matches for index from 0 to 20) and the plots were good. However, when I tried to get the best match of all, I got the result attached above!

The same problem exists for these two pair of indices as well: (44,424) and (46,426), with min dist=0 or (63,425):

image

Am I missing sth here? Isn't MP 1-dim array is: np.min(distance_matrix,axis=1)? (where distance_matrix[i,j] is the Euclidean distance between ts[i,i+H] and ts[j,j+H]) ???

If I understand correctly, mp calculated is the Euclidean Distance between normalized sequences. However, mydata[421:421+H] is all zeros. Do you know how the algorithm treats a situation where x (one of the subsequences in the time series) is all zeros: x_norm = (x-0) / 0 ???

############################################################################################## I tried to figure out how muinvn function works but I couldn't.

tylerwmarrs commented 4 years ago

@Ninimama would you mind mentioning what the issue was? I imagine that you resolved the issue as you closed it.

NimaSarajpoor commented 4 years ago

Actually I didn't resolve the issue. But I couldn't understand whether it is a bug or not.

My main concern is: Does MatrixProfile find the best matches based on the normalized subsequences? [z-normalized Euclidean distance as mentioned in the Matrix Profile I paper] Because I realized that profile['mp'] is the distance between normalized subsequences.

If yes, it means it doesn't take into account the actual magnitude in the patterns. Is that right? So, if I want to find the best matches in my case where actual magnitude matters, it doesn't work for me.

Is there any way to add an argument that can give the user the ability to choose whether to use normalized ED or not (at the cost of higher computation time)?

NimaSarajpoor commented 4 years ago

NOTE: ts_i = ts[i:i+H], where H is the window 24.

My data (accessible via the provided link in the previous post) is a time series with length 8760. the values are between 0 and 1. In my data, there is a range where all values are zeros: mydata[421:450].

According to the result: profile['pi'][45] = 425 I attached the plot of ts_45 and ts_425 in the previous post. And, it seems a little bit weird to me such a result.

Then I realized that: euclidean_dist(ts_45,ts_425) > euclidean_dist(ts_45,ts_7229)

But, since MatrixProfile uses z-normalized Euclidean Distance, we'll get a different result. I am not sure if this is just for the case where a subsequence is all zeros or such problem can appear in other cases as well since comparing distances of normalized subsequences is different that the comparison of the subsequences themselves.

tylerwmarrs commented 4 years ago

The matrix profile uses z-normalization, but it also compares sequences in a sliding window fashion. Magnitude should be a factor here. There is a known issue with the formulation when there is a constant value over the window size causing nans and infs. Generally, adding some random noise to your time-series samples solves this issue.

Will you please respond to/attempt the following:

  1. Are you trying to identify anomalies or repeated patterns?
  2. Try adding random noise to your time series. This can be random to the degree of 1e-7
  3. Try using the motif/discord discover algorithms. You appear to be using custom implementations. There may be a disconnect there. You can even simply use this:
mp.analyze(time_series, windows=H)
NimaSarajpoor commented 4 years ago

Thanks for the clarification.

In fact, I am at the exploration stage, and yes, I am trying to find some sort of repeated patterns. I am trying to see if it makes sense to cluster 365 daily sequences of this 8760-hr data which can be seen in a 365 days-by-24 hours format. Because, according to MatrixProfile, the closest pattern to each daily sequence is not necessarily start at the beginning of a day at 00:00 hr. So, I am trying to see if MatrixProfile can help me better understand the data and if I can come up with a clustering approach by using such information.

tylerwmarrs commented 4 years ago

Because, according to MatrixProfile, the closest pattern to each daily sequence is not necessarily start at the beginning of a day at 00:00 hr.

Yes, it will not always be exactly at the start/end of a day. It primarily depends on the data.

If I understand you correctly, you want to simply cluster many time series together. In this case, you should consider using clustering with MPDist. Coincidentally, I wrote a few code tutorials on how to use MPDist with various clustering algorithms. HDBScan + MPDist is one of my go-to methods right now.

Please see the following:

  1. https://matrixprofile.org/posts/clustering-computing-the-pairwise-distance-matrix/
  2. https://matrixprofile.org/posts/hdbscan-clustering-with-mpdist/
  3. https://matrixprofile.docs.matrixprofile.org/examples/Hierarchical_Clustering_Accelerometer_Walk_Stand_etc.html

Is this closer to what you are trying to do as an end result?

NimaSarajpoor commented 4 years ago

That's correct. I want to cluster many time series together. But the problem with MPDist is that it matches sequences according to their similarity of subsequences regardless of their order which is not desirable for my case. So, I am looking for sth like the one mentioned in this paper (section 3.9): https://www.cs.ucr.edu/~eamonn/Top_Ten_Things_Matrix_Profile.pdf

I think it is different than the MPDist use cases.

NimaSarajpoor commented 4 years ago

It seems the link I provided isn't working when I clicked on it. So, here is the name of the paper:

The Swiss Army Knife of Time Series Data Mining: Ten Useful Things you can do with the Matrix Profile and Ten Lines of Code

tylerwmarrs commented 4 years ago

I understand where you are heading now. You basically tried this approach?

image

If order matters and you are trying to handle warping windows, dynamic time warping may be a good approach. From my experience when you think order matters and that MPDist may not give desirable results, it actually does. Since you are the domain expert, you may be interested in DTW with LB_Keogh:

https://github.com/tylerwmarrs/dtw-lbkeogh-clustering

NimaSarajpoor commented 4 years ago

In fact, I tried DTW and some of its variants. But I wanted to see if MatrixProfile can bring sth new to the picture. I will check out MSMD. So, if I understand correctly, what you are saying is that there might be a chance that MPDist gives some sort of interesting result. Am I right? So, I will give it a shot.

Thanks for sharing your viewpoints on this matter.

NimaSarajpoor commented 4 years ago

[Update] @tylerwmarrs I tried to use MPDist by getting help from the tutorial.

Here is the code:

` MPDist_vecform = pairwise_dist(mydata, window_size=8) MPDist_matform = squareform(MPDist_vecform)

` And, I use Scikit Learn Hierarchical Clustering by using the obtained distance matrix with metric='precomputed'. However, according to the result, it seems that MPDist ignores the actual value of time series sequences and probably z-normalize them prior to clustering. For instance, it puts the following two time-series into the same group:

image

peterdhansen commented 4 years ago

Would it work to use MPDist for clustering and then sort/recluster them by euclidean distance?

tylerwmarrs commented 4 years ago

@Ninimama please see the attached PDF where I add some random noise to your time series. I think this will solve some of the issues you are seeing. I took your toy dataset and noticed there are many cases where there are constant values. This tends to impact the results. Adding the noise is a workaround.

Let me know if this looks more desirable.

Nima Github Issue.pdf

NimaSarajpoor commented 4 years ago

Hi tylerwarrs,

I was finishing my paper and studying git (for my own sake and github)

Recently, I did a pull to change snippets, but after checking out the logs, I realize another person (Yui Lu) already modified the algorithm (Please note that the docstring needs to be updated accordingly since a new key "neighbor" is added to the dictionary)

=========================================

Thanks for the PDF. In fact, I was trying to see if I can extend the material/concept you provided but I couldn't. Although it is good for extracting similar patters, it doesn't cover the whole sequence. So, my goal is to see if I can extend MP to cluster and cover the whole dataset by finding similar patterns (and maybe allowing some number of elements to be outliers.) I will start to think about such an algorithm after finalizing my current work.

I haven't used the MatrixProfile concept yet in my current work and I am working in parallel! Unfortunately, I am not a good multi-tasking guy!