matrix-profile-foundation / matrixprofile

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
https://matrixprofile.org
Apache License 2.0
359 stars 62 forks source link

PMP compatibility with string data types. #73

Open AndrewWilkins84 opened 3 years ago

AndrewWilkins84 commented 3 years ago

In the white paper "Matrix Profile XX: Finding and Visualizing Time Series Motifs of All Lengths using the Matrix Profile", section 3 demonstrates the Pan Matrix Profile with string data, however this capability is not included in the current implementation.

This shouldn't be too hard to implement. The example looks for an identical string match, but Hamming distance (or some other distance metric) could be used measure similarity/dissimilarity between strings. Such a feature has the advantage of avoiding the use of Random Projection in order to find string motifs with noise included in the signal, while simultaneously searching all possible sequence lengths.

The alternative is Random projection through every motif sequence length (N/2 where N is the length of the time series) and all reasonable discrepancy counts.

tylerwmarrs commented 3 years ago

Based on our chat in discord, we could add a preprocessing function that allows you to convert string data to a numeric representation. There are some caveats to this approach as you may lose some meaning. There isn't really a nice way to provide a generic approach that works for all data sets.

@kavj mentioned this paper for reference:

It's just an inherent problem with treating strings as numeric data. You lose precision and sometimes a lot of it. Comparing character by character requires that each comparison is resolvable to equal or not equal. Once you have a range of values, you can encode arbitrary trends, and normalization tends to exacerbate this problem. See for example some earlier commentary on the issue.

The paper mentions a method called DFA, which isn't popular in practice as it tends to introduce edge artifacts and isn't that closely related, but the point is that they encounter the same issue in their analyses.

https://journals.aps.org/pre/abstract/10.1103/PhysRevE.49.1685

AndrewWilkins84 commented 3 years ago

I'd like to take a swing at this. @tylerwmarrs, I agree that conversion to numerical values may not be the best approach. I've done quite a bit of work already using SAX and Random Projection that uses string and character data types with no issue. As stated in the Discord channel, the distance is measured by Hamming distance (number of corrections needed to make 2 strings identical). This distance metric would have to be the default when working with string data types.

Beyond that, what I envision is a PMP that functions the same way as numerical, but the gradient is represented by the Hamming distance. This is different than the white paper "https://www.cs.ucr.edu/~eamonn/PAN_SKIMP%20%28Matrix%20Profile%20XX%29.pdf" in the sense that you no longer need an identical match (see Fig 4), but instead have the Hamming distance representation.

tylerwmarrs commented 3 years ago

Sounds good. Once you have some initial code in place, open a pull request and mention this issue. Dealing with string data is a new concept for this library, so we will need to make sure we fit it into the rest of the api accordingly.