lfereman commented 8 years ago

Hi Nathan,

I was trying to use your implementation, but I guess it contains some bugs, as far as I can figure it out.

The problem:

I have a sax_sequence A, A: aaaaccccbbaa

and a longer sequence "sequence" C: match score between subsequence, and A ^ indexes | ^ |> subsequence in C W: 0.000 (0, 24) bbbbbbbbbbbb W: 0.000 (2, 26) bbbbbbbbbbbb W: 0.000 (18, 42) aaaaccbbbbba W: 0.000 (48, 72) aaaaccccbbaa W: 0.000 (70, 94) bbbbbbbbbbbb W: 0.000 (72, 96) bbbbbbbbbbbb W: 0.000 (74, 98) bbbbbbbbbbbb W: 0.000 (76, 100) bbbbbbbbbbbb

It hink it's a bug that bbbbbbbbbbbb is equal to aaaaccccbbaa, no? The problem is that compareDict does not make sense (e.g. difference(a,b)=0 and difference(b,c)=0) e.g. print s.compareDict {'aa': 0, 'ac': 0.86, 'ab': 0, 'ba': 0, 'bb': 0, 'bc': 0, 'cc': 0, 'cb': 0, 'ca': 0.86}

Full case that triggers bug:

sequence = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 23.0, 73.0, 73.0, 75.0, 30.0, 16.0, 19.0, 27.0, 33.0, 19.0, 5.0, 20.0, 19.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 18.0, 16.0, 11.0, 30.0, 10.0, 39.0, 12.0, 2.0, 15.0, 16.0, 4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 19.0, 6.0, 39.0, 27.0, 18.0, 20.0, 38.0, 34.0, 33.0, 10.0, 10.0, 15.0, 10.0, 8.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 9.0, 10.0, 10.0, 35.0, 25.0, 24.0, 18.0, 28.0, 18.0, 16.0, 18.0, 31.0, 10.0, 10.0, 15.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 8.0, 30.0, 25.0, 13.0, 13.0, 28.0, 27.0, 20.0, 13.0, 9.0, 11.0, 5.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 11.0, 4.0, 18.0, 26.0, 13.0, 23.0, 16.0, 13.0, 15.0, 12.0, 17.0, 15.0, 24.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 10.0, 9.0, 3.0, 27.0, 15.0, 18.0, 23.0, 25.0, 16.0, 12.0, 23.0, 13.0, 16.0, 10.0, 8.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 13.0, 8.0, 28.0, 25.0, 19.0, 15.0, 23.0, 8.0, 23.0, 30.0, 28.0, 20.0, 25.0, 16.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 16.0, 10.0, 27.0, 24.0, 30.0, 27.0, 28.0, 41.0, 31.0, 25.0, 6.0, 25.0, 9.0, 9.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 11.0, 25.0, 28.0, 15.0, 15.0, 23.0, 15.0, 23.0, 26.0, 15.0, 17.0, 12.0, 9.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 18.0, 12.0, 36.0, 28.0, 13.0, 21.0, 15.0, 19.0, 33.0, 36.0, 9.0, 6.0, 10.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 12.0, 42.0, 13.0, 23.0, 23.0, 49.0, 5.0, 6.0, 15.0, 13.0, 13.0, 11.0, 16.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 2.0, 16.0, 25.0, 17.0, 16.0, 25.0, 18.0, 18.0, 25.0, 17.0, 13.0, 12.0, 4.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 11.0, 3.0, 28.0, 20.0, 24.0, 21.0, 21.0, 21.0, 16.0, 32.0, 28.0, 15.0, 18.0, 15.0, 2.0, 11.0, 23.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 17.0, 13.0, 26.0, 15.0, 18.0, 15.0, 3.0, 0.0, 11.0, 19.0, 11.0, 17.0, 12.0, 4.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 18.0, 16.0, 26.0, 15.0, 19.0, 18.0, 20.0, 26.0, 11.0, 12.0, 10.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 12.0, 10.0, 45.0, 20.0, 15.0, 28.0, 20.0, 24.0, 16.0, 19.0, 20.0, 13.0, 19.0, 15.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 9.0, 4.0, 11.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 2.0, 5.0, 1.0, 15.0, 8.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 15.0, 12.0, 21.0, 25.0, 15.0, 15.0, 26.0, 2.0, 0.0, 2.0, 0.0, 4.0, 12.0, 16.0, 18.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 10.0, 12.0, 6.0, 20.0, 0.0, 0.0, 1.0, 27.0, 19.0, 25.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 24.0, 11.0, 25.0, 17.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 26.0, 26.0, 17.0, 6.0, 18.0, 17.0, 8.0, 17.0, 4.0, 21.0, 12.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 0.0, 6.0, 5.0, 16.0, 18.0, 23.0, 32.0, 17.0, 25.0, 5.0, 12.0, 13.0, 0.0, 0.0, 0.0]

e.g. input assumed to be 30 days of each 24 points

motif = sequence[48:72]

s = SAX(12, 3) (a_sax, a_indexes) = s.to_letter_rep(motif) print "a_sax: %s" % a_sax (sequence_strings, sequence_indexes) = s.sliding_window(sequence, len(sequence)/ len(motif)) x3x2ComparisonScores = s.batch_compare(sequence_strings,asax)

count = 0.0 threshold = 0.1 print "A:\t\t_\t%s" % a_sax for i, score in enumerate(x3x2ComparisonScores): if score<threshold: print "W: %.3f\t%s\t%s" % (score, sequence_indexes[i], sequence_strings[i])

Other questions:

Your input parameters are chosen somewhat strange in my opinion. I would expect to just pass, the gap_size in sliding_window and the size of interval for PAA (e.g. 2 for taking averages of two number in input sequence) in the SAX constructor.

I'm not sure if you still maintain this code, but it would be nice to have a correct implementation in python! However, if you choose not to maintain this code, I think it's better that you remove it as it does not seem to work properly...

Kind regards! Len Feremans

nphoff commented 8 years ago

Hi Len! Thanks for your interest in the project!

As to your first question:

The scoring of differences between strings begins when letters are more than one letter away. Meaning it will score a to a the same as b to a. That's is how the algorithm for comparison was developed, and I kept that to spec, though I agree it is a little strange. I would suggest using a dictionary size larger than 3.

As to your other questions:

The parameter order is definitely something to think about!
- I'm not sure what you mean by gap size. The parameter which is present is the overlapping fraction (meaning how much of one window overlaps with the next window). I think it made sense to me to be able to customize that overlap on a per-comparison basis, but I haven't thought about this in a while.
- I'm not sure what you mean by size of interval for PAA, either. Do you mean like...how many data points from the original are reduced to a single letter? I think the original thought behind it was that all the strings had to be the same length anyway, and it was easier to specify that in the constructor.
Happy to revisit those if you have thoughts.

Nathan

zangsir commented 8 years ago

Hi lfereman,

There is a simple way you can test the correctness of this implementation on your test cases. For example, I downloaded the original implementation in MatLab from the authors' (who invented SAX) page (http://www.cs.ucr.edu/~eamonn/SAX.htm) from the part they mentioned Li Wei's code. And I ran your test cases, such as bbbbbbbbbbbb and aaaaccccbbaa, and the distance is indeed 0, as defined in the original paper and explained by Nathan above.

To Nathan: thanks for the code, I've been using it, although when mining large dataset, computing pairwise distance using MIN_DIST is quite time consuming. I wonder if you have thought of ways to optimize the code to make it more efficient?

-Shuo

nphoff commented 8 years ago

Hi Shuo, no, I haven't spent any time thinking about how to optimize that function, but it does seem interesting.

What size data sets are you working with, and what kind of times are we talking about?

zangsir commented 8 years ago

Hi Nathan,

Sorry for the delay. I am doing tasks involving high computational complexity by the algorithms themselves, so the SAX MIN_DIST computation is a key factor in affecting the speed. Two examples of these are k-Nearest Neighbor classification, and Query By Content (QBC, where let's say you have K random seeds of time series and you query in a large database of N time-series(maybe tens of thousands of time series subsequences), and it will return the top ranked M similar time-series.

Even though these are quite expensive tasks, doing them on Euclidean distance doesn't take long. Let's say I run a small example of QBC with K = 50, and N = 200, as illustrated above. The Euclidean distance computation I have implemented (including computing a Mean Average Precision score over all possible values of M from 1 to 50) will return a result within 2 seconds. However, using the saxpy implementation I forked from you, it will take 32 seconds. Considering N=200 here, this is really a small example and you can imagine it's hard to scale this to a large database of TS. (For KNN, it took almost an hour to run it once on a data set of 1600 time series).

I have re-implemented saxpy (saxpyFast as you can find in my fork of your saxpy) to speed it up, and the key is to use integer representation in MIN_DIST computation instead of converting to strings(also takes advantage of numpy efficient matrix operations that are similar to Matlab instead of using for loops). Using saxpyFast, my test results on the same experiment above runs only 9 seconds. There might be still room to improve but this is a start:)

Sorry, I didn't mean to say your implementation is not good, it's just a way to improve the speed. And with your string representation, it is also possible to use it on other string algorithms, so in that aspect it's just different from what I've done in saxpyFast. I used yours for a long time and I like it a lot. Thanks a lot for the contribution!

jimenofonseca commented 8 years ago

@zangsir hi, i am looking for your implementation of SAXFAST is there a way i can take a look? thanks!

zangsir commented 8 years ago

@JIMENOFONSECA Yeah, I have put saxpyFast in my fork of saxpy. In the meantime, you can also check out my repository SPM and you can find saxpyFast.py and test_saxFast_mindist.py in the ts_mining directory (https://github.com/zangsir/SPM/tree/master/ts_mining), the latter demos a complete usage of the saxpyFast (it is self-sufficient in terms of data and you can run it on its own, provided you have saxpyFast.py in the same directory). Note that this is mainly for the purpose of fast distance computation and will not give you a string representation per se (as this uses integer arrays as internal representations for the SAX strings).

nphoff commented 8 years ago

@zangsir Any way you could merge back in your changes to make saxpy faster? :D

zangsir commented 8 years ago

@JIMENOFONSECA It just occurred to me that I am yet to make better comments and documentation on saxpyFast, since a number of things work a little differently than the original saxpy. But essentially, notice that the class SAX takes one more argument, namely, windowSize. This parameter (mirroring the original Matlab implementation by Jessica Lin and Li Wei) exists to facilitate the extraction of SAX representation of subsequences of a long time-series by sliding a window of size windowSize through the entire long time-series. It will return a list of SAX symbols instead of one symbol. This just means you will need to keep this change in mind and use your python code to work around this data structure when doing further processing with it.
(if you don't want to extract subsequences, just put windowSize equals the length of the original long time-series)

(For example, if you have a time-series of length 1024 but you want to extract time-series subsequences of windowSize 16 at a time, so this will slide the window across the entire time-series and converts each subsequence of length 16 to a SAX representation of length wordSize. The SAX.to_letter_rep() method will return a list of SAX symbols, each corresponding to one subsequence extracted. The current hop size is 1 sample, meaning it will extract subsequences of length 16 beginning point 1, 2, ..., N-16+1).

zangsir commented 8 years ago

@nphoff Yeah, sure, I can clean up my code, add some documentation, and do a pull request. But since your implementation gives the strings and mine doesn't, maybe you can think of a way to keep both versions, so people can choose which one they want to use. I imagine if someone wants to use some string based algorithms on SAX strings, such as suffix trees or markov models, maybe it'll be more intuitive to use the string representation after all.

jimenofonseca commented 8 years ago

@zangsir i see the difference. I might implement the version of nphoff by now as i need the alphabet. many thanks for the clarification!.

@zangsir @nphoff btw I found a way to calculate the breaks or 'beta' based on some probability distributions of scipy. What i see quite handy if you do consider that the data between the alphabet is other than normally distributed. On the other hand you can also overcome the current restriction of maximum 20 breaks.

....just do:

Import scipy.stats as stats self.beta = list(stats.norm().ppf(np.linspace(0.01, 0.99, self.alphabetSize+1))[1:-1])

zangsir commented 8 years ago

@JIMENOFONSECA I didn't completely understand what you meant by your comment, but sure, having the ability to use some other distributions would be nice, I haven't considered that. I wonder have people published modified versions of SAX where they used non-Gaussian assumption? I'd love to read those.

jimenofonseca commented 8 years ago

@zangsir i edited the comment, maybe it is clear now?

zangsir commented 8 years ago

@JIMENOFONSECA cool, that is very clever, thanks!

jimenofonseca commented 8 years ago

@zangsir @nphoff have you tried to implement certain evaluation method to get an optimal wordsize and alphabet? Calinski.., Silhouette or any optimisation method?

zangsir commented 8 years ago

@JIMENOFONSECA I guess the optimal parameter values would be empirically determined and depend on the data set, as well as your application (such as clustering as you mentioned). Therefore it would be more flexible if someone can implement their own optimization to be more flexible. However, as you might have noticed the original paper on SAX by Lin et al. had a discussion on the tightness of the bound as a function of w and a, which maybe worth keeping in mind. I'd love to know your thoughts on this (the optimization, not the lower bound):)

jimenofonseca commented 8 years ago

I gues the bounds should depend on the data, that is totally right. About the optimization i think GA could work as it is a multi-objective problem (maximize sihluette and carlinski indicators). I found that some researchers used the harmony search algorithm in SAX with good enough results using the concept of entropy. Any thoughts?

On 6 Nov 2016, at 10:27, SHUO ZHANG notifications@github.com<mailto:notifications@github.com> wrote:

@JIMENOFONSECAhttps://github.com/JIMENOFONSECA I guess the optimal parameter values would be empirically determined and depend on the data set, as well as your application (such as clustering as you mentioned). Therefore it would be more flexible if someone can implement their own optimization to be more flexible. However, as you might have noticed the original paper on SAX by Lin et al. had a discussion on the tightness of the bound as a function of w and a, which maybe worth keeping in mind. I'd love to know your thoughts on this (the optimization, not the lower bound):)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/nphoff/saxpy/issues/2#issuecomment-258656169, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AIjrggmPOz1sQThAn1HZ0WpF8pWYfgxKks5q7Tr6gaJpZM4KJ69g.

zangsir commented 7 years ago

@JIMENOFONSECA Thanks for the info (sorry for the late reply). In general I am not too familiar with the line of work for optimizing SAX, and it was definitely interesting to look at optimization based on entropy (maximizing information content) rather than aimed at classification accuracy. I will look at them and see if it is relevant to what I am doing. In the mean time, I wonder do you know have people in different communities (like time-series data mining or bioinformatic or other fields) have adopted some of these optimization process as a (quasi-) standard practice? and how crucial it is to do this in various applications? I guess that is more of my question.

nphoff / saxpy

Bug in implementation? #2

The problem:

Full case that triggers bug:

e.g. input assumed to be 30 days of each 24 points

Other questions: