patrickzib / SFA

Scalable Time Series Data Analytics
GNU General Public License v3.0
309 stars 67 forks source link

getEuclideanDistance should depends on the state of timeseries #28

Open assaad opened 5 years ago

assaad commented 5 years ago

The getEuclidean Distance gives incorrect results. It should depends whether the timeseries is normalized or not and whether the query is normalized or not. In case of normalization, both should be de-normalized (value * std + avg). The current implementation normalize again !

As well, for the epsilon radius search, std needs to be taken into account Cheers! Assaad

patrickzib commented 5 years ago

Thank you for pointing this out.

There are two cases:

1) the query is z-normed: thus we have to z-norm all the subsequences.

2) query is not z-normed: a) we can either force the query to be z-normed, which is one call to query.norm() before performing the query or b) we can not apply z-normalization at all, which would be a check in getDistance.

Typically, we force the query to be z-normed in this case. However, I agree that it is possible to leave out z-norm, too. The proposed pull-request however, breaks the code by removing z-norm for subsequences.

assaad commented 5 years ago

For my understanding, it's better to return the un-normalized distance measure. So a potential fix will be: if the query is normalized -> unnormalize its values, if the timeseries is normalized -> un-normalize the values -> then calculate the euclidean distance between original un-normalized versions of both timeseries data and query :) Big thanks anyway!

patrickzib commented 5 years ago

Removing it, is not sooo easy. For example, SFA uses the momentary Fourier transform. The MFT applies z-normalization to each subsequence: https://github.com/patrickzib/SFA/blob/43cfad8696591f4a526ff8372fbbbfee161d215b/src/main/java/sfa/transformation/MFT.java#L112

The same does TimeSeries.getDisjointSequences and TimeSeries.getSubsequences

So, when removing it from getDistance, you suddenly try to lower bound un-normalized subsequences with z-normalized words. This should fail.

assaad commented 5 years ago

yes i know, that's why i didn't propose a fix. I saw that in subsequence matching, each subsequence is normalized with different mean and std, so i can't un-normalized. maybe we need 2 separate methods.