karhohs opened this issue 6 years ago
Like DTW, SDTW is not a distance. And, as you noticed, unlike DTW, SDTW(x, x) != 0. So we prefer to call SDTW a discrepancy.
The reason it's not zero is that SDTW considers all possible alignments weighted by their probability under the Gibbs distribution (see paper), while DTW considers only the best one.
As is typically done for the GA kernel (http://marcocuturi.net/GA.html), one might want to normalize SDTW to give it that property: SDTW_norm(x, y) := SDTW(x, y) - 1/2 (SDTW(x, x) + SDTW(y, y)).
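For readers who want to check this numerically, here is a self-contained sketch of soft-DTW and the normalization above for 1-D series. `soft_dtw` and `normalized_soft_dtw` are illustrative helpers written for this thread, not the repository's API, and the quadratic loops are for clarity rather than speed:

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Minimal soft-DTW sketch for 1-D series: the DTW dynamic program
    with the hard min replaced by a soft-min of temperature gamma."""
    D = (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2  # squared Euclidean costs
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            lo = prev.min()
            # softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)), shifted by
            # the hard min for numerical stability (inf entries contribute 0)
            softmin = lo - gamma * np.log(np.exp(-(prev - lo) / gamma).sum())
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]

def normalized_soft_dtw(x, y, gamma=1.0):
    """The normalization suggested above: subtract the self-discrepancy
    terms so that the result is exactly zero when x == y."""
    return soft_dtw(x, y, gamma) - 0.5 * (soft_dtw(x, x, gamma) + soft_dtw(y, y, gamma))
```

Because the soft-min is strictly below the hard min whenever several paths have finite cost, `soft_dtw(x, x)` comes out negative, and the normalized version restores zero self-discrepancy.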
@mblondel thank you for your insights. I looked at your paper and see that the discrepancy is the weighted sum of all possible alignments, which I believe is Equation 1 in your paper. From the above cosine example, I observed that the discrepancy between a cosine and itself is greater than the discrepancy between a cosine and a slightly shifted copy of itself scaled by 0.5. Does this imply that alignments that account for scale are penalized more heavily than alignments that account for shifts? I believe I will normalize each of my time series by its mean to help mitigate this bias. As @mblondel states in his paper, "Unlike the Euclidean distance, DTW can compare time series of variable size and is robust to shifts or dilatations across the time dimension." Perhaps DTW is not robust to scaling and amplitude changes of a time series. Am I reading this all correctly?
Also, thanks for pointing out your normalization, @marcocuturi. Can the normalization be performed before computing the SDTW discrepancy? And can I use the same equation to normalize the gradient of the alignment matrix if I want to use the normalized discrepancy as a fitting term?
FWIW, I used @marcocuturi's suggestion and applied it to the cosines defined above. The results now resemble a distance metric:
import numpy as np
from sdtw import SoftDTW
from sdtw.distance import SquaredEuclidean

t = np.arange(100)
X = np.cos(t).reshape(-1, 1)
Y = 0.5*np.cos(t + 1).reshape(-1, 1)
Z = 2.0*np.cos(t + 1).reshape(-1, 1)
Dxx = SquaredEuclidean(X, X)
Dxy = SquaredEuclidean(X, Y)
Dyy = SquaredEuclidean(Y, Y)
Dxz = SquaredEuclidean(X, Z)
Dzz = SquaredEuclidean(Z, Z)
sdtw_xx = SoftDTW(Dxx, gamma=1.0)
sdtw_xy = SoftDTW(Dxy, gamma=1.0)
sdtw_yy = SoftDTW(Dyy, gamma=1.0)
sdtw_xz = SoftDTW(Dxz, gamma=1.0)
sdtw_zz = SoftDTW(Dzz, gamma=1.0)
print(sdtw_xx.compute()-0.5*(sdtw_xx.compute()+sdtw_xx.compute()))
print(sdtw_xy.compute()-0.5*(sdtw_xx.compute()+sdtw_yy.compute()))
print(sdtw_xz.compute()-0.5*(sdtw_xx.compute()+sdtw_zz.compute()))
Here's the output:
0.0
16.233598135265396
57.79562607598451
Furthermore, the effect of scaling is seen more plainly after this normalization. If I define the Z cosine as 1.5*np.cos(t + 1).reshape(-1, 1) instead of 2.0*np.cos(t + 1).reshape(-1, 1), then the normalized discrepancy is 15.768949868955872, very close to that of the 0.5-scaled cosine.
I've been trying to cluster time-series data using soft-DTW and hierarchical clustering, but I've had a difficult time recovering clusters that are obvious by eye. @marcocuturi's normalization equation has helped, except that the very dissimilar scores appear to dominate the clustering. To address this, I rescaled the normalized dissimilarity into (0, 1] with exp(-x). This appears to help, but I'm still falling short of expectations. What do you all think about this approach? Am I missing something obvious? I'm happy to share a gist of my Jupyter notebook and some test data if that would help. Thanks!
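One refinement worth trying on the exp(-x) rescaling: dividing by a bandwidth before exponentiating keeps mid-range discrepancies from all saturating toward 0 (e.g. exp(-16) and exp(-58) are both effectively zero, so dissimilar pairs become indistinguishable to the linkage). A sketch with a made-up discrepancy matrix; the values and the median-bandwidth heuristic are illustrative assumptions, not something prescribed by the paper:

```python
import numpy as np

# Made-up symmetric matrix of normalized discrepancies (values illustrative).
D = np.array([[ 0.0, 16.2, 57.8],
              [16.2,  0.0, 30.0],
              [57.8, 30.0,  0.0]])

# Median positive discrepancy as a bandwidth -- a common heuristic.
sigma = np.median(D[D > 0])

S = np.exp(-D / sigma)   # similarities in (0, 1]; identical series -> 1
D01 = 1.0 - S            # bounded dissimilarity in [0, 1) for hierarchical clustering
```

The monotonic ordering of the discrepancies is preserved, but the dynamic range is compressed so extreme scores no longer dominate.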
I am curious: do you observe the same behavior with unregularized / exact DTW? Hierarchical clustering doesn't need gradients, so you could try it too.
SDTW really shines when using it for k-means clustering, since the centroids / barycenters can be learned. I haven't added k-means to the repository yet, though.
Another option is to use the soft-DTW as a kernel and plug it into kernel k-means.
The kernel is given by sdtw_kernel(x, y; gamma) := exp(-sdtw(x, y; gamma)).
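On the kernel k-means suggestion: to my knowledge scikit-learn doesn't ship a kernel k-means, so here is a rough Lloyd-style sketch that operates on a precomputed kernel matrix. `kernel_kmeans` and its explicit `init` argument are illustrative, not an existing API; in practice K[i, j] would hold exp(-sdtw(x_i, x_j; gamma)), with the caveat that the soft-DTW kernel is not guaranteed positive definite, which kernel k-means formally assumes:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, init, n_iter=50):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K.
    Squared feature-space distance from point i to the mean of cluster c:
      K[i, i] - 2 * mean_{j in c} K[i, j] + mean_{j, l in c} K[j, l]."""
    n = K.shape[0]
    labels = np.asarray(init).copy()
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                continue  # empty cluster stays at infinite distance
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
    return labels
```

The centroids live implicitly in feature space, so no barycenter computation is needed, which sidesteps the missing k-means support in the repository.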
I believe this project supports SDTW-based k-means clustering:
Hello,
I wanted to build up some intuition about how the soft-DTW cost function works by playing around with some fake data. I started comparing scaled and shifted cosines and wanted to see what numbers soft-DTW would spit out. However, when I compared a cosine to itself, I couldn't understand why I was getting a negative result. Wouldn't the cumulative alignment cost for a signal to itself be zero? Could someone please help me understand why the result is negative?
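One way to see why the self-comparison goes negative: soft-DTW replaces the hard min in the DTW recursion with a soft-min, softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)), which sits strictly below the hard min whenever more than one candidate has finite cost. Even when every local cost is zero, as along the diagonal when a series is compared with itself, the soft-min of k zeros is -gamma * log(k) < 0. A tiny sketch, where `softmin` is an illustrative helper, not part of the package:

```python
import numpy as np

def softmin(costs, gamma=1.0):
    """softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)),
    a smooth lower bound on min(a), shifted by the min for stability."""
    costs = np.asarray(costs, dtype=float)
    lo = costs.min()
    return lo - gamma * np.log(np.exp(-(costs - lo) / gamma).sum())

print(softmin([0.0, 0.0, 0.0], gamma=1.0))   # -log(3) ~ -1.0986, below the hard min of 0
print(softmin([0.0, 1.0, 2.0], gamma=0.01))  # ~ 0.0: recovers the hard min as gamma -> 0
```

These negative soft-min contributions accumulate along the recursion, which is why the self-discrepancy is negative rather than zero.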
Also, when I compared two similarly shifted cosines, one scaled by 0.5 and the other by 2.0, the discrepancies were very different from each other. Why does scaling have such a strong effect on the scoring? Should I be self-normalizing my data before using soft-DTW?
I tried the following:
I got the following output: