tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.91k stars 342 forks source link

Global alignment kernel returns NaN for all timeseries #510

Open NAThompson opened 8 months ago

NAThompson commented 8 months ago

Describe the bug

gak(x,y) returns nan for all x,y.

To Reproduce

import random
import numpy
from math import pi as π
from tslearn.metrics import sigma_gak, gak

def test_reproduce():
    f0 = 20e9
    period = 1/f0
    ω0 = 2*π*f0
    # works at length 405; dies at 410:
    waveforms = numpy.empty(shape=(5, 410))
    times = numpy.linspace(-period/2, period/2, waveforms.shape[1])
    for i in range(waveforms.shape[0]):
        φ = random.gauss(0.0, 0.5)
        k = 3 + random.uniform(-0.5, 0.5)
        waveforms[i, :] = 0.5*(numpy.tanh(k*ω0*times + φ) + 1) + random.uniform(-0.05, 0.05)

    typical_values = 0.5*(numpy.tanh(3*ω0*times) + 1)
    σ = sigma_gak(typical_values)
    scores = numpy.empty(waveforms.shape[0])
    for i in range(waveforms.shape[0]):
        scores[i] = gak(typical_values, waveforms[i, :], sigma=σ)

    print(scores)
    print(gak(waveforms[0, :], waveforms[0, :], sigma=σ))

Expected behavior

The computation should not return NaN; maybe it needs to be stabilized with the log-sum-exp method?

Environment (please complete the following information):

YannCabanes commented 8 months ago

Hello @NAThompson, Thank you for your issue. Indeed, there is an overflow which can occur if the function gak is run on long time series which have very few variations. We have: gak(s1, s2) = unnormalized_gak(s1, s2) / sqrt(unnormalized_gak(s1, s1) * unnormalized_gak(s2, s2)) For example, if s1 is a constant time series (worst possible scenario) of size n, the function unnormalized_gak(s1, s1) will return the number of possible warpings to match s1 with itself, each warping being optimal. This number is equal to the number of paths in a grid of n rows and n columns to go from the top left cell to the bottom right cell when the only possible moves are going right, down or diagonally (right and down). When s1 is a constant time series, unnormalized_gak(s1, s1) returns a float when n <= 405 an returns inf when n >= 406. If s1 and s2 are both constant time series of size n, the product unnormalized_gak(s1, s1) * unnormalized_gak(s2, s2) returns a float when n <= 204 and returns inf when n >= 205.