tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.92k stars 342 forks source link

[BUG] non-conformance of `metrics.lcss` with input interface expectations (3D numpy) #483

Open fkiraly opened 1 year ago

fkiraly commented 1 year ago

It seems that metrics.lcss is non-conformant with generic interface expectations towards time series distances when passing 3D np.ndarrays.

It would seem that lcss fails on some, or all 3D np.ndarrays. This behaviour was observed when running an sktime conformant adapter through the sktime test suite, here: https://github.com/sktime/sktime/issues/5367#issuecomment-1750378133

Update: the issue seems to be that, unlike all other metrics, lcss does not accept 3D numpy as input.

A secondary issue is the lack of easily findable documentation on the input format of the "time series" in the various metrics. Unfortunately, metrics with the same docstring can have different input conventions (supports 3D or not).

YannCabanes commented 1 year ago

Hello @fkiraly, I have discussed this issue with @rtavenar: it is not a bug and it corresponds to the documentation (which we will make clearer).

The functions taking 3D arrays as inputs are the functions: cdist_* For example in the docstring of the function cdist_dtw, there is the input parameter:

dataset1 : array-like
    A dataset of time series

Functions of the form cdist_* can also accept 1D or 2D arrays when they use: dataset1 = to_time_series_dataset(dataset1)

The other metric functions usually take 2D arrays as inputs. For example, in the docstring of the function lcss, there is the input parameter:

s1
    A time series.

By default time series should be 2D arrays, some metric functions also accept 1D arrays when they use: s1 = to_time_series(s1)

To make it cleared, I will specify in the docstrings of the functions the possible shapes of the input arrays in the PR https://github.com/tslearn-team/tslearn/pull/486.

For example:

s1 : array-like, shape=(sz1, d) or (sz1,)
    A time series. If shape is (sz1,), the time series is assumed to be univariate.

Or:

dataset1 : array-like, shape=(n_ts, sz, d) or (n_ts, sz) or (sz,)
    A dataset of time series. If shape is (n_ts, sz), the dataset is composed of univariate time series. 
    If shape is (sz,), the dataset is a composed of a unique univariate time series.
fkiraly commented 1 year ago

I see, thanks for clarifying. The lack of specificity in the docstrings did cause some headache and trial-and-error on our side...

I would suggest in the docstrings to state clearly, in all 3 cases (1D, 2D, 3D), which index corresponds to instance index, time index, variable index.

Feel free to close this issue then.

PS: if I were to design the interface, I would add a default upcast to 3D to every distance function, and have only one single interface point. For deprecation period, you can add 3D to the 2D functions and a warning. Why: simpler for the user to learn just one generic interface, simpler to maintain and test (just loop over the entire list of distances)