tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.85k stars, 336 forks

Missing data #272

Open rtavenar opened 4 years ago

rtavenar commented 4 years ago

I received the following question by email*:

Dear Romain

thanks for this toolkit. Can TSlearn handle missing data - quite a big problem in time series analysis of Earth Observation (EO) data ... my field ?

I am not 100% sure what is implied by "handle missing data", but I can try to formulate an answer:

*I can no longer answer questions about tslearn by email, so please post your questions as a GitHub issue to maximize your chances of getting an answer.

johannfaouzi commented 4 years ago

Missing data is not incompatible with variable-length time series. You can have a time series whose length is 80 with no missing data and another time series whose length is 60 with missing data. Toy example:

[Screenshot, 2020-07-02 at 09:25:34]

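
For readers who cannot see the screenshot, here is a rough sketch of what such a toy dataset could look like. The values and the `pad_with_nan` helper are made up for illustration (they are not taken from the screenshot or from tslearn); the only thing borrowed from this thread is the convention of padding shorter series with NaN.

```python
import numpy as np

# Hypothetical toy data: a complete series of length 8 and a
# shorter series of length 6 with two missing observations.
full = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
gappy = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])


def pad_with_nan(series_list):
    """Stack 1-D series into a 2-D array, padding the ends with NaN."""
    max_len = max(len(s) for s in series_list)
    out = np.full((len(series_list), max_len), np.nan)
    for i, s in enumerate(series_list):
        out[i, :len(s)] = s
    return out


dataset = pad_with_nan([full, gappy])
print(dataset.shape)         # (2, 8)
print(np.isnan(dataset[1]))  # NaN mask mixes missing values and padding
```

After padding, the second row contains four NaNs: two are missing observations and two are padding, and nothing in the array itself says which is which.
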
GillesVandewiele commented 4 years ago

How is it not compatible? Can't you easily distinguish between "missing values" and "padding" by the location of the NaN? If it's at the end -> padding, in the middle -> missing value
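
A minimal sketch of this position-based rule, assuming a 1-D NaN-padded series. The `split_nan_mask` name is hypothetical and not part of tslearn.

```python
import numpy as np


def split_nan_mask(series):
    """Return (missing_mask, padding_mask) for a 1-D NaN-padded series.

    Assumption: every NaN after the last finite value is padding,
    every other NaN is a missing observation.
    """
    series = np.asarray(series, dtype=float)
    nan_mask = np.isnan(series)
    valid_idx = np.flatnonzero(~nan_mask)
    # Length of the series once trailing NaNs are stripped.
    true_len = valid_idx[-1] + 1 if valid_idx.size else 0
    padding_mask = np.zeros_like(nan_mask)
    padding_mask[true_len:] = True
    missing_mask = nan_mask & ~padding_mask
    return missing_mask, padding_mask


padded = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, np.nan, np.nan])
missing, padding = split_nan_mask(padded)
print(np.flatnonzero(missing))  # positions treated as missing values
print(np.flatnonzero(padding))  # positions treated as padding
```

One limitation of the heuristic: if the last observations of a series are themselves missing, they are indistinguishable from padding and the series is silently treated as shorter.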

johannfaouzi commented 4 years ago

I would advocate for two different values (maybe np.nan and np.inf) to highlight the difference. But as Romain said, there is no imputation module for the moment, so NaNs are only used as padding values.
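
To illustrate the two-sentinel idea, here is a sketch that assumes np.nan marks a missing observation and np.inf marks padding. That assignment, and the `pad_with_sentinel` helper, are assumptions made for this example; tslearn itself does not behave this way.

```python
import numpy as np

MISSING = np.nan  # assumed marker for an unobserved value
PADDING = np.inf  # assumed marker for "past the end of the series"


def pad_with_sentinel(series_list, pad_value=PADDING):
    """Stack 1-D series into a 2-D array, padding the ends with pad_value."""
    max_len = max(len(s) for s in series_list)
    out = np.full((len(series_list), max_len), pad_value, dtype=float)
    for i, s in enumerate(series_list):
        out[i, :len(s)] = s
    return out


data = pad_with_sentinel([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, MISSING, 3.0, MISSING],  # the last observation is missing
])

missing_mask = np.isnan(data)        # True where a value is missing
padding_mask = np.isposinf(data)     # True where the array is padded
lengths = (~padding_mask).sum(axis=1)
print(lengths)  # true length of each series, missing values included
```

With separate markers, a missing value at the very end of a series keeps its place, which the position-based rule cannot guarantee. The trade-off is that np.inf can no longer appear as a genuine observation.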

I said "not incompatible", so I think we agree on this ^^