johannfaouzi / pyts

A Python package for time series classification
https://pyts.readthedocs.io
BSD 3-Clause "New" or "Revised" License

add `n_init` argument to KMeans to suppress warning #143

Closed valcarcexyz closed 1 year ago

valcarcexyz commented 1 year ago

Preserve the same behaviour, but pass the `n_init` argument explicitly to the scikit-learn `KMeans` estimator to prevent this warning.
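For context, a minimal sketch of the idea (assuming scikit-learn 1.2+, where the default `n_init` is scheduled to change from 10 to `'auto'`): passing `n_init` explicitly keeps the historical behaviour and silences the `FutureWarning`.

```python
from sklearn.cluster import KMeans

# Passing n_init explicitly keeps the pre-change default of 10 initializations
# and suppresses the FutureWarning about the upcoming 'auto' default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
```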

johannfaouzi commented 1 year ago

Thank you for your PR! Just some quick thoughts:

I would be glad to have your opinion on this! Anyway, I will probably still merge your PR so that the warning is suppressed and the old behavior remains the current behavior.

valcarcexyz commented 1 year ago
johannfaouzi commented 1 year ago

The issue is that this change in scikit-learn is not backwards compatible, meaning that different scikit-learn versions (with the same pyts version) may lead to different results. So we have to decide if this behavior is acceptable or not.
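To make the concern concrete, here is a small, hypothetical illustration (not from the PR) of why the default matters: with a fixed `random_state`, `n_init=10` keeps the best of ten initializations (lowest inertia), so it can yield different centroids than a single initialization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# With overlapping clusters, a single initialization may land in a worse
# local optimum than the best of 10 initializations, even with a fixed seed.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=3.0, random_state=0)

inertia_1 = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X).inertia_
inertia_10 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X).inertia_
print(inertia_1, inertia_10)  # inertia_10 <= inertia_1; they may differ
```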

I will try to run KMeans on several datasets with 1 and 10 initializations and see:

valcarcexyz commented 1 year ago

I am working on a classification problem at my job, so as soon as I finish my workday, I will run the experiment to get the metrics you are asking for.

valcarcexyz commented 1 year ago

For my task, I did not see any score improvement with this model, but `n_init` has a performance impact. It is up to you to decide what you believe is best. What I do not understand is that when it is set to `'auto'` it should perform as if it were set to 10, but it does not seem to; can you confirm this behaviour? The results were obtained by cross-validation.

[output: cross-validated score and timing plot for the different `n_init` settings]
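For reference, a hedged sketch of how such cross-validated score and timing numbers can be collected with scikit-learn's `cross_validate` (the exact estimator and dataset used above are not shown in the thread; `LearningShapelets` on GunPoint is only an assumption here):

```python
import numpy as np
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
from sklearn.model_selection import cross_validate

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

# cross_validate reports test_score, fit_time and score_time per fold,
# which are the quantities compared in the plot above.
cv_results = cross_validate(LearningShapelets(random_state=42),
                            X_train, y_train, cv=3)
for key in ("test_score", "fit_time", "score_time"):
    print(key, np.round(cv_results[key], 3))
```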

johannfaouzi commented 1 year ago

Thank you very much for your experiments! I haven't had time to do mine yet, but:

valcarcexyz commented 1 year ago

I have set the random seed and the results are the same as in the previous plot.

johannfaouzi commented 1 year ago

I just tried on a toy dataset (GunPoint, available in pyts) and I get the exact same shapelets for:

I'm also surprised by the score time plot. KMeans only impacts the initialization of the shapelets, and thus only the training time. Even if the shapelets are different, the number of shapelets should be identical, so the difference in score time should only come from floating-point computations and thus be very small.
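A quick, hypothetical way to check that equivalence at the KMeans level (not the exact script used here, and requiring scikit-learn >= 1.2 for `n_init='auto'`): with the default `'k-means++'` init, `'auto'` resolves to a single initialization, so it should match `n_init=1` exactly for the same `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

def centers(n_init):
    """Cluster centers for a given n_init, with a fixed random_state."""
    return KMeans(n_clusters=3, n_init=n_init,
                  random_state=42).fit(X).cluster_centers_

# With the default 'k-means++' init, n_init='auto' resolves to 1 initialization.
print(np.allclose(centers("auto"), centers(1)))  # expected: True
print(np.allclose(centers(10), centers(1)))      # may also be True on easy data
```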

valcarcexyz commented 1 year ago

OK, I can confirm your results. I was working with a multivariate dataset; as soon as I tried a single-variable one, I got the following results: [output: score and timing plot for the univariate dataset]

I believe that score_time can be ignored given its small magnitude.

I need to read the source code more carefully to understand what is happening when using the MultivariateClassifier class.

johannfaouzi commented 1 year ago

I just checked with a multivariate dataset (basic motions, available in pyts), and I still get the exact same shapelets for ('warn' and 10) on the one hand and ('auto' and 1) on the other hand, as expected. I used pyts.multivariate.classification.MultivariateClassifier to wrap the base classifier (with random_state still set to a fixed integer).

pyts.multivariate.classification.MultivariateClassifier independently fits a classifier for each feature. If only one base univariate classifier is provided, it is cloned so that there is one univariate classifier per feature afterwards. It does not do much more than that.
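As a usage sketch of that behaviour (assuming `LearningShapelets` as the base univariate classifier, which is not stated explicitly above): the single estimator is cloned once per feature, each clone is fitted independently, and the per-feature predictions are combined by hard voting.

```python
from pyts.classification import LearningShapelets
from pyts.datasets import load_basic_motions
from pyts.multivariate.classification import MultivariateClassifier

X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# One univariate classifier is provided; MultivariateClassifier clones it,
# fits one clone per feature, and combines the per-feature predictions
# by hard voting.
clf = MultivariateClassifier(LearningShapelets(random_state=42))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```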