johannfaouzi / pyts

A Python package for time series classification
https://pyts.readthedocs.io
BSD 3-Clause "New" or "Revised" License

add `n_init` argument to KMeans to suppress warning #143

Closed valcarcexyz closed 1 year ago

valcarcexyz commented 1 year ago

Preserve the same behaviour, but pass the `n_init` argument explicitly to the scikit-learn `KMeans` estimator to prevent this warning.
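For context, a minimal sketch of the idea (assuming scikit-learn 1.2+, where the default `n_init` is scheduled to change from 10 to `'auto'`): passing `n_init` explicitly keeps the historical behaviour and silences the `FutureWarning`.

```python
from sklearn.cluster import KMeans

# Passing n_init explicitly keeps the pre-change default of 10 initializations
# and suppresses the FutureWarning about the upcoming 'auto' default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
```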

johannfaouzi commented 1 year ago

Thank you for your PR! Just some quick thoughts:

I would be glad to have your opinion on this! Anyway, I will probably still merge your PR so that the warning is suppressed and the old behavior remains the current behavior.

valcarcexyz commented 1 year ago
johannfaouzi commented 1 year ago

The issue is that this change in scikit-learn is not backwards compatible, meaning that different scikit-learn versions (with the same pyts version) may lead to different results. So we have to decide if this behavior is acceptable or not.
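To make the concern concrete, here is a small, hypothetical illustration (not from the PR) of why the default matters: with a fixed `random_state`, `n_init=10` keeps the best of ten initializations (lowest inertia), so it can yield different centroids than a single initialization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# With overlapping clusters, a single initialization may land in a worse
# local optimum than the best of 10 initializations, even with a fixed seed.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=3.0, random_state=0)

inertia_1 = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X).inertia_
inertia_10 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X).inertia_
print(inertia_1, inertia_10)  # inertia_10 <= inertia_1; they may differ
```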

I will try to run KMeans on several datasets with 1 and 10 initializations and see:

valcarcexyz commented 1 year ago

I am working on a classification problem at my job, so as soon as I finish my workday, I will run the experiment to get the metrics you are asking for.

valcarcexyz commented 1 year ago

For my task, I did not see any score improvement with this model, but `n_init` has a performance impact. It is up to you to decide what you believe is best. What I do not understand is that when it is set to `'auto'` it should perform as if it were set to 10, but it does not seem to; can you confirm this behaviour? The results were obtained by cross-validation.

[output: cross-validated score and timing plot for the different `n_init` settings]
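For reference, a hedged sketch of how such cross-validated score and timing numbers can be collected with scikit-learn's `cross_validate` (the exact estimator and dataset used above are not shown in the thread; `LearningShapelets` on GunPoint is only an assumption here):

```python
import numpy as np
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
from sklearn.model_selection import cross_validate

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

# cross_validate reports test_score, fit_time and score_time per fold,
# which are the quantities compared in the plot above.
cv_results = cross_validate(LearningShapelets(random_state=42),
                            X_train, y_train, cv=3)
for key in ("test_score", "fit_time", "score_time"):
    print(key, np.round(cv_results[key], 3))
```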

johannfaouzi commented 1 year ago

Thank you very much for your experiments! I haven't had time to do mine yet, but:

valcarcexyz commented 1 year ago

I have set the random seed and the results are the same as in the previous plot.

johannfaouzi commented 1 year ago

I just tried on a toy dataset (GunPoint, available in pyts) and I get the exact same shapelets for:

I'm also surprised by the score time plot. KMeans only impacts the initialization of the shapelets, and thus only the training time. Even if the shapelets are different, the number of shapelets should be identical, so the difference in score time should only come from floating-point computations and thus be very small.
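A quick, hypothetical way to check that equivalence at the KMeans level (not the exact script used here, and requiring scikit-learn >= 1.2 for `n_init='auto'`): with the default `'k-means++'` init, `'auto'` resolves to a single initialization, so it should match `n_init=1` exactly for the same `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

def centers(n_init):
    """Cluster centers for a given n_init, with a fixed random_state."""
    return KMeans(n_clusters=3, n_init=n_init,
                  random_state=42).fit(X).cluster_centers_

# With the default 'k-means++' init, n_init='auto' resolves to 1 initialization.
print(np.allclose(centers("auto"), centers(1)))  # expected: True
print(np.allclose(centers(10), centers(1)))      # may also be True on easy data
```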

valcarcexyz commented 1 year ago

OK, I can confirm your results. I was working with a multivariate dataset; as soon as I tried a single-variable one, I got the following results: [output: score and timing plot for the univariate dataset]

I believe that score_time can be ignored given its small magnitude.

I need to read the source code more carefully to understand what is happening when using the MultivariateClassifier class.

johannfaouzi commented 1 year ago

I just checked with a multivariate dataset (basic motions, available in pyts), and I still get the exact same shapelets for ('warn' and 10) on the one hand and ('auto' and 1) on the other hand, as expected. I used pyts.multivariate.classification.MultivariateClassifier to wrap the base classifier (with random_state still set to a fixed integer).

pyts.multivariate.classification.MultivariateClassifier independently fits a classifier for each feature. If only one base univariate classifier is provided, it is cloned so that there is one univariate classifier per feature afterwards. It does not do much more than that.
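As a usage sketch of that behaviour (assuming `LearningShapelets` as the base univariate classifier, which is not stated explicitly above): the single estimator is cloned once per feature, each clone is fitted independently, and the per-feature predictions are combined by hard voting.

```python
from pyts.classification import LearningShapelets
from pyts.datasets import load_basic_motions
from pyts.multivariate.classification import MultivariateClassifier

X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# One univariate classifier is provided; MultivariateClassifier clones it,
# fits one clone per feature, and combines the per-feature predictions
# by hard voting.
clf = MultivariateClassifier(LearningShapelets(random_state=42))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```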