Thank you for your PR! Just some quick thoughts:

- The default value of `n_init` is 10; it will change to `'auto'` in version 1.4, and the concrete value will then be 1 because the initialization is performed using `'k-means++'` (the default value for `init`). So, if `n_init` is set to 10, it will not correspond to the default values of scikit-learn's KMeans in versions 1.4 and later. It is not necessarily an issue, just something to keep in mind.
- The user could also be given control over the KMeans parameters (e.g., with a `kmeans_kwargs` argument for the class).

I would be glad to have your opinion on this! Anyway, I will probably still merge your PR so that the warning is suppressed and the old behavior is also the current behavior.
Should pyts keep a fixed value of `n_init`? Or should it expect the user to deal with the warning? The issue is that this change in scikit-learn is not backwards compatible, meaning that different scikit-learn versions (with the same pyts version) may lead to different results. So we have to decide whether this behavior is acceptable or not.
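For reference, a minimal sketch of the workaround being discussed, on a toy random dataset rather than pyts code: pinning `n_init` explicitly freezes the behavior across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = rng.randn(100, 2)

# Passing n_init explicitly reproduces the pre-1.4 default
# (10 initializations) on every scikit-learn version and silences
# the FutureWarning raised by versions 1.2 and 1.3.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_.shape)  # (3, 2)
```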
I will try to run KMeans on several datasets with 1 and 10 initializations and compare the results.
I am working on a classification problem in my job, so as soon as I finish my workday, I will try to run the experiments to get the metrics you are asking for.
For my task, I did not see any score improvement with this model, but the `n_init` value has a performance impact. It is up to you to decide what you believe is best. What I do not understand is that when it is set to `'auto'`, it should perform as if it were set to 10, but it does not seem to; can you confirm this behaviour? The results were obtained by cross-validation.
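A hedged sketch of this kind of timing comparison, using plain `KMeans` on synthetic data so that `n_init` can be varied directly (the actual classification pipeline mentioned above is not shown in the thread):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_validate

rng = np.random.RandomState(0)
X = rng.randn(500, 8)

# cross_validate reports fit_time and score_time for each fold;
# more initializations should mostly increase fit_time.
for n_init in (1, 10):
    cv = cross_validate(KMeans(n_clusters=2, n_init=n_init, random_state=0), X)
    print(n_init, cv["fit_time"].mean(), cv["score_time"].mean())
```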
Thank you very much for your experiments! I didn't have time to do mine yet, but `'auto'` should be similar to `1` because the default value for `init` is `'k-means++'` (see the description of `n_init`: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

Did you set the `random_state` parameter to an integer? It is used to control all the randomness in `LearningShapelets`: the initialization of the shapelets (using k-means) and the initialization of the coefficients of the logistic regression model.

I have set the random seed and the results are the same as in the previous plot.
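For illustration, a small sketch of fixing the seed in `LearningShapelets`, using the GunPoint loader shipped with pyts:

```python
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

# random_state seeds both the k-means initialization of the shapelets
# and the coefficients of the logistic regression model, so repeated
# fits with the same integer produce identical shapelets_.
clf = LearningShapelets(random_state=42).fit(X_train, y_train)
print(clf.shapelets_)
```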
I just tried on a toy dataset (GunPoint, available in pyts) and I get the exact same shapelets for:

- `n_init='warn'` and `n_init=10`: this is normal because the previous default value of `n_init` was 10 and the change will only occur in version 1.4 and later.
- `n_init='auto'` and `n_init=1`: this is normal because the default value for `init` is `'k-means++'`, which turns `n_init='auto'` into `n_init=1` internally, as mentioned in the documentation.

I'm also surprised by the score time plot. KMeans only impacts the initialization of the shapelets, thus the training time. Even if the shapelets are different, the number of shapelets should be identical, and the difference in score time should only come from double-precision compute, and thus be very low.
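The `'auto'`/`1` equivalence can be checked directly with plain `KMeans` (a sketch, assuming scikit-learn 1.2 or 1.3, where `n_init='auto'` is already accepted):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).randn(200, 5)

def centers(n_init):
    return KMeans(n_clusters=3, n_init=n_init, random_state=0).fit(X).cluster_centers_

# With the default init='k-means++', 'auto' resolves to 1 internally,
# so both runs perform a single seeded initialization.
assert np.allclose(centers('auto'), centers(1))
```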
Ok, I can confirm your results. I was working with a multivariate dataset; as soon as I tried a univariate one, I got the following results:

I believe that `score_time` can be ignored due to its magnitude.

I need to read the source code more carefully to understand what is happening when using the `MultivariateClassifier` class.
I just checked with a multivariate dataset (BasicMotions, available in pyts), and I still get the exact same shapelets for (`'warn'` and `10` on the one hand) and (`'auto'` and `1` on the other hand), as expected. I used `pyts.multivariate.classification.MultivariateClassifier` to wrap the base classifier (with `random_state` still set to a fixed integer).
`pyts.multivariate.classification.MultivariateClassifier` independently fits a classifier for each feature. If only one base univariate classifier is provided, it is cloned so that there is one univariate classifier per feature afterwards. It does not do much more than that.
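As an illustration of that wrapping, a sketch along the lines of the check described above:

```python
from pyts.classification import LearningShapelets
from pyts.datasets import load_basic_motions
from pyts.multivariate.classification import MultivariateClassifier

X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# One clone of the base classifier is fitted per feature; the final
# prediction is obtained by hard (majority) voting over the features.
clf = MultivariateClassifier(LearningShapelets(random_state=42))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print((y_pred == y_test).mean())  # accuracy
```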
Preserve the same behaviour, but pass the `n_init` argument to the scikit-learn `KMeans` call to prevent this warning.
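A minimal sketch of what such a change could look like (a hypothetical helper, not the actual pyts source):

```python
from sklearn.cluster import KMeans

def _kmeans_shapelet_init(X, n_shapelets, random_state=None):
    """Hypothetical stand-in for the internal shapelet initialization."""
    km = KMeans(
        n_clusters=n_shapelets,
        n_init=10,  # pinned to the old default so no FutureWarning is raised
        random_state=random_state,
    )
    return km.fit(X).cluster_centers_
```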