Let's call it `GaussianMixtureIC`!

`LassoLarsIC` is to `LassoLars` as `GaussianMixtureIC` is to `GaussianMixture`.

Note: I think `LassoLarsIC` is the only `{model}IC` model selection estimator I see in sklearn. Most use cross-validation instead.
(feature request issue to sklearn)

Title: Gaussian Mixture with BIC/AIC

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful to me, but I think it would be great to have a class like `LassoLarsIC` that does the job automatically.

Add a class (say `GaussianMixtureIC`, for example) that automatically selects the best GM model based on BIC or AIC among a set of models. As mentioned above, the set of models would be parameterized by:
- the number of components
- the covariance constraint
- the initialization strategy (as in `mclust`, see below)

`mclust` is a package in R for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clustering) for the EM algorithm and enables automatic model selection via BIC across different combinations of clustering options (Scrucca et al., 2016).
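For a sense of what such a class would automate, here is a minimal sketch of the BIC sweep using the existing `GaussianMixture`; the synthetic data and the parameter grid are illustrative assumptions, not part of any proposed API:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative two-cluster data; any dataset would do.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fit a GMM for every (covariance_type, n_components) combination
# and keep the model with the lowest BIC.
best_bic, best_gmm = np.inf, None
for covariance_type in ("full", "tied", "diag", "spherical"):
    for n_components in range(1, 7):
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm

print(best_gmm.n_components, best_gmm.covariance_type, best_bic)
```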
Looks awesome! Linking their current demonstration is a good idea. I'd also add how many citations some of the main MCLUST papers have.
if it looks good, I'll post it
looks good to me
AutoGMM

`results_` to be a dict

Indeed, `GridSearchCV` would be helpful for sweeping over the parameters. However, taking into account the number of parameters one can define and the process of computing initial parameters, we believe that model selection for GMMs may merit a new class. Here is one possible implementation:
- Dynamic regularization (`reg_covar`): instead of using a single fixed regularization for all cases, we propose a dynamic regularization scheme where we start with 0 regularization and gradually increase it, checking for convergence, until hitting a pre-defined maximum regularization (see the sketch below).

What do people think of adding a new class in general? We are definitely open to other ways to implement it.
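A minimal sketch of the dynamic regularization idea; the x10 increase schedule, the `max_reg` default, and the helper name `fit_with_dynamic_reg` are illustrative assumptions, as neither schedule nor maximum is specified above:

```python
from sklearn.mixture import GaussianMixture

def fit_with_dynamic_reg(X, n_components, max_reg=1.0, **gmm_kwargs):
    """Start at zero regularization and grow reg_covar until the EM fit
    succeeds and converges, up to a pre-defined maximum."""
    reg_covar = 0.0
    while reg_covar <= max_reg:
        try:
            gmm = GaussianMixture(
                n_components=n_components, reg_covar=reg_covar, **gmm_kwargs
            ).fit(X)
            if gmm.converged_:
                return gmm
        except ValueError:
            # Singular covariances can make the fit fail outright;
            # retry with a larger regularization.
            pass
        reg_covar = 1e-6 if reg_covar == 0.0 else reg_covar * 10
    raise ValueError("EM did not converge even at the maximum regularization")
```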
Just to briefly clarify the mclust algorithm (what we are proposing to implement here):

1. Run agglomerative clustering to obtain initial cluster assignments for each value of `n_components`.
2. Fit `GaussianMixture`s using the various different parameters; roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
3. Select the best of these models by BIC.
As far as we can tell, the above isn't trivially accomplished with `GridSearchCV`, for a few reasons (some of which were already mentioned above, but repeated here for clarity):

- Computing initializations for each value of `n_components` is not hard, but does take a bit of code (see the sketch after this list).
- `GaussianMixture` currently can only be initialized by the means, precisions, and weights, and not by the responsibilities (e.g., cluster labels for each point, like agglomerative clustering gives us).

We are more than happy to talk about the details of how best to implement the above, should it be desired in sklearn. We do think that the functionality above is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, plus the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with the given tools like `GridSearchCV`, given all of the reasons above. While it wouldn't be impossible for a user to do it that way, there are enough steps involved (and it would require the user to be pretty familiar with mclust already) that we thought a specific class wrapping up all of the above would be convenient and useful for the community.
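As an illustration of the initialization step, here is a sketch of converting agglomerative cluster labels into the parameter initializations `GaussianMixture` does accept; `gmm_from_agglomerative_init` is a hypothetical helper, not existing API:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def gmm_from_agglomerative_init(X, n_components, covariance_type="full"):
    # GaussianMixture cannot be initialized from responsibilities/labels,
    # so convert the agglomerative labels into means and weights by hand.
    labels = AgglomerativeClustering(n_clusters=n_components).fit_predict(X)
    means = np.stack([X[labels == k].mean(axis=0) for k in range(n_components)])
    weights = np.bincount(labels, minlength=n_components) / len(X)
    return GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        means_init=means,
        weights_init=weights,
    ).fit(X)
```

Sweeping this over `n_components` and `covariance_type` and ranking the fits by `bic(X)` reproduces the selection loop described above.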
To clarify what we think is going on with the failing checks: one test in the estimator checks, `check_clustering(readonly_memmap=True)`, triggered the following error at the line `agg.fit(X_subset)` in our code:

`ValueError: buffer source array is read-only`
We observed that:

- The error occurred only for certain parameter combinations, such as `AgglomerativeClustering(affinity="euclidean", linkage="single")`. Using the same random data as in `check_clustering`, we were able to recreate the error by only calling `AgglomerativeClustering` (as shown in the gist).
- Running `AgglomerativeClustering` itself through the estimator checks is successful. However, in the `check_clustering` test, `AgglomerativeClustering` was initialized with the default parameters, `AgglomerativeClustering(affinity="euclidean", linkage="ward")` (also included in the gist). So this failing case is never actually run during the current sklearn test suite.

Also, those default parameters are one of the linkage/affinity combinations our code sweeps over by default (and hence tested in the estimator checks), and indeed they were not flagged in this test. We were wondering if the error has to do with using memmap-backed data in `AgglomerativeClustering` in some cases, for the following reasons:
- `check_clustering(readonly_memmap=False)` passed.
- The same error occurred for `check_estimators_fit_returns_self(readonly_memmap=True)` but not for `check_estimators_fit_returns_self(readonly_memmap=False)`.
- In the failing checks, the data are memmap-backed via `X, y = create_memmap_backed_data([X, y])`.
Advice greatly appreciated!
To clarify what we think is going on with the failing checks: one test in the estimator checks, `check_clustering(readonly_memmap=True)`, triggered the following error at the line `agg.fit(X_subset)` in our code:

`ValueError: buffer source array is read-only`
We observed that:

- The error occurred only for certain parameter combinations, such as `AgglomerativeClustering(affinity="euclidean", linkage="single")`.
- Running `AgglomerativeClustering` itself through the estimator checks is successful. However, in the `check_clustering` test, `AgglomerativeClustering` was only initialized with the default parameters, `AgglomerativeClustering(affinity="euclidean", linkage="ward")` (included in the gist). Therefore this failing case is never actually run during the current sklearn test suite; if the current test for `AgglomerativeClustering` were also run with the failing parameters, it would fail too.
- In the failing check, the data are memmap-backed: `X, y = create_memmap_backed_data([X, y])`.
So this ultimately seems like a problem that only comes up in `AgglomerativeClustering` with the combination of memmap data and certain choices of affinity and linkage (such as "euclidean" and "single").

Advice on how to proceed is greatly appreciated!
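For reference, a minimal sketch approximating the reproduction described above (the linked gist); the data shape and the use of the private `sklearn.utils._testing.create_memmap_backed_data` helper are our assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils._testing import create_memmap_backed_data  # private test util

rng = np.random.RandomState(42)
X = rng.rand(50, 3)  # random data comparable to what check_clustering uses

# Back the array with a read-only memmap, as the estimator checks do.
X_mm = create_memmap_backed_data(X)

# The default parameters (ward linkage) fit fine on memmap data...
AgglomerativeClustering(affinity="euclidean", linkage="ward").fit(X_mm)

# ...but single linkage raises: ValueError: buffer source array is read-only
AgglomerativeClustering(affinity="euclidean", linkage="single").fit(X_mm)
```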
@jovo @tliu68 I think just a slightly simplified version above that I'm happy with, thoughts?
I like it! I was worried that my updated one was too lengthy/wordy.
remove 'also'. otherwise, great!
We greatly appreciate your interest in our algorithm and your help in improving the code! However, we believe that scikit-learn users might benefit more if the code were integrated into the main package instead of scikit-learn-extra, for the following reasons:

- Automatic model selection for GMMs via BIC/AIC is a commonly needed workflow that is not currently built into scikit-learn. Inspired by mclust, our code also enables initialization with agglomerative clustering, which is not yet an option in the default scikit-learn model selection scheme for GMM.
- Our algorithm is based on mclust, which is one of the most widely used clustering packages in R, downloaded about 110k times last month and cited by over 3600 publications (Banfield & Raftery, 1993; Scrucca et al., 2016).

Given that our code could be used in either a basic or a more advanced GMM model selection scheme, we believe that it could be of general interest to scikit-learn users. Thank you again for your consideration!
We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a Python port of R's mclust, which is the de facto standard method for model-based clustering in the statistics community. First developed in 1993 (Model-based Gaussian and non-Gaussian clustering) and frequently updated since then {cite all the mclust papers, e.g., this one is mclust5 with 1209 citations}, it has accrued a total of XXX citations over nearly 30 years. It is also one of the most popular R packages ever, with ~110k downloads just this month {provide citation for that}. And it is one of the 5 core clustering packages in R (https://cran.r-project.org/web/views/Cluster.html).
Thank you again for your consideration!
We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a Python port of R's mclust, which is the de facto standard method for model-based clustering in the statistics community. First developed in 1993 (Model-based Gaussian and non-Gaussian clustering) and frequently updated since then (in 1998, 2002, 2003, 2006, 2012, 2014, and 2016), it has accrued more than 11K citations over nearly 30 years. It is also one of the most popular R packages ever, with ~110k downloads just this month. And it is one of the 5 core clustering packages in R.
Thank you again for your consideration!
also cite https://www.tandfonline.com/doi/abs/10.1198/016214502760047131 please

I think I have. It's the one from 2002, and I have added its citations. If it looks good, I'll post it?
posted already!
Merge AutoGMM (renaming to GaussianMixtureIC) into sklearn by June 2021.