merge AutoGMM into Sklearn

tliu68 commented 3 years ago

merge AutoGMM (renaming to GaussianMixtureIC) into Sklearn by 2021 June

bdpedigo commented 3 years ago

Let's call it GaussianMixtureIC!

bdpedigo commented 3 years ago

LassoLarsIC is to LassoLars as GaussianMixtureIC is to GaussianMixture

bdpedigo commented 3 years ago

Note: I think LassoLarsIC is the only {model}IC model selection estimator I see in Sklearn. Most use cross-validation instead.

tliu68 commented 3 years ago

(feature request issue to sklearn)

title: Gaussian Mixture with BIC/AIC

Describe the workflow you want to enable

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameter such as the number of components and covariance constraint. This demonstration is very helpful to me but I think it might be great to have a class like LassoLarsIC that does the job automatically.

Describe your proposed solution

Add a class (say GaussianMixtureIC, for example) that automatically selects the best GM model based on BIC or AIC among a set of models. As mentioned above, the set of models would be parameterized by:

Initialization scheme, which could be random, k-means or agglomerative clusterings (as done in mclust, see below)
Covariance constraints
Number of components

Additional context

mclust is a package in R for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clusterings) for EM algorithm and enables automatic model selection via BIC for different combinations of clustering options (Scrucca et al., 2016).

bdpedigo commented 3 years ago

looks awesome! Linking their current demonstration is a good idea. I'd also add how many citations some of the main MCLUST papers have

tliu68 commented 3 years ago

looks awesome! Linking their current demonstration is a good idea. I'd also add how many citations some of the main MCLUST papers have

if it looks good, I'll post it

bdpedigo commented 3 years ago

looks good to me

tliu68 commented 3 years ago

modifications to `AutoGMM`

top priority

[x] clean up input checking
[x] add a function for input checking and call it after fit
[x] modify the inheritance
[x] lean up L381-398
[x] change results_ to be a dict
[x] clean up input testing
[x] modify docstring eg. add a small example
[x] modify the GMM model selection example in sklearn to use our algorithm
[x] check the code against contributing guidelines/checklist
[ ] ..

lower priority (under discussion)

[ ] clean up param_grid
[ ] rethink which step of the EM algorithm to start on
[ ] #306
[ ] ..

tliu68 commented 3 years ago

Indeed, GridSearchCV would be helpful for sweeping over the parameters. However, taking into account the number of parameters one can define and the process of computing initial parameters, we believe that the model selection for GMM may merit a new class. Here is one possible implementation:

initialization: in addition to random inits and kmeans, we would like to incorporate hierarchical agglomerative clustering to compute initial means, weights and precisions. Using AgglomerativeClustering is computational advantageous as @bdpedigo pointed out above(link).
model fitting:
1. cross over all clusterings options including covariance_type, n_components, and linkage and affinity for AgglomerativeClustering to fit the models
2. regularization (reg_covar): instead of using a single fixed regularization for all cases, we propose a dynamic regularization scheme where we start with 0 regularization and gradually increase it while checking for convergence until hitting a pre-defined maximum regularization.
model selection: select the best model based on BIC/AIC.

What do people think of adding a new class in general? And we are definitely open to other ways to implement it.

bdpedigo commented 3 years ago

Just to briefly clarify the mclust algorithm (what we are proposing to implement here):

run agglomerative clustering (with different options for linkage, affinity, etc.) to generate a set of initial labelings. The same run of agglomerative clustering can be used for various levels of n_components.
fit GaussianMixtures using the various different parameters. roughly this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
choose the best model based on BIC.

As far as we can tell, the above isn't trivially accomplished with GridSearchCV for a few reasons (some of which were already mentioned above, but just repeating here for clarity):

Running agglomerative clustering with multiple different settings, then extracting the appropriate "flat" clustering for the right value of n_components is not hard, but does take a bit of code.
Computing the initial parameters given these clusterings also takes a bit of code, as GaussianMixture currently can only be initialized by the means, precisions, and weights and not by the responsibilities (e.g. cluster labels for each point like agglomerative gives us).
There is no cross-validation involved, meaning one would have to use the "dummy" cross-validation solution described above.
(maybe) there are also some details about how mclust handles the covariance regularization which don't lend themselves to naive grid search easily.

We are more than happy to talk about details of how to best implement the above, should it be desired in sklearn. We do think that the functionality above is (1) useful, given the success of mclust (for instance mclust had 168k downloads last month, and the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with the given tools like GridSearchCV given all of the reasons above. While it wouldn't be impossible for a user to do it that way, there are enough steps involved (and would require the user to be pretty familiar with mclust already) that we thought a specific class to wrap up all of the above would be convenient and useful for the community.

tliu68 commented 3 years ago

To clarify what we think is going on with the failing checks. One test check_clustering(readonly_memmap=True) in the estimator checks triggered the following error at the line agg.fit(X_subset) in our code:

ValueError: buffer source array is read-only

We observed that

the error was triggered when running for some specific parameters AgglomerativeClustering(affinity="euclidean", linkage="single"). Using the same random data in check_clustering, we were able to recreate the error only calling AgglomerativeClustering (as shown in the gist)
testing AgglomerativeClustering with the estimator checks is successful. However, in the check_clustering test, AgglomerativeClustering was initialized with the default parameters AgglomerativeClustering(affinity="euclidean", linkage="ward") (also included in the gist). So this failing test is never actually run during the current sklearn test suite. Also, those parameters are one set of the parameters for linkage and affinity our code sweeps over by default (and hence tested in the estimator checks), and indeed, they were not flagged in this test.
when naively changing our default affinity and linkage parameter to only "euclidean" and "ward" instead of all possible combinations, the test passed.

We were wondering if the error had to do with using memmap backend data for AgglomerativeClustering for some cases due to the following reasons:

the test check_clustering(readonly_memmap=False) passed
the same error was triggered for another test in estimator checks check_estimators_fit_returns_self(readonly_memmap=True) but not for check_estimators_fit_returns_self(readonly_memmap=False)
the error is gone when the following line in the gist is deleted X, y = create_memmap_backed_data([X, y])

Advice greatly appreciated!

bdpedigo commented 3 years ago

To clarify what we think is going on with the failing checks: one test in the estimator checks check_clustering(readonly_memmap=True) triggered the following error at the line agg.fit(X_subset) in our code:

ValueError: buffer source array is read-only

We observed that:

the error was triggered when running for some specific parameters AgglomerativeClustering(affinity="euclidean", linkage="single").
testing AgglomerativeClustering with the estimator checks is successful. However, in the check_clustering test, AgglomerativeClustering was only initialized with the default parameters AgglomerativeClustering(affinity="euclidean", linkage="ward") (included in the gist). Therefore this failing test is never actually run during the current sklearn test suite.
Our code only triggers this error because we sweep over agglomerative initialization schemes, including (affinity="euclidean", linkage="ward"). If the current test for AgglomerativeClustering also tested with these parameters, the test would also fail.
The test only fails when using memmap data. I.e. the error is gone when the following line in the gist is deleted X, y = create_memmap_backed_data([X, y])

So, this ultimately seems like a problem that only comes up in AgglomerativeClustering with the combination of memap data and certain choices of affinity and linkage (such as "euclidean" and "single").

Advice on how to proceed is greatly appreciated!

bdpedigo commented 3 years ago

@jovo @tliu68 i think just a slightly simplified version above that im happy with, thoughts?

tliu68 commented 3 years ago

@jovo @tliu68 i think just a slightly simplified version above that im happy with, thoughts?

I like it! I was worried about my updated one was too lengthy/wordy.

jovo commented 3 years ago

remove 'also'. otherwise, great!

tliu68 commented 3 years ago

We greatly appreciate your interest in our algorithm and your help in improving the code! However, we believe that scikit-learn users might benefit more if the code were integrated into the main package instead of scikit-learn-extra due to the following reasons:

our code deals with Gaussian mixture modeling (GMM), a fundamental tool in clustering, and automates the model selection process documented in scikit-learn. Inspired by mclust, our code also enables initialization with agglomerative clustering, which is not yet an option in the default scikit-learn model selection scheme for GMM.
to our knowledge, our code is the first and only Python version of mclust which is one of the most widely used clustering packages in R, downloaded about 110k times last month and cited by over 3600 publications (Banfield & Raftery, 1993; Scrucca et al., 2016).

Given that our code could be used in either a basic or more advanced GMM model selection scheme, we believe that it could be of general interest to the scikit-learn users. Thank you again for your consideration!

jovo commented 3 years ago

We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a python port of R's mclust, which is the de facto standard method for model-based clustering from the statistics community. First developed in 1993 (Model-based Gaussian and non-Gaussian clustering, and frequently updated since then {cite all the mclust papers, eg, this one is mclust5 with 1209 citations}, yielding a total of XXX citations over nearly 40 years. It is also one of the most popular R packages ever, with ~110k downloads just this month {provide citation for that}. And, it is also one of the 5 core clustering packages in R (https://cran.r-project.org/web/views/Cluster.html).

Thank you again for your consideration!

tliu68 commented 3 years ago

We greatly appreciate your interest in our algorithm and your help in improving the code! We strongly believe that scikit-learn is the right place for this PR.

This PR is essentially a python port of R's mclust, which is the de facto standard method for model-based clustering from the statistics community. First developed in 1993 (Model-based Gaussian and non-Gaussian clustering), and frequently updated since then (in 1998, 2002, 2003, 2006, 2012, 2014, and 2016), yielding a total of more than 11K citations over nearly 30 years. It is also one of the most popular R packages ever, with ~110k downloads just this month. And, it is also one of the 5 core clustering packages in R.

Thank you again for your consideration!

jovo commented 3 years ago

also cite https://www.tandfonline.com/doi/abs/10.1198/016214502760047131 please

tliu68 commented 3 years ago

also cite https://www.tandfonline.com/doi/abs/10.1198/016214502760047131 please

I think I have. It's the one in 2002. And I have added its citations. If it looks good, I'll post it?

tliu68 commented 3 years ago

also cite https://www.tandfonline.com/doi/abs/10.1198/016214502760047131 please

I think I have. It's the one in 2002. And I have added its citations. If it looks good, I'll post it?

posted already!

graspologic-org / graspologic