scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

DOC Example on model selection for Gaussian Mixture Models #30323

Open ogrisel opened 4 days ago

ogrisel commented 4 days ago

Describe the issue linked to the documentation

We have an example that illustrates how to use the BIC score to tune the number of components and the type of covariance matrix parametrization here:

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html

However, the BIC score is not meant to be computed in a CV loop; it should be computed directly on the training set, so we should not use it with a GridSearchCV call. Indeed, the BIC score already penalizes the number of parameters as a function of the number of data points in the training set.
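As a reminder (this is the textbook definition of the criterion, not quoted from the issue), the BIC's complexity penalty already grows with the sample size:

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

where `k` is the number of free parameters, `n` the number of training samples, and `L̂` the maximized likelihood. The `k ln(n)` term is precisely the sample-size-dependent penalty mentioned above, which is why wrapping BIC in a CV split is redundant.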

Instead, we should call the GridSearchCV on the default .score method of the GMM estimator, which computes the log-likelihood and is a perfectly fine metric to select the best model on held out data in a CV loop.
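A minimal sketch of that recommendation (reusing the toy data and parameter grid from the snippet below; `random_state=0` is my own choice for reproducibility):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Toy data: one "full"-covariance component and one spherical one,
# same construction as in the BIC loop below.
n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.concatenate([
    rng.normal(size=(n_samples, 2)) @ C,
    0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1]),
])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}

# With the default scoring=None, GridSearchCV falls back to the estimator's
# own .score method, which for GaussianMixture is the mean per-sample
# log-likelihood on the held-out fold.
grid_search = GridSearchCV(GaussianMixture(random_state=0), param_grid=param_grid)
grid_search.fit(X)
print(grid_search.best_params_)
```

Since `GaussianMixture.score` accepts `y=None`, no target is needed and the CV splits are made on `X` alone.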

Note that we can keep computing the BIC score for all the hyperparameter combinations, but we should do it in a single for loop (without any train-test split), e.g.:

from sklearn.model_selection import ParameterGrid
from sklearn.mixture import GaussianMixture
import pandas as pd
import numpy as np

n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
component_1 = rng.normal(size=(n_samples, 2)) @ C  # general
component_2 = 0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1])  # spherical
X = np.concatenate([component_1, component_2])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}

# Fit each hyperparameter combination on the full dataset and compute
# its BIC on that same data (no train-test split for BIC).
bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

# Lower BIC is better: sort ascending so the best model comes first.
bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
bic_evaluations.head()

So in summary I would recommend to:

- call GridSearchCV with the default .score method of the GMM estimator (the held-out log-likelihood) instead of a BIC-based scorer;
- keep the BIC computation, but run it in a plain loop over the parameter grid on the full training set, as above.

We can then check that the two methods select the same model: 2 components with the "full" (non-spherical) parametrization of the covariance matrix.

Yuvraj-Pradhan-27 commented 4 days ago

I think the changes you asked for should be implemented, since they ensure correct usage of model selection methods with GaussianMixture.

I tried the recommendations you mentioned; here is the result:

[Screenshot, 2024-11-22: output of the suggested code]

Yuvraj-Pradhan-27 commented 3 days ago

@glemaitre could you review my comment and let me know whether the changes should be made?

ogrisel commented 3 days ago

@Yuvraj-Pradhan-27: your original comment feels like the output of an LLM paraphrasing the description I wrote in the issue. If that's the case, please refrain from posting such redundant and verbose comments in the future. If that's not the case, then I am sorry ;)

I see that you opened two PRs:

I closed the first one because it felt subsumed by #30329.

When opening PRs, please refer to the issue they are related to in the description to make it easier to discover them and avoid having several people concurrently working on the same issue.

Yuvraj-Pradhan-27 commented 3 days ago

I did paraphrase the description you wrote; the point was only to support your recommendations, and I executed the code you mentioned. I am new to GitHub contributions, and I am really sorry if I did something unwanted. I will take care of it next time.

ogrisel commented 3 days ago

No problem. If you are new to contributing to scikit-learn, please make sure to follow the guidelines from: https://scikit-learn.org/dev/developers/index.html

And welcome to the project!

Yuvraj-Pradhan-27 commented 3 days ago

I will go through the document thoroughly, thanks for the help.