koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.28k stars 117 forks source link

[FEATURE] Variable number of n_components per class in GMMClassifier #608

Open timmocking opened 10 months ago

timmocking commented 10 months ago

Currently, GMMClassifier always uses the same number of components to fit a model on each class.

The number of components providing the best "fit" for the data is rarely the same across different classes. In a binary classification task, I would be interested in tuning n_components using gridsearch for each class independently.

I understand that this would be difficult to implement, as n_components is set when initializing GMMClassifier. Is this something that would be interesting to pursue?

FBruzzesi commented 10 months ago

Thanks for raising the issue. In principle I don't see any blocker in allowing a mapping {class_label: n_components} as input in place of a single fixed value.

koaning commented 10 months ago

I guess a few things to consider before implementing this.

  1. @timmocking do you have a dataset/use-case that demonstrates that setting these components manually is beneficial?
  2. If we make this change for the GMMClassifier then for consistency we might also need to add this feature to the other mixture methods.
  3. Our library also supports GMM techniques build on top of the BayesianGuassianMixtureModel ... doesn't that offer a technique to have different components per class in our case as well?
koaning commented 10 months ago

In fairness, our docs could do a better job of explaining the bayesian variants of the GMM methods. It feels like they are mainly mentioned here.

timmocking commented 10 months ago
  1. I'm interested in a particular use-case in biological data analysis. A common technique (flow cytometry) for diagnostics of diseases in hospitals around the world involves datasets containing markers (columns) that are measured for a large number of cells (rows). The dimensions for this type of data are typically 8 - 32 columns, at least 500,000 rows for a single measurement. Data analysis in hospitals is currently performed manually by identifying distinct cell clusters by hand. I'm involved in research which aims to automate these procedures using clustering and/or classification algorithms. When aiming to identify a cell population, the current "go-to" approach is to cluster the data and identify a cluster enriched for this cell type (e.g., 95%) based on manual annotation. I'm interested to see if a binary classification algorithm based on two GMMs can outperform this. However, you can imagine that a single cell population might be modeled more efficiently using a single Gaussian, while all other cell types in a dataset might be more better described by multiple Gaussians.

Long story short, I think having variable numbers of Gaussians will outperform if one of the classes is associated with multiple clusters in feature space.

Consider a simple implementation below using some artificial data. Using gridsearch, I end up with better scores when using different numbers of Gaussians for each GMM.

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification, make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, ClassifierMixin

class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0, covariance_type=self.covariance_type)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1, covariance_type=self.covariance_type)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)

    def predict(self, X):
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)

# Generate some example data
X, y = make_blobs(n_samples=5000, cluster_std=[0.8, 2, 1], random_state=0)
# Convert to binary classification by combinining class 1 and 2
y = np.where(y == 2, 1, y)

# Define the parameter grid for grid search
param_grid = {
    'n_components_class0': range(1, 11),
    'n_components_class1': range(1, 11),
}

# Create the GMM classifier
gmm_classifier = GMMClassifier()

# Create the pipeline for grid search
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=5)

# Fit the model to the data
grid_search.fit(X, y)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

I end up with the best performance for 4 gaussians for one class and 3 for the other.

  1. I can imagine that incorporating this deviates significantly from the style of sklearn classes and might be difficult to combine with other sklearn functionalities such as gridsearch.

  2. I'm familiar with the Bayesian framework, but using these models is infeasible for large numbers of data.

Consider the following:

from sklego.mixture import GMMClassifier, BayesianGMMClassifier

X, y = make_blobs(n_samples=50000, cluster_std=[0.8, 2, 1], random_state=0)

clf = GMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)

clf = BayesianGMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)

The GMMClassifier takes less than a second on my PC, while the BayesianGMMClassifier takes around 40 seconds.

koaning commented 10 months ago

I ran the exact same script you shared and got these results.

Best Parameters: {'n_components_class0': 2, 'n_components_class1': 2}
Best Score: 0.9266

This is likely due to randomness, since the data generation tools in sklearn all use some form of np.random. Also, when I check the difference with the other params, the difference seems very marginal.

pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_n_components_class1', 'param_n_components_class0']].sort_values("mean_test_score")

This yields:

Out[10]: 
    mean_test_score param_n_components_class1 param_n_components_class0
90           0.9206                         1                        10
80           0.9208                         1                         9
60           0.9218                         1                         7
76           0.9222                         7                         8
56           0.9224                         7                         6
..              ...                       ...                       ...
21           0.9262                         2                         3
23           0.9262                         4                         3
12           0.9264                         3                         2
1            0.9264                         2                         1
11           0.9266                         2                         2

I'm certainly still open to the idea, I would prefer to have a more convincing benchmark also for the docs. Is there a dataset that you can share that would give a clear difference in score?

I guess a final comment on the time it takes to train. The grid search takes a while:

%time grid_search.fit(X, y)

CPU times: user 39.9 s, sys: 26.6 s, total: 1min 6s
Wall time: 10.6 s

Compared to that grid, the bayesian method is a fair bit faster.

from sklego.mixture import GMMClassifier, BayesianGMMClassifier
clf = BayesianGMMClassifier(n_components=10, max_iter=1000)

This yields:

In [18]: %time clf.fit(X, y)

CPU times: user 5.49 s, sys: 2.12 s, total: 7.62 s
Wall time: 2.09 s
koaning commented 9 months ago

Given the radio silence, @FBruzzesi would you be OK if we just update the ticket to reflect that the docs should be updated?

FBruzzesi commented 9 months ago

@koaning do you mean to make explicit in the docs that GMMClassifier does not support a variable number of components?

timmocking commented 9 months ago

Hi @koaning, I'm still looking for a good benchmark dataset. We have some internal medical data that would fit this, but I imagine something that can be generated using sklearn datasets would be preferable.

I know that the Bayesian approach is faster than grid-searching, but we often reach a point where we don't have the compute to fit an extremely large Bayesian model on the data that we have (>1,000,000 points, 8+ features). For those datasets, E-M based approaches are the only feasible option.

koaning commented 9 months ago

@FBruzzesi I made a separate issue for the docs https://github.com/koaning/scikit-lego/issues/614 to keep this thread on topic.

@timmocking instead of finding a dataset could you share a benchmark on simulated data? I'm really just interested in justifying the addition and also having a story to share in the docs. As long as the default behavior doesn't change it'll be easy to add/support but I do want to be able to teach people when to consider any new setting.

timmocking commented 9 months ago

I hope this example works better than the last one. I adapted the "swiss roll" dataset to create a binary classification example where each class is better approximated with different numbers of Gaussian components.

import numpy as np
from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt

X, univar_pos = make_swiss_roll(n_samples=1500, random_state=0)
# Create a class seperation in the swiss roll
y = np.where((univar_pos < 6) | (univar_pos > 13), 1, 0)

It looks as follows:

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
fig.add_axes(ax)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=50, alpha=0.8)
ax.set_title("Binary swiss roll")
ax.view_init(azim=-66, elev=12)
_ = ax.text2D(0.8, 0.05, s="n_samples=1500", transform=ax.transAxes)

swiss_roll

Adopting the code I shared earlier:

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.base import BaseEstimator, ClassifierMixin

class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0, 
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1, 
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)

    def predict(self, X):
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)

# Define the parameter grid for grid search
param_grid = {
    'n_components_class0': range(1, 11),
    'n_components_class1': range(1, 11),
}

# Create the GMM classifier
gmm_classifier = GMMClassifier()

# Create the pipeline for grid search
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=cv)

# Fit the model to the data
grid_search.fit(X, y)

Screenshot from 2024-01-30 12-13-25