timmocking opened this issue 10 months ago
Thanks for raising the issue.
In principle I don't see any blocker in allowing a mapping {class_label: n_components}
as input in place of a single fixed value.
I guess a few things to consider before implementing this: if we add it to GMMClassifier, then for consistency we might also need to add this feature to the other mixture methods. In fairness, our docs could do a better job of explaining the Bayesian variants of the GMM methods; it feels like they are mainly mentioned here.
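For concreteness, here is a minimal sketch of what such a mapping could look like; PerClassGMMClassifier and its dict handling are hypothetical illustrations, not the current sklego implementation:

import numpy as np
from sklearn.mixture import GaussianMixture

class PerClassGMMClassifier:
    # Hypothetical sketch: n_components is either an int (current behavior)
    # or a {class_label: n_components} mapping.
    def __init__(self, n_components=1):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.gmms_ = {}
        for label in self.classes_:
            if isinstance(self.n_components, dict):
                k = self.n_components[label]
            else:
                k = self.n_components
            self.gmms_[label] = GaussianMixture(n_components=k).fit(X[y == label])
        return self

    def predict(self, X):
        # Pick the class whose mixture assigns the highest log-likelihood.
        scores = np.column_stack([self.gmms_[c].score_samples(X) for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]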
Long story short, I think a variable number of Gaussians per class will outperform a single fixed number if one of the classes is associated with multiple clusters in feature space.
Consider the simple implementation below using some artificial data. Using grid search, I end up with better scores when using different numbers of Gaussians for each GMM.
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, ClassifierMixin

class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        # Fit a separate mixture on the samples of each class.
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0, covariance_type=self.covariance_type)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1, covariance_type=self.covariance_type)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)
        return self

    def predict(self, X):
        # Assign each sample to the class with the higher log-likelihood.
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)
# Generate some example data
X, y = make_blobs(n_samples=5000, cluster_std=[0.8, 2, 1], random_state=0)
# Convert to binary classification by combining classes 1 and 2
y = np.where(y == 2, 1, y)
# Define the parameter grid for grid search
param_grid = {
'n_components_class0': range(1, 11),
'n_components_class1': range(1, 11),
}
# Create the GMM classifier
gmm_classifier = GMMClassifier()
# Set up the grid search
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=5)
# Fit the model to the data
grid_search.fit(X, y)
# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
I end up with the best performance using four Gaussians for one class and three for the other.
I can imagine that incorporating this deviates significantly from the style of sklearn estimators and might be difficult to combine with other sklearn functionality such as grid search.
I'm familiar with the Bayesian framework, but using these models is infeasible for large amounts of data.
Consider the following:
import time

from sklearn.datasets import make_blobs
from sklego.mixture import GMMClassifier, BayesianGMMClassifier

X, y = make_blobs(n_samples=50000, cluster_std=[0.8, 2, 1], random_state=0)

# Time the EM-based classifier.
clf = GMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)

# Time the Bayesian variant on the same data.
clf = BayesianGMMClassifier(n_components=20, max_iter=1000)
start = time.time()
clf.fit(X, y)
stop = time.time()
print(stop - start)
The GMMClassifier takes less than a second on my PC, while the BayesianGMMClassifier takes around 40 seconds.
I ran the exact same script you shared and got these results.
Best Parameters: {'n_components_class0': 2, 'n_components_class1': 2}
Best Score: 0.9266
This is likely due to randomness: the make_blobs call is seeded, but GaussianMixture initializes via np.random unless random_state is set. Also, when I check the other parameter combinations, the differences seem very marginal.
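A minimal way to pin the comparison down, assuming the mixtures inside the toy classifier are also seeded (the adapted example further down does exactly that with random_state=0), is to fix the CV splits as well:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Fixed, shuffled folds: repeated runs now see identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid_search = GridSearchCV(GMMClassifier(), param_grid, cv=cv)
grid_search.fit(X, y)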
pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_n_components_class1', 'param_n_components_class0']].sort_values("mean_test_score")
This yields:
Out[10]:
mean_test_score param_n_components_class1 param_n_components_class0
90 0.9206 1 10
80 0.9208 1 9
60 0.9218 1 7
76 0.9222 7 8
56 0.9224 7 6
.. ... ... ...
21 0.9262 2 3
23 0.9262 4 3
12 0.9264 3 2
1 0.9264 2 1
11 0.9266 2 2
I'm certainly still open to the idea, but I would prefer to have a more convincing benchmark, also for the docs. Is there a dataset that you can share that would give a clear difference in score?
I guess a final comment on the time it takes to train. The grid search takes a while:
%time grid_search.fit(X, y)
CPU times: user 39.9 s, sys: 26.6 s, total: 1min 6s
Wall time: 10.6 s
Compared to that grid, the Bayesian method is a fair bit faster.
from sklego.mixture import BayesianGMMClassifier

# Using the same X, y as in the grid search above.
clf = BayesianGMMClassifier(n_components=10, max_iter=1000)
This yields:
In [18]: %time clf.fit(X, y)
CPU times: user 5.49 s, sys: 2.12 s, total: 7.62 s
Wall time: 2.09 s
Given the radio silence, @FBruzzesi would you be OK if we just update the ticket to reflect that the docs should be updated?
@koaning do you mean to make explicit in the docs that GMMClassifier does not support a variable number of components?
Hi @koaning, I'm still looking for a good benchmark dataset. We have some internal medical data that would fit this, but I imagine something that can be generated using sklearn datasets would be preferable.
I know that the Bayesian approach is faster than grid-searching, but we often reach a point where we don't have the compute to fit an extremely large Bayesian model on the data that we have (>1,000,000 points, 8+ features). For those datasets, EM-based approaches are the only feasible option.
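To make the scale argument concrete, here is a sketch of the kind of timing comparison involved; the simulated data below (make_blobs with 1,000,000 samples and 8 features) is an assumption standing in for our internal data:

import time
import numpy as np
from sklearn.datasets import make_blobs
from sklego.mixture import GMMClassifier, BayesianGMMClassifier

# Simulate data in the size range mentioned above.
X, y = make_blobs(n_samples=1_000_000, n_features=8, centers=3, random_state=0)
y = np.where(y == 2, 1, y)  # collapse to a binary target

# Time both variants; the Bayesian fit can take very long at this scale.
for cls in (GMMClassifier, BayesianGMMClassifier):
    clf = cls(n_components=10, max_iter=100)
    start = time.time()
    clf.fit(X, y)
    print(cls.__name__, time.time() - start)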
@FBruzzesi I made a separate issue for the docs https://github.com/koaning/scikit-lego/issues/614 to keep this thread on topic.
@timmocking instead of finding a dataset could you share a benchmark on simulated data? I'm really just interested in justifying the addition and also having a story to share in the docs. As long as the default behavior doesn't change it'll be easy to add/support but I do want to be able to teach people when to consider any new setting.
I hope this example works better than the last one. I adapted the "swiss roll" dataset to create a binary classification example where each class is better approximated with different numbers of Gaussian components.
import numpy as np
from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt
X, univar_pos = make_swiss_roll(n_samples=1500, random_state=0)
# Create a class separation in the swiss roll
y = np.where((univar_pos < 6) | (univar_pos > 13), 1, 0)
It looks as follows:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
fig.add_axes(ax)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=50, alpha=0.8)
ax.set_title("Binary swiss roll")
ax.view_init(azim=-66, elev=12)
_ = ax.text2D(0.8, 0.05, s="n_samples=1500", transform=ax.transAxes)
Adapting the code I shared earlier:
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.base import BaseEstimator, ClassifierMixin
class GMMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_components_class0=1, n_components_class1=1, covariance_type='full'):
        self.n_components_class0 = n_components_class0
        self.n_components_class1 = n_components_class1
        self.covariance_type = covariance_type

    def fit(self, X, y):
        # Fit a seeded mixture per class so results are reproducible.
        X_class0 = X[y == 0]
        X_class1 = X[y == 1]
        self.gmm_class0 = GaussianMixture(n_components=self.n_components_class0,
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class1 = GaussianMixture(n_components=self.n_components_class1,
                                          covariance_type=self.covariance_type,
                                          random_state=0)
        self.gmm_class0 = self.gmm_class0.fit(X_class0)
        self.gmm_class1 = self.gmm_class1.fit(X_class1)
        return self

    def predict(self, X):
        # Assign each sample to the class with the higher log-likelihood.
        prob_class0 = self.gmm_class0.score_samples(X)
        prob_class1 = self.gmm_class1.score_samples(X)
        return (prob_class1 > prob_class0).astype(int)
# Define the parameter grid for grid search
param_grid = {
'n_components_class0': range(1, 11),
'n_components_class1': range(1, 11),
}
# Create the GMM classifier
gmm_classifier = GMMClassifier()
# Set up the cross-validated grid search
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
grid_search = GridSearchCV(gmm_classifier, param_grid, cv=cv)
# Fit the model to the data
grid_search.fit(X, y)
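The winning configuration can then be printed the same way as in the earlier script:

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)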
Currently, GMMClassifier always uses the same number of components to fit a model on each class.
The number of components providing the best "fit" for the data is rarely the same across different classes. In a binary classification task, I would be interested in tuning n_components with grid search for each class independently.
I understand that this would be difficult to implement, as n_components is set when initializing GMMClassifier. Is this something that would be interesting to pursue?
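For what it's worth, a dict-valued n_components would not have to break grid search: GridSearchCV treats each candidate value as opaque (much like the class_weight dicts accepted by other estimators), so candidate mappings could be enumerated explicitly. A hypothetical sketch, assuming GMMClassifier accepted such a mapping:

from itertools import product
from sklearn.model_selection import GridSearchCV

# Hypothetical: assumes n_components={class_label: int} were supported.
param_grid = {
    "n_components": [{0: k0, 1: k1} for k0, k1 in product(range(1, 11), repeat=2)],
}
grid_search = GridSearchCV(GMMClassifier(n_components=1), param_grid, cv=5)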