Feature Request: Cooperative Learning via Custom Objective or Kind

fsaforo1 commented 12 months ago

@interpret-ml @paulbkochms

First off, thanks for building this amazing tool!

The Request

I am interested in exploring the implementation of cooperative learning in EBMs through a specialized loss objective. This objective would allow EBMs to learn from an ensemble of additive models, each corresponding to different feature sets, and encourage these models to work in a cooperative manner.

Practical Example: Air Quality and Public Health Modeling

Scenario: Environmental scientists are tasked with assessing the influence of air pollution on public health within urban settings. They collate data from diverse streams:

Meteorological Data (View A): Metrics include temperature, humidity, wind speed, and barometric pressure.
Pollutant Concentration Data (View B): Concentration levels of particulates and gases such as PM2.5, PM10, NO2, SO2, CO, and O3.
Socioeconomic Data (View C): Data encompasses population density, rates of urbanization, and indices of socioeconomic status.

Meteorological patterns are known to modulate pollutant dispersal and concentrations, which in turn have direct consequences on health outcomes. Socioeconomic factors further modulate a population's exposure and susceptibility to pollution-related health risks.

Proposed Objective Function

I propose a cooperative loss objective to be optimized, as follows, considering the first two views for simplicity:

$$ \min_{f, g} \frac{1}{2} \sum_i \left(y_i - \sum_j f_j(A_{ij}) - \sum_k g_k(B_{ik})\right)^2 + \frac{\rho}{2} \sum_i \left(\sum_j f_j(A_{ij}) - \sum_k g_k(B_{ik})\right)^2 $$

Where:

$f_j(\cdot)$ and $g_k(\cdot)$ are the per-feature terms of the EBMs trained on feature sets A and B, respectively.
The first term quantifies the prediction error for all observations $i$.
The second term imposes an agreement constraint between the models on A and B, modulated by the hyperparameter $\rho$.
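
For concreteness, the objective can be evaluated directly once each view's model has produced its per-row predictions. The sketch below assumes those predictions are already available as NumPy arrays; `cooperative_objective`, `pred_a`, and `pred_b` are illustrative names and not part of interpret.

```python
import numpy as np

def cooperative_objective(y, pred_a, pred_b, rho):
    """Evaluate the two-view cooperative loss.

    pred_a[i] stands for sum_j f_j(A_ij) and pred_b[i] for sum_k g_k(B_ik).
    """
    y = np.asarray(y, dtype=float)
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    # Prediction error of the summed views
    fit_term = 0.5 * np.sum((y - pred_a - pred_b) ** 2)
    # Disagreement between the two views, scaled by rho
    agreement_term = 0.5 * rho * np.sum((pred_a - pred_b) ** 2)
    return fit_term + agreement_term

# Toy values, chosen only to exercise the function
y = np.array([1.0, 2.0, 3.0])
pred_a = np.array([0.5, 1.0, 1.0])
pred_b = np.array([0.5, 0.5, 1.0])
loss_rho0 = cooperative_objective(y, pred_a, pred_b, rho=0.0)
loss_rho1 = cooperative_objective(y, pred_a, pred_b, rho=1.0)
```

With $\rho = 0$ only the prediction error contributes; any positive $\rho$ adds a penalty wherever the two views disagree.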

Implication of the $\rho$ Parameter: The parameter $\rho$ is essential for tuning the degree of cooperation between the different data views:

$\rho = 0$ aligns with a single model built on the combined features from all views, A and B. This is the usual approach of putting every feature into one model.
$\rho = 1$ parallels traditional ensemble methods, where separate EBMs are trained for A and B, with their predictions combined post hoc.
A high $\rho$ value enforces a strong alignment between the models, underscoring the direct influence of meteorological conditions on air quality.
A mid-range $\rho$ value offers a balance, allowing each model to contribute to the final prediction while maintaining a level of agreement that reflects the complex interplay between weather patterns and pollution's health impacts.
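
One way to see what $\rho$ does is to hold one view's model fixed and minimize over the other. This is a derivation from the proposed objective, not a description of how EBMs are currently trained. Write $F_i = \sum_j f_j(A_{ij})$ and $G_i = \sum_k g_k(B_{ik})$. For fixed $G$, each summand of the objective is a quadratic in $F_i$:

$$ \frac{1}{2}\left(y_i - F_i - G_i\right)^2 + \frac{\rho}{2}\left(F_i - G_i\right)^2 = \frac{1 + \rho}{2}\left(F_i - \frac{y_i - (1 - \rho)\, G_i}{1 + \rho}\right)^2 + \text{const} $$

So updating the A model amounts to a least-squares fit to the adjusted target $\frac{y_i - (1 - \rho) G_i}{1 + \rho}$, and symmetrically the B model is fit to $\frac{y_i - (1 - \rho) F_i}{1 + \rho}$. With $\rho = 0$ each view fits the other's residual $y_i - G_i$, which is ordinary backfitting of one additive model over all features. With $\rho = 1$ each view's target is $y_i / 2$ regardless of the other view, so the two models are fit separately and their sum is the average of two standalone models.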

What the $\rho$ parameter could be doing during training

The $\rho$ parameter essentially controls the point in learning at which what has been learned from the different views is combined. For example:

The summary from A (the independent EBM trained only on features from A) could have a very strong synergistic effect on predictive performance when interacted with just one of the features in B, rather than with the summary of B (the EBM for B), which may be distorted because it combines that feature with the other features in B. You could lose that additional information if you simply ensemble the EBMs for A and B.

Some Thoughts on Potential Implementation

group interaction pairs: Is it possible to automatically assess systematic pairs of interactions, where a set of features is interacted with a single feature (or with another set of features)? An example of explicitly defining such pairs: ExplainableBoostingRegressor(interactions = [([X_feats], [B_feats]), ([C_feats], 'feat_12')]). Ideally, the strength of such group pairs would be assessed automatically during training.
merge_ebms: This could be possible with a method similar to merge_ebms if:
- it allowed merging models with different feature sets, and if
- training could be done within this class. It could then take a list of the un-fitted EBM instances and an objective, similar to how objectives with hyperparameters, such as Tweedie, are implemented
- the fit method then takes as input X = [A, B, C] and y = y
- Since EBMs are additive, would an implementation of this custom loss be as simple as fitting independent models and then optimizing their predictions with the proposed loss function, as follows:

import numpy as np
from interpret.glassbox import ExplainableBoostingRegressor
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, RegressorMixin

class CooperativeMultiViewEBMRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, rho=0.5, view1_params=None, view2_params=None):
        # Store hyperparameters only; the EBMs are built in fit()
        self.rho = rho
        self.view1_params = view1_params
        self.view2_params = view2_params

    def fit(self, X1, X2, y):
        y = np.asarray(y, dtype=float)

        # Build and fit the individual EBMs, one per view
        self.model1 = ExplainableBoostingRegressor(**(self.view1_params or {}))
        self.model2 = ExplainableBoostingRegressor(**(self.view2_params or {}))
        self.model1.fit(X1, y)
        self.model2.fit(X2, y)

        # Per-view predictions used to fit the combination weights
        pred1 = self.model1.predict(X1)
        pred2 = self.model2.predict(X2)

        # Define the cooperative loss function
        def cooperative_loss(weights):
            scaled1 = weights[0] * pred1
            scaled2 = weights[1] * pred2
            fit_term = 0.5 * np.sum((y - scaled1 - scaled2) ** 2)
            # The agreement term must depend on the weights; otherwise it is
            # a constant and has no effect on the optimum
            agreement_term = 0.5 * self.rho * np.sum((scaled1 - scaled2) ** 2)
            return fit_term + agreement_term

        # Initial weights (evenly distributed)
        initial_weights = np.array([0.5, 0.5])

        # Minimize the cooperative loss to find the best weights
        result = minimize(cooperative_loss, initial_weights, method='L-BFGS-B', bounds=[(0, 1), (0, 1)])
        self.weights_ = result.x

        return self

    def predict(self, X1, X2):
        # Predict using the individual EBMs
        pred1 = self.model1.predict(X1)
        pred2 = self.model2.predict(X2)

        # Combine predictions using the optimized weights
        return self.weights_[0] * pred1 + self.weights_[1] * pred2
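
Re-weighting two independently fitted models only rescales their outputs; the per-view fits themselves never see the agreement penalty. An alternative is to alternate between the views: with one view held fixed, the proposed objective reduces to a least-squares fit of the other view to the target (y - (1 - rho) * other_pred) / (1 + rho). The sketch below is a minimal illustration of that loop, assuming plain linear least squares as a stand-in for each view's EBM. `fit_linear` and `cooperative_fit` are hypothetical names, and interpret does not expose this training loop.

```python
import numpy as np

def fit_linear(X, target):
    """Least-squares stand-in for fitting one view's additive model."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

def cooperative_fit(A, B, y, rho, n_iter=50):
    """Alternate between the two views, each fit to its adjusted target."""
    coef_a = np.zeros(A.shape[1])
    coef_b = np.zeros(B.shape[1])
    for _ in range(n_iter):
        # With view B held fixed, the objective in view A is least squares
        # on (y - (1 - rho) * pred_b) / (1 + rho), and vice versa
        coef_a = fit_linear(A, (y - (1 - rho) * (B @ coef_b)) / (1 + rho))
        coef_b = fit_linear(B, (y - (1 - rho) * (A @ coef_a)) / (1 + rho))
    return coef_a, coef_b

# Synthetic, noise-free data used only to exercise the loop
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
B = rng.normal(size=(200, 2))
y = A @ np.array([1.0, -2.0, 0.5]) + B @ np.array([3.0, 1.0])

coef_a0, coef_b0 = cooperative_fit(A, B, y, rho=0.0)
coef_a1, coef_b1 = cooperative_fit(A, B, y, rho=1.0)
```

In this stand-in, rho=0.0 behaves like backfitting a single model over both views, while rho=1.0 makes each view fit y / 2 independently of the other. Whether the same loop is practical with boosted EBMs in place of `fit_linear` is an open question.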

Some Practical Rationale for Cooperative Learning

Utilizing cooperative learning, researchers can harness various data views for enhanced predictions and insights. In the context of our air quality problem, objectives include:

Creating a model that uses meteorological data to estimate pollution levels, recognizing the influence of weather.
Ensuring consistent predictions across meteorological and pollutant models to pinpoint when pollution poses heightened health risks.
Making broad comparative analyses of the relevance of different factors to air quality and public health.
Providing detailed analysis of how the interplay between meteorological and socioeconomic factors affects health outcomes.

paulbkoch commented 11 months ago

Hi @fsaforo1 -- This is very interesting. To be honest, I don't fully understand the implications of this cooperative objective vs training all the features in a single model. If you're interested in meeting with us and discussing it, send us an email at interpret@microsoft.com

Just a couple of quick thoughts that come to mind:

1) Unless you are using an identity link function, you probably want to apply the link function to the predictions returned from self.model.predict, then find the optimized weights in the additive domain. During predict you'll want to apply the link function again, re-weight the predictions, then reapply the inverse link function.

2) merge_ebms does indeed currently require identical feature sets, but it does not require identical additive terms. One quick hack to make this work would be to build the two EBMs using a superset of the features. You can use the "exclude" parameter of the __init__ function to exclude the B terms from the A model that you build, and vice versa. You'll also need to exclude all the possible interaction terms between features that you don't want to cross-contaminate. Another tricky aspect is that when you merge EBMs where some of the terms are missing in the other model, it currently assumes the term values on the other EBM are essentially zero, which means averaging will decrease their contribution, whereas for this merge you want them to maintain their full strength. You can fix this issue by scaling the models prior to merging by a factor of 2.0, given they share the same 'y' in this example. (see: https://interpret.ml/docs/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.scale)

3) We also expose a "measure_interactions" function that allows you to customize interaction detection. This might be useful if you want to customize the interaction detection to allow pairs across the A/B feature separation. You can then re-train your models while specifying the interactions explicitly. https://interpret.ml/docs/measure_interactions.html
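
The link-function round trip in point 1) can be sketched as follows. This assumes a binary classifier with a logit link whose predicted probabilities are already available as arrays; `logit`, `inverse_logit`, and `combine_probabilities` are illustrative helpers, not interpret functions. A regressor with an identity link would not need this step.

```python
import numpy as np

def logit(p, eps=1e-12):
    """Link function: map probabilities to the additive (log-odds) domain."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def inverse_logit(z):
    """Inverse link: map log-odds back to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def combine_probabilities(proba1, proba2, weights):
    """Re-weight two models' predicted probabilities in the additive domain."""
    scores = weights[0] * logit(proba1) + weights[1] * logit(proba2)
    return inverse_logit(scores)

# Toy probabilities standing in for the two models' outputs
proba1 = np.array([0.9, 0.2, 0.5])
proba2 = np.array([0.8, 0.4, 0.5])
combined = combine_probabilities(proba1, proba2, weights=(0.5, 0.5))
```

Weighting in the log-odds domain keeps the combined output a valid probability for any weights, which averaging raw probabilities with unconstrained weights would not guarantee.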

paulbkoch commented 11 months ago

@fsaforo1, you might find this other thread regarding reweighing terms interesting https://github.com/interpretml/interpret/issues/460

hoangthienan95 commented 11 months ago

@fsaforo1 You might be interested in this package for multi-view/multi-modal data: https://mvlearn.github.io/. Maybe you can use EBM as the model for each of the 3 views you mentioned, then train those 3 EBM models using mvlearn so they are trained in a way that accounts for complementary views with differing statistical properties.
