ing-bank / probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
https://ing-bank.github.io/probatus
MIT License

PSI plots #157

Open gverbock opened 3 years ago

gverbock commented 3 years ago

I have some code that could be used to generate a summary related to PSI values (image not shown).

It contains:

I put the code below. It can serve as inspiration, but it definitely needs improvement.


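import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from feature_engine.outliers import Winsorizer
from probatus.binning import QuantileBucketer, SimpleBucketer
from probatus.stat_tests import psi

# NOTE: get_root_dir(), used below when saving figures, is a project-specific
# helper that is not defined in this snippet.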

def plot_features_psi(train, test, features, label_prefix=""):
    """Provide several plots for a list of features.

    Args:
        train (pd.DataFrame): Feature matrix for the train set.
        test (pd.DataFrame): Features matrix for the test set.
        features (list): List of features for which the plots are made.
        label_prefix (string): Prefix to add to the figure name when saving.
    """
    for feat in features:
        try:
            fig, axes = plt.subplots(2, 2, figsize=(12, 8))

            # Plot the distributions used for the PSI calculations.
            dist1, psi_1 = psi_distribution(train[[feat]], test[[feat]], fold=0.005)
            dist2, psi_2 = psi_distribution(train[[feat]], test[[feat]], buckets=QuantileBucketer(10))
            dist1.plot.bar(ax=axes[1, 0])
            dist2.plot.bar(ax=axes[1, 1])

            # Report the PSI values and plot them.
            res = pd.DataFrame({"PSI": [psi_1, psi_2], "Approach": ["Equally spaced", "Quantiles"]})
            res.groupby("Approach")["PSI"].mean().plot.bar(ax=axes[0, 1], rot=0)
            axes[0, 1].set_title("PSI values")
            print(res)

            # Plot the time series (mean)
            ts1 = feature_ts(train, test, feat)
            ts1.plot.bar(ax=axes[0, 0])
            axes[0, 0].set_title(feat)

            plt.show()
            # Save file
            fig.savefig(get_root_dir() + "/docs/assets/images/" + label_prefix + feat + ".png")
        except Exception:
            print(f"No information available for {feat}")

    return

def psi_distribution(train_feature, test_feature, fold=1e-9, buckets=SimpleBucketer(10)):
    """Compute the distribution and PSI for a given feature.

    Args:
        train_feature (pd.DataFrame): Single-column frame with the feature values for the train set.
        test_feature (pd.DataFrame): Single-column frame with the feature values for the test set.
        fold (float): Percentile to cap the right tail of the feature.
        buckets (bucketer): Bucketer used for computing the PSI.

    Returns:
        results (pd.DataFrame): Normalized distribution per bucket for both data sets.
        psi_value (float): PSI value.
    """
    # Cap the feature
    wins = Winsorizer(capping_method="quantiles", fold=fold)
    d1_wins = wins.fit_transform(train_feature)
    d2_wins = wins.transform(test_feature)

    # Compute distributions
    d1_counts = buckets.fit_compute(d1_wins.iloc[:, 0])
    d2_counts = buckets.compute(d2_wins.iloc[:, 0])
    psi_value = psi(d1_counts, d2_counts)[0]

    # Summarize results.
    results = pd.DataFrame(
        {
            "buckets": np.round(buckets.boundaries[1:], 2),
            "Reference data": d1_counts / d1_counts.sum(),
            "Measurement data": d2_counts / d2_counts.sum(),
        }
    ).set_index("buckets")

    return results, psi_value

def feature_ts(train, test, feature):
    """Combine train and test and make time series for a given feature.

    Args:
        train (pd.DataFrame): Feature matrix for the train set.
        test (pd.DataFrame): Features matrix for the test set.
        feature (string): Label of the feature.

    Returns:
        mean_ts (pd.Series): Monthly mean of the feature.
    """
    # Aggregate test and train
    combined_df = pd.concat([train, test])[[feature, "scoring_date"]]

    # Define the month (assumes scoring_date is a string such as "YYYY-MM-DD", so x[2:7] gives "YY-MM").
    time_dimension = combined_df.scoring_date.apply(lambda x: x[2:7])

    # Compute the mean value of the feature for every month.
    ts = pd.DataFrame({"Time": time_dimension, feature: combined_df[feature]})
    mean_ts = ts.groupby("Time")[feature].mean()

    return mean_ts
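
The monthly aggregation in `feature_ts` slices the date strings by position, which only works for one specific string layout. A possible alternative, sketched here under the assumption that `scoring_date` holds values pandas can parse as dates, is to group on a monthly period instead. The function name `monthly_feature_mean` and the `date_col` parameter are hypothetical and are not part of the snippet above.

```python
import pandas as pd


def monthly_feature_mean(train, test, feature, date_col="scoring_date"):
    """Return the monthly mean of `feature` over train and test combined.

    Assumes `date_col` contains values that pd.to_datetime can parse.
    """
    # Stack both sets; ignore_index avoids duplicated row labels.
    combined = pd.concat([train, test], ignore_index=True)

    # Convert each date to its calendar month (a pandas Period).
    month = pd.to_datetime(combined[date_col]).dt.to_period("M")

    # Average the feature within each month; the result is sorted by month.
    return combined.assign(month=month).groupby("month")[feature].mean()
```

The result is a `pd.Series` indexed by month, so it can be passed to `.plot.bar(ax=...)` in the same way as the output of `feature_ts`.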

ReinierKoops commented 5 months ago

PSI is better implemented in Feature-engine and has accordingly been removed from Probatus. Would you propose developing a new, updated PSI implementation for Probatus, or can we close this issue?
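
For anyone who wants to reproduce the two PSI variants from the snippet above (equally spaced versus quantile buckets) without `probatus.stat_tests.psi`, here is a minimal NumPy-only sketch. It is neither the former Probatus implementation nor the Feature-engine one. The function names `bin_counts` and `psi_from_counts` and the `eps` floor for empty buckets are assumptions made for illustration. It uses the common definition PSI = sum over buckets of (m_i - r_i) * ln(m_i / r_i), where r_i and m_i are the bucket shares of the reference and measurement data.

```python
import numpy as np


def bin_counts(reference, measurement, n_bins=10, method="equal"):
    """Count observations per bucket, with edges fitted on the reference data."""
    reference = np.asarray(reference, dtype=float)
    measurement = np.asarray(measurement, dtype=float)

    if method == "equal":
        # Equally spaced edges between the reference minimum and maximum.
        edges = np.linspace(reference.min(), reference.max(), n_bins + 1)
    elif method == "quantile":
        # Edges at equally spaced quantiles of the reference data.
        edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    else:
        raise ValueError(f"Unknown method: {method!r}")

    # Only the interior edges are used, so values outside the reference
    # range fall into the first or last bucket.
    interior = edges[1:-1]
    ref_idx = np.searchsorted(interior, reference, side="right")
    meas_idx = np.searchsorted(interior, measurement, side="right")
    return (
        np.bincount(ref_idx, minlength=n_bins),
        np.bincount(meas_idx, minlength=n_bins),
    )


def psi_from_counts(ref_counts, meas_counts, eps=1e-4):
    """Compute PSI from two arrays of bucket counts."""
    ref_pct = ref_counts / ref_counts.sum()
    meas_pct = meas_counts / meas_counts.sum()

    # Floor the shares so that empty buckets do not produce log(0).
    ref_pct = np.clip(ref_pct, eps, None)
    meas_pct = np.clip(meas_pct, eps, None)

    return float(np.sum((meas_pct - ref_pct) * np.log(meas_pct / ref_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_values = rng.normal(size=1000)
    test_values = rng.normal(loc=0.5, size=1000)
    for method in ("equal", "quantile"):
        ref, meas = bin_counts(train_values, test_values, method=method)
        print(method, psi_from_counts(ref, meas))
```

The choice of `eps` affects the result whenever a bucket is empty in one of the two samples, so values computed this way are not guaranteed to match those of other libraries.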