guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Support batch & stream optimal binning. Scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

P-values of variables #224

Open nauanchik opened 1 year ago

nauanchik commented 1 year ago

Dear Guillermo,

Thank you for the library, it really does help at my work.

However, I wonder how I can get the p-values of all explanatory variables once the logistic regression has been fitted and the binning process is complete. I'd like an output similar to the statsmodels library.
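For illustration, this is roughly the kind of summary I mean, fitting the WoE-transformed features with statsmodels directly (variable names here are just placeholders, not optbinning API):

import statsmodels.api as sm

# X_woe: WoE-transformed training features, y: binary target (placeholders)
X_const = sm.add_constant(X_woe)
logit_results = sm.Logit(y, X_const).fit()
print(logit_results.summary())   # coefficients, standard errors, z-statistics and p-values
print(logit_results.pvalues)     # p-values as a pandas Series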

guillermo-navas-palencia commented 1 year ago

Hi @nauanchik,

This feature would be an interesting addition to the scorecard class. Would you be interested in implementing it and submitting a PR?

nauanchik commented 1 year ago

Unfortunately, I am not skilled enough at coding to help you add the feature to the library. As for the PR, I always recommend this library to colleagues at my workplace because it saves an enormous amount of time.

guillermo-navas-palencia commented 1 year ago

Ok, no worries. I will find the time to implement it.

jnsofini commented 1 year ago

@guillermo-navas-palencia If you are still looking for some support, I can work on this feature of getting p-values. Let me know and I can start looking into it.

guillermo-navas-palencia commented 1 year ago

Thanks @jnsofini. That would be great!

detrin commented 1 year ago

This would be a great enhancement.

jnsofini commented 11 months ago

@guillermo-navas-palencia I thought about implementing this directly in the code. I also read about why p-values are not included in the scikit-learn library, and why implementing them there is not considered a wise decision, for the following reasons.

  1. We would have many other estimators without p-values, such as decision-tree-based classifiers, so having this only for logistic regression would not be helpful.
  2. It is unclear how to calculate p-values when regularization such as l1 or l2 is used in scikit-learn. This might mislead users, since the results can differ from what they expect due to the use of the Fisher information.

As a result, I am considering putting my results as a tutorial on the Optbinning page. Please let me know what your thoughts are.
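For reference, the code below computes the standard Wald statistics from the Fisher information of the logistic log-likelihood (my notation, not anything from the library): the coefficient covariance is estimated as $\widehat{\mathrm{Var}}(\hat\beta) = (X^\top W X)^{-1}$ with $W = \mathrm{diag}\big(p_i(1-p_i)\big)$, the z-score is $z_j = \hat\beta_j \big/ \sqrt{[(X^\top W X)^{-1}]_{jj}}$, and the two-sided p-value is $2\,\Phi(-|z_j|)$.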

Here is the code. I can use it to build a scorecard with Optbinning and then provide summary statistics of the p-values of the coefficients.

from sklearn.linear_model import LogisticRegression
import scipy.stats as stat
import numpy as np
import pandas as pd

class LogisticRegressionPValues(LogisticRegression):
    """Logistic regression model with p-value computation for coefficients and z-score statistics.

    This class extends scikit-learn's LogisticRegression to include the computation of p-values
    and z-scores for the coefficients after fitting the logistic regression model.
    """

    def fit(self, X, y, **kwargs):
        """
        Fit the logistic regression model and compute p-values and z-scores for the coefficients.

        Parameters:
            X (array-like or sparse matrix): Training data.
            y (array-like): Target values.
            **kwargs: Additional keyword arguments to pass to the base LogisticRegression.fit().

        Returns:
            self: Fitted estimator.
        """
        super().fit(X, y, **kwargs)
        self.p_values, self.z_scores = self.get_pvalues(X)
        return self

    def get_pvalues(self, X):
        """
        Compute the p-values and z-scores for the fitted model.

        Parameters:
            X (array-like or sparse matrix): Training data.

        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        return self.get_stats(self.decision_function(X), X, self.coef_[0])

    @staticmethod
    def get_stats(decision_boundary, X, coef):
        """
        Compute the p-values and z-scores for the fitted model.

        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.
            coef (array-like): Model coefficients.

        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        cramer_rao = LogisticRegressionPValues.fisher_matrix(decision_boundary, X)
        sigma_estimates = np.sqrt(np.diagonal(cramer_rao))
        z_scores = coef / sigma_estimates  # Z-score for each model coefficient
        p_values = [stat.norm.sf(abs(z)) * 2 for z in z_scores]  # Two-tailed test for p-values

        return p_values, z_scores

    @staticmethod
    def fisher_matrix(decision_boundary, X):
        """
        Compute the Fisher Information Matrix for the logistic regression model.

        Parameters:
            decision_boundary (array-like): Decision function values for the training data.
            X (array-like or sparse matrix): Training data.

        Returns:
            cramer_rao (array-like): Inverse Information Matrix (Cramer-Rao).
        """
        denom = (2.0 * (1.0 + np.cosh(decision_boundary)))
        denom = np.tile(denom, (X.shape[1], 1)).T
        fisher_matrix = np.dot((X / denom).T, X)  # Fisher Information Matrix
        cramer_rao = np.linalg.inv(fisher_matrix)  # Inverse Information Matrix

        return cramer_rao

    def z_statistics(self):
        """
        Return a DataFrame containing z-statistics, p-values, and coefficients for each feature.

        Returns:
            pd.DataFrame: DataFrame containing z-statistics, p-values, and coefficients for each feature.
                Columns: ["Feature", "Coef", "z-score", "p-values"]
        """
        return pd.DataFrame(
            zip(self.feature_names_in_, self.coef_[0], self.z_scores, self.p_values),
            columns=["Feature", "Coef", "z-score", "p-values"]
        )
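
A minimal usage sketch (synthetic, already WoE-transformed features; names and data are placeholders, not from optbinning); the fitted estimator can then be passed to a scorecard wherever a plain LogisticRegression would be used:

import numpy as np
import pandas as pd

# Synthetic stand-in for WoE-transformed features (illustrative only).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["woe_a", "woe_b", "woe_c"])
y_train = (X_train["woe_a"] + 0.5 * X_train["woe_b"] + rng.normal(size=500) > 0).astype(int)

# penalty=None (scikit-learn >= 1.2; use penalty="none" on older versions) avoids
# regularization, which would otherwise distort the Wald statistics.
clf = LogisticRegressionPValues(penalty=None, max_iter=1000)
clf.fit(X_train, y_train)
print(clf.z_statistics())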