Open nauanchik opened 1 year ago
Hi @nauanchik,
This feature would be an interesting addition to the scorecard class. Would you be interested in implementing it and PR?
Unfortunately, I am not skilled enough in coding to help you add the feature to the library. As for the PR, I always recommend this library to my colleagues at my workplace because it saves an enormous amount of time.
Ok, no worries. I will find the time to implement it.
@guillermo-navas-palencia If you are still looking for some support, I can work on this feature of getting p-values. Let me know and I can start looking into it.
Thanks @jnsofini. That would be great!
This would be a great enhancement.
@guillermo-navas-palencia I thought about implementing this directly in the code. I also read about why it is not included in the scikit-learn library and why implementing it there is not considered a wise decision. This is for the following reasons.
As a result, I am considering putting my results as a tutorial on the Optbinning page. Please let me know what your thoughts are.
Here is the code. I can use it to build a scorecard with Optbinning and then provide summary stats of the p-values of the coefficients.
from sklearn.linear_model import LogisticRegression
import scipy.stats as stat
import numpy as np
import pandas as pd


class LogisticRegressionPValues(LogisticRegression):
    """Logistic regression model with p-value and z-score computation
    for the coefficients.

    This class extends scikit-learn's LogisticRegression to compute
    p-values and z-scores for the coefficients after fitting the
    logistic regression model.
    """

    def fit(self, X, y, **kwargs):
        """Fit the logistic regression model and compute p-values and
        z-scores for the coefficients.

        Parameters:
            X (array-like or sparse matrix): Training data.
            y (array-like): Target values.
            **kwargs: Additional keyword arguments passed to
                LogisticRegression.fit().
        """
        super().fit(X, y, **kwargs)
        self.p_values, self.z_scores = self.get_pvalues(X)
        return self

    def get_pvalues(self, X):
        """Compute the p-values and z-scores for the fitted model.

        Parameters:
            X (array-like or sparse matrix): Training data.

        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        return self.get_stats(self.decision_function(X), X, self.coef_[0])

    @staticmethod
    def get_stats(decision_boundary, X, coef):
        """Compute the p-values and z-scores from the decision function.

        Parameters:
            decision_boundary (array-like): Decision function values for
                the training data.
            X (array-like or sparse matrix): Training data.
            coef (array-like): Model coefficients.

        Returns:
            p_values (list): Two-tailed p-values for each model coefficient.
            z_scores (array-like): Z-scores for each model coefficient.
        """
        cramer_rao = LogisticRegressionPValues.fisher_matrix(decision_boundary, X)
        sigma_estimates = np.sqrt(np.diagonal(cramer_rao))
        z_scores = coef / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(z)) * 2 for z in z_scores]  # two-tailed test
        return p_values, z_scores

    @staticmethod
    def fisher_matrix(decision_boundary, X):
        """Compute the inverse Fisher Information Matrix for the model.

        Parameters:
            decision_boundary (array-like): Decision function values for
                the training data.
            X (array-like or sparse matrix): Training data.

        Returns:
            cramer_rao (array-like): Inverse Information Matrix (Cramer-Rao).
        """
        # 1 / (2 * (1 + cosh(z))) equals p * (1 - p) for p = sigmoid(z),
        # so this builds X^T W X with W = diag(p * (1 - p)).
        denom = 2.0 * (1.0 + np.cosh(decision_boundary))
        denom = np.tile(denom, (X.shape[1], 1)).T
        fisher_matrix = np.dot((X / denom).T, X)  # Fisher Information Matrix
        cramer_rao = np.linalg.inv(fisher_matrix)  # Inverse Information Matrix
        return cramer_rao

    def z_statistics(self):
        """Return a DataFrame with coefficients, z-scores, and p-values.

        Returns:
            pd.DataFrame: One row per feature.
                Columns: ["Feature", "Coef", "z-score", "p-values"]
        """
        return pd.DataFrame(
            zip(self.feature_names_in_, self.coef_[0], self.z_scores, self.p_values),
            columns=["Feature", "Coef", "z-score", "p-values"],
        )
Dear Guillermo,
Thank you for the library, it really does help at my work.
However, I wonder how I can get the p-values of all explanatory variables once the logistic regression has been fitted and the binning process is complete? I'd like to have an output similar to the statsmodels library.