Open beckernick opened 3 years ago
Would a wrapper for bootstrap standard error/confidence intervals be appealing or do you only want the analytical version? The bootstrap approach could be flexible to many estimators and have very nice statistical properties.
Agreed. Bootstrapped standard errors would be very appealing and appropriate, within the context of performance considerations
After doing some research, we should be able back out the covariance matrix and thus the analytical standard errors from the logit coefficients predicting on the original data, which is likely sufficient for some use cases.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import statsmodels.api as sm
np.random.seed(12)
X, y = make_classification(
n_samples=5000,
n_features=8,
n_redundant=0,
)
X_mat = sm.add_constant(X)
model = sm.Logit(y, X_mat)
results = model.fit()
clf = LogisticRegression(
penalty="none",
fit_intercept=True,
solver="newton-cg",
)
clf.fit(X, y)
p_one_minus_p_probs = np.product(clf.predict_proba(X), axis=1).reshape(-1,1)
cov_mat = np.linalg.pinv(np.dot((X_mat * p_one_minus_p_probs).T, X_mat))
std_errors = np.sqrt(np.diag(cov_mat))
print(results.summary())
print(round(clf.intercept_[0], 4))
for coef, stde in zip(clf.coef_.ravel(), std_errors):
print(round(coef, 4), round(stde, 3))
Optimization terminated successfully.
Current function value: 0.104385
Iterations 10
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5000
Model: Logit Df Residuals: 4991
Method: MLE Df Model: 8
Date: Tue, 02 Mar 2021 Pseudo R-squ.: 0.8494
Time: 09:38:00 Log-Likelihood: -521.92
converged: True LL-Null: -3465.7
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.7617 0.094 8.077 0.000 0.577 0.946
x1 -0.0353 0.084 -0.420 0.675 -0.200 0.130
x2 -0.0274 0.087 -0.315 0.753 -0.198 0.143
x3 0.0138 0.082 0.169 0.866 -0.147 0.174
x4 0.0556 0.082 0.682 0.495 -0.104 0.215
x5 6.1217 0.241 25.383 0.000 5.649 6.594
x6 0.0032 0.086 0.038 0.970 -0.166 0.173
x7 -2.0569 0.122 -16.879 0.000 -2.296 -1.818
x8 0.0094 0.082 0.115 0.908 -0.151 0.170
==============================================================================
Possibly complete quasi-separation: A fraction 0.17 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
0.7617
-0.0353 0.094
-0.0274 0.084
0.0138 0.087
0.0556 0.082
6.1217 0.082
0.0032 0.241
-2.0569 0.086
0.0094 0.122
This is with scikit-learn, but cuml would behave the same. The different solver would result in slightly different results, but they would be equally valid.
Looks good... here is a similar cupy version that could be optimized more (esp to reduce memory usage from temporaries) but is fairly reasonable:
def cuml_logistic_stderr(model, X, dtype="float32"):
"""
Returns standard errors of estimated coefficients of a logistic regression model
Does NOT support weights yet (but neither does cuml.LogisticRegression)
TODO: ensure that regularization is handled correctly.
Parameters
==========
model: Previously-fit instance of cuml.LogisticRegression
X: Numpy or cupy array of features used to fit model
"""
assert isinstance(model, cuml.LogisticRegression)
X_cp = cp.asarray(X, dtype=dtype)
# Compute Hessian of log likelihood
with cuml.using_output_type("cupy"):
y_hat = cu_model.predict_proba(X)[:,1]
p = y_hat * (1.0 - y_hat)
p_mat = cu_sparse.spdiags(p, 0, y_hat.size, y_hat.size)
H = -X_cp.T.dot(p_mat.dot(X_cp))
# TODO regularization: H += 2.0/model.C if penalty == l2
# TODO: same but with (2.0/model.C) * (1 - model.l1_ratio) for elastic
# Stderr is sqrt of main diagonal of inverse neg hessian
h_inv_diag = cp.diag(cp.linalg.pinv(-H))
cp_stderr = cp.sqrt(h_inv_diag)
return cp_stderr
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
I'd like to be able to access the standard errors of coefficients for logistic regression and other GLMs. One way to compute the standard errors could be to take the square root of the diagonal of the covariance matrix. It might be nice to expose the standard errors in a friendly way for users.