rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Expose or calculate standard errors for logistic regression (and other GLMs) #3523

Open beckernick opened 3 years ago

beckernick commented 3 years ago

I'd like to be able to access the standard errors of coefficients for logistic regression and other GLMs. One way to compute the standard errors could be to take the square root of the diagonal of the covariance matrix. It might be nice to expose the standard errors in a friendly way for users.
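For example, given the coefficient covariance matrix, the standard errors fall out directly; a minimal sketch with a made-up covariance matrix:

import numpy as np

# Hypothetical 3x3 covariance matrix of three fitted coefficients
cov_mat = np.array([
    [0.0088, 0.0002, 0.0001],
    [0.0002, 0.0071, 0.0003],
    [0.0001, 0.0003, 0.0580],
])

# One standard error per coefficient: the square root of the diagonal
std_errors = np.sqrt(np.diag(cov_mat))
print(std_errors)  # [0.0938... 0.0843... 0.2408...]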

JohnZed commented 3 years ago

Would a wrapper for bootstrap standard errors/confidence intervals be appealing, or do you only want the analytical version? The bootstrap approach would generalize to many estimators and has very nice statistical properties.

beckernick commented 3 years ago

Agreed. Bootstrapped standard errors would be very appealing and appropriate, subject to performance considerations.
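For instance, a minimal bootstrap sketch, assuming a scikit-learn-style estimator (the helper name, resample count, and dataset here are illustrative, and the default l2 penalty is left in place for brevity):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def bootstrap_coef_stderr(X, y, n_boot=200, seed=0):
    """Hypothetical helper: bootstrap standard errors for logistic coefficients."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        coefs.append(clf.coef_.ravel())
    # Standard error of each coefficient = std dev across bootstrap refits
    return np.std(coefs, axis=0, ddof=1)


X, y = make_classification(n_samples=2000, n_features=4, random_state=0)
print(bootstrap_coef_stderr(X, y))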

beckernick commented 3 years ago

After doing some research, we should be able to back out the covariance matrix, and thus the analytical standard errors, from the fitted logit model's predicted probabilities on the original data. That is likely sufficient for some use cases.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import statsmodels.api as sm


np.random.seed(12)
X, y = make_classification(
    n_samples=5000,
    n_features=8,
    n_redundant=0,
)

X_mat = sm.add_constant(X)
model = sm.Logit(y, X_mat)
results = model.fit()

clf = LogisticRegression(
    penalty="none",
    fit_intercept=True,
    solver="newton-cg",
)
clf.fit(X, y)

# p * (1 - p) for each observation, from the fitted probabilities
p_one_minus_p_probs = np.prod(clf.predict_proba(X), axis=1).reshape(-1, 1)
# Covariance matrix = inverse of the Fisher information X' diag(p(1-p)) X
cov_mat = np.linalg.pinv(np.dot((X_mat * p_one_minus_p_probs).T, X_mat))
std_errors = np.sqrt(np.diag(cov_mat))

print(results.summary())
print(round(clf.intercept_[0], 4), round(std_errors[0], 3))

# std_errors[0] belongs to the intercept, so skip it when pairing with coef_
for coef, stde in zip(clf.coef_.ravel(), std_errors[1:]):
    print(round(coef, 4), round(stde, 3))
Optimization terminated successfully.
         Current function value: 0.104385
         Iterations 10
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 5000
Model:                          Logit   Df Residuals:                     4991
Method:                           MLE   Df Model:                            8
Date:                Tue, 02 Mar 2021   Pseudo R-squ.:                  0.8494
Time:                        09:38:00   Log-Likelihood:                -521.92
converged:                       True   LL-Null:                       -3465.7
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.7617      0.094      8.077      0.000       0.577       0.946
x1            -0.0353      0.084     -0.420      0.675      -0.200       0.130
x2            -0.0274      0.087     -0.315      0.753      -0.198       0.143
x3             0.0138      0.082      0.169      0.866      -0.147       0.174
x4             0.0556      0.082      0.682      0.495      -0.104       0.215
x5             6.1217      0.241     25.383      0.000       5.649       6.594
x6             0.0032      0.086      0.038      0.970      -0.166       0.173
x7            -2.0569      0.122    -16.879      0.000      -2.296      -1.818
x8             0.0094      0.082      0.115      0.908      -0.151       0.170
==============================================================================

Possibly complete quasi-separation: A fraction 0.17 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
0.7617 0.094
-0.0353 0.084
-0.0274 0.087
0.0138 0.082
0.0556 0.082
6.1217 0.241
0.0032 0.086
-2.0569 0.122
0.0094 0.082

This is with scikit-learn, but cuML would behave the same. A different solver would produce slightly different numbers, but they would be equally valid.

JohnZed commented 3 years ago

Looks good... here is a similar CuPy version that could be optimized further (especially to reduce memory usage from temporaries) but is fairly reasonable:

import cuml
import cupy as cp
import cupyx.scipy.sparse as cu_sparse


def cuml_logistic_stderr(model, X, dtype="float32"):
    """
    Returns standard errors of estimated coefficients of a logistic regression model
    Does NOT support weights yet (but neither does cuml.LogisticRegression)

    TODO: ensure that regularization is handled correctly.

    Parameters
    ==========
    model: Previously-fit instance of cuml.LogisticRegression
    X:     Numpy or cupy array of features used to fit model
    """
    assert isinstance(model, cuml.LogisticRegression)
    X_cp = cp.asarray(X, dtype=dtype)

    # Compute the Hessian of the log-likelihood: H = -X' diag(p(1-p)) X
    with cuml.using_output_type("cupy"):
        y_hat = model.predict_proba(X_cp)[:, 1]
    p = y_hat * (1.0 - y_hat)

    p_mat = cu_sparse.spdiags(p, 0, y_hat.size, y_hat.size)
    H = -X_cp.T.dot(p_mat.dot(X_cp))

    # TODO regularization: H += 2.0/model.C if penalty == l2
    # TODO: same but with (2.0/model.C) * (1 - model.l1_ratio) for elastic

    # Stderr is sqrt of main diagonal of inverse neg hessian
    h_inv_diag = cp.diag(cp.linalg.pinv(-H))
    cp_stderr = cp.sqrt(h_inv_diag)

    return cp_stderr
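
As a usage sketch, assuming cuML is installed (the no-intercept and large-C settings are illustrative choices, since the helper does not add a constant column to X):

import cuml
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=8, random_state=12)

# Illustrative settings: no intercept so the stderr entries map 1:1 onto the
# columns of X, and a very large C to approximate an unpenalized fit
model = cuml.LogisticRegression(fit_intercept=False, C=1e9)
model.fit(X.astype("float32"), y.astype("float32"))

stderr = cuml_logistic_stderr(model, X)
print(stderr)  # one standard error per feature column of X
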
github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.