jzluo / firthlogist

A Python implementation of Firth Logistic Regression
MIT License
5 stars 3 forks source link

Cannot fit data: Newton-Raphson does not converge #17

Open Hoeze opened 1 year ago

Hoeze commented 1 year ago

Hi @jzluo, I found a sample dataset which fails fitting. Consider the example data I attached here: sample.zip

#! unzip sample.zip
with open(r"X.pickle", "rb") as fd:
    X = pickle.load(fd)
with open(r"y.pickle", "rb") as fd:
    y = pickle.load(fd)

from firthlogist import FirthLogisticRegression
FirthLogisticRegression().fit(X, y)

This fails after the second round of optimization because all coefficients get NaN.

Detailed logs + stack trace:

```python3 /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:153: ConvergenceWarning: Firth logistic regression failed to converge. Try increasing max_iter. /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:2154: RuntimeWarning: invalid value encountered in det --------------------------------------------------------------------------- LinAlgError Traceback (most recent call last) Cell In[164], line 2 1 from firthlogist import FirthLogisticRegression ----> 2 FirthLogisticRegression().fit(X, y) File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:161, in FirthLogisticRegression.fit(self, X, y) 159 if not self.skip_ci: 160 if not self.wald: --> 161 self.ci_ = _profile_likelihood_ci( 162 X=X, 163 y=y, 164 fitted_coef=self.coef_, 165 full_loglik=self.loglik_, 166 max_iter=self.pl_max_iter, 167 max_stepsize=self.pl_max_stepsize, 168 max_halfstep=self.pl_max_halfstep, 169 tol=self.tol, 170 alpha=self.alpha, 171 test_vars=self.test_vars, 172 ) 173 else: 174 self.ci_ = _wald_ci(self.coef_, self.bse_, self.alpha) File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:440, in _profile_likelihood_ci(X, y, fitted_coef, full_loglik, max_iter, max_stepsize, max_halfstep, tol, alpha, test_vars) 438 U_star = np.matmul(X.T, y - preds + np.multiply(hat, 0.5 - preds)) 439 # https://github.com/georgheinze/logistf/blob/master/src/logistf.c#L780-L781 --> 440 inv_fisher = np.linalg.pinv(fisher_info_mtx) 441 tmp1x1 = U_star @ np.negative(inv_fisher) @ U_star 442 underRoot = ( 443 -2 444 * ((LL0 - loglike) + 0.5 * tmp1x1) 445 / (inv_fisher[coef_idx, coef_idx]) 446 ) File <__array_function__ internals>:180, in pinv(*args, **kwargs) File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:1998, in pinv(a, rcond, hermitian) 1996 return wrap(res) 1997 a = a.conjugate() -> 1998 u, s, vt = svd(a, full_matrices=False, hermitian=hermitian) 2000 # discard small singular values 2001 cutoff = rcond[..., newaxis] * amax(s, axis=-1, keepdims=True) File <__array_function__ internals>:180, in svd(*args, **kwargs) File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:1657, in svd(a, full_matrices, compute_uv, hermitian) 1654 gufunc = _umath_linalg.svd_n_s 1656 signature = 'D->DdD' if isComplexType(t) else 'd->ddd' -> 1657 u, s, vh = gufunc(a, signature=signature, extobj=extobj) 1658 u = u.astype(result_t, copy=False) 1659 s = s.astype(_realType(result_t), copy=False) File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:98, in _raise_linalgerror_svd_nonconvergence(err, flag) 97 def _raise_linalgerror_svd_nonconvergence(err, flag): ---> 98 raise LinAlgError("SVD did not converge") LinAlgError: SVD did not converge ```
Hoeze commented 1 year ago

@jzluo I debugged a bit and found that the issue is the determinant getting negative which results in NaN's in the penalty. I tried to fix that in #18.

Another thing I found: With FirthLogisticRegression(skip_ci=True, wald=True).fit(X, y) the fitting works. Otherwise, the regression runs in an infinite loop during CI/pvalue calculation if I do not set max_halfsteps > 0.

Still, the NR steps do not converge even after 200 iterations.

jzluo commented 1 year ago

Thanks so much! I will take a closer look tonight.

jzluo commented 1 year ago

Sorry, things have been a bit hectic for me and it may be longer before I can sit down and dig into this. Just as an initial sanity check, I tried your sample data with logistf and brglm2. Fitting fails in logistf, and the results for brglm2:

Call:
glm(formula = y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + 
    X10 + X11 + X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + 
    X20 + X21 + X22 + X23 + X24 + X25 + X26 + X27 + X28 + X29 + 
    X30 + X31 + X32 + X33 + X34, family = binomial(logit), data = data, 
    method = "brglmFit")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6170  -0.4648  -0.3425  -0.2405   3.3041  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.379e+00  1.144e-01 -38.281  < 2e-16 ***
X1          -3.372e-01  1.347e-01  -2.504 0.012296 *  
X2           1.067e-02  1.486e-03   7.180 6.95e-13 ***
X3           6.447e-04  2.326e-03   0.277 0.781616    
X4           8.806e-03  5.879e-03   1.498 0.134193    
X5           1.978e-03  6.121e-03   0.323 0.746594    
X6          -9.555e-03  5.908e-03  -1.617 0.105788    
X7           9.907e-03  4.557e-03   2.174 0.029729 *  
X8          -2.614e-03  1.972e-03  -1.325 0.185152    
X9           5.465e-03  5.678e-03   0.962 0.335837    
X10          4.911e-03  5.117e-03   0.960 0.337250    
X11          2.094e-03  5.129e-03   0.408 0.683103    
X12         -5.939e-03  2.064e-03  -2.877 0.004015 ** 
X13         -4.947e-03  4.903e-03  -1.009 0.312996    
X14          6.875e-03  3.670e-03   1.873 0.061015 .  
X15         -1.744e-02  4.920e-03  -3.544 0.000394 ***
X16         -6.035e-03  5.629e-03  -1.072 0.283699    
X17         -1.795e-02  2.999e-03  -5.986 2.15e-09 ***
X18         -2.030e-03  5.223e-03  -0.389 0.697615    
X19         -7.145e-03  3.056e-03  -2.338 0.019371 *  
X20         -3.240e-03  4.584e-03  -0.707 0.479723    
X21          2.543e-03  3.137e-03   0.811 0.417493    
X22         -3.022e-03  3.157e-03  -0.957 0.338335    
X23          2.704e-03  3.184e-03   0.849 0.395848    
X24          7.830e-03  3.291e-03   2.379 0.017356 *  
X25         -3.968e-03  3.175e-03  -1.250 0.211373    
X26         -6.351e-03  3.194e-03  -1.989 0.046745 *  
X27          5.167e-03  3.549e-03   1.456 0.145381    
X28          1.475e-03  3.346e-03   0.441 0.659453    
X29          5.652e-05  3.306e-03   0.017 0.986360    
X30          2.458e-03  3.278e-03   0.750 0.453359    
X31         -5.930e-03  3.310e-03  -1.792 0.073172 .  
X32         -1.083e-03  3.294e-03  -0.329 0.742266    
X33          4.086e-03  3.219e-03   1.269 0.204388    
X34          1.131e+05  1.224e+03  92.404  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 98119  on 161633  degrees of freedom
Residual deviance: 88017  on 161599  degrees of freedom
AIC:  88087

Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 3

Can I ask what kind of data this is?

Hoeze commented 1 year ago

Unfortunately, I cannot tell you what kind of data this is. But the fitting succeeds if you standard-normalize all features in X.

Without normalizing and using the default statsmodels logistic regression, you need L-BFGS-B to get a good fit. Newton-Raphson also fails to converge there.

So probably some L-BFGS-B implementation would help here a lot? This would also make firthlogist scale better to large numbers of variables.