Closed by julibeg 3 years ago
Thanks for the detailed report. I think when we first wrote these models we played with a few different start points: it was difficult to pick something which always worked and was fast for all the possible data points. I also think that our activation of Firth regression remains imperfect.
I'll have a look into adding a try/except to fit again with no intercept as the starting point, or Firth regression (which is slow, but seems very reliable)
I deleted my previous comment about the runtime when using Firth for all cases in question because my benchmarking was flawed.
I had another (admittedly cursory) look at this and couldn't really figure out what sets the variants that triggered the matrix-inversion-error apart from the others. However, I tried two simple fixes:
1. Put `res = mod.fit(start_params=start_vec, ...)` in (yet another) `try` block and call `res = mod.fit(start_params=None, ...)` in `except np.linalg.linalg.LinAlgError` (i.e. try the fit again but with zeros as default starting params).
2. Move `except np.linalg.linalg.LinAlgError` further up to the 'inner' `try` (i.e. right after `except statsmodels.tools.sm_exceptions.PerfectSeparationError` in line 324) and set `bad_chisq = True` in the `except` block so that Firth regression is performed (just like for perfectly separable data).
Performance-wise there is not much of a difference (since the weird variants are rare anyway). Interestingly, Firth regression actually seems to be slightly faster than `mod.fit()` with zeros as initial guess. However, the resulting p-values and betas are slightly different.
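
To make the difference between the two fixes concrete, here is a minimal sketch of just the control flow. It is not pyseer's code: `fit_start`, `fit_zeros` and `fit_firth` are hypothetical callables standing in for `mod.fit(start_params=start_vec, ...)`, `mod.fit(start_params=None, ...)` and the Firth routine, and `SingularMatrix` is a stand-in for `np.linalg.linalg.LinAlgError` so the snippet runs without NumPy or statsmodels.

```python
class SingularMatrix(Exception):
    """Stand-in for np.linalg.LinAlgError (assumption: keeps the sketch dependency-free)."""


def option_1(fit_start, fit_zeros):
    """Option 1: on a singular matrix, retry the same fit from the default start."""
    try:
        return fit_start(), "start_vec"
    except SingularMatrix:
        return fit_zeros(), "zeros"


def option_2(fit_start, fit_firth):
    """Option 2: on a singular matrix, set the flag that triggers Firth regression."""
    bad_chisq = False
    res = None
    try:
        res = fit_start()
    except SingularMatrix:
        bad_chisq = True  # same flag that is set for perfectly separable data
    if bad_chisq:
        return fit_firth(), "firth"
    return res, "start_vec"


def failing_fit():
    """Hypothetical fit that fails the way the 'weird' variants do."""
    raise SingularMatrix("Singular matrix")


if __name__ == "__main__":
    print(option_1(failing_fit, lambda: "refit from zeros"))
    print(option_2(failing_fit, lambda: "firth fit"))
    print(option_1(lambda: "first fit", lambda: "never called"))
```

In both variants the fast path is untouched when the first fit succeeds. They differ only in which estimator handles the rare failures, which is presumably why the p-values and betas for those variants differ slightly.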
You can have a look at the code for both variants here (option 1) and here (option 2). If you'd like to go for one, let me know and I'll make a PR.
It probably makes sense to have Firth regression as our general fall-back, so I'd be happy to merge option 2 in
Thanks a lot for all these contributions, very much appreciated
I'm a big fan of pyseer, so happy to help!
I ran a SEER model and got `matrix-inversion-error` for some variants that were actually the top hits when using other methods. Also, the phenotype and their respective genotype vectors were fairly dissimilar (they were all quite sparse though). I did not have the time to look deeper into it, but weirdly `mod.fit()` in https://github.com/mgalardini/pyseer/blob/21ad8f2f00eb8b17e4de55a1a7f4954be7f3ddf6/pyseer/model.py#L303 only throws `LinAlgError: Singular matrix` when providing the LLR of the phenotype calculated in https://github.com/mgalardini/pyseer/blob/21ad8f2f00eb8b17e4de55a1a7f4954be7f3ddf6/pyseer/model.py#L299 as initial guess. When choosing another initial guess (or leaving `start_params=None`) the fit works just fine.
For now, I have simply commented out line #299 as a workaround, which comes with a performance penalty, but gives essentially the same results for all other variants and no `matrix-inversion-error`s for the 'weird' ones. You could re-arrange the `try`-block starting at line #283 and run the fit in the `except` statement with `start_params=None` to achieve close to the original performance (or use Firth regression; from the comments in the code it seems like this was the original intention to do for 'nearly separable values' anyway).
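
As a toy illustration of how the same data can fit from one starting point and fail from another, here is a small Newton-Raphson logistic fit in plain Python. It is only a sketch under stated assumptions and not a diagnosis of the pyseer variants: the data are made up, `newton_logit` and `SingularMatrix` are hypothetical names, and the failure shown (a start vector that saturates every predicted probability, so the information matrix is exactly zero) is just one generic way to hit a singular matrix.

```python
import math


class SingularMatrix(Exception):
    """Stand-in for numpy's LinAlgError (assumption: avoids a NumPy dependency)."""


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def newton_logit(x, y, start=None, n_iter=25):
    """Fit y ~ intercept + slope * x by Newton-Raphson; return (intercept, slope)."""
    b0, b1 = start if start is not None else (0.0, 0.0)
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)        # per-sample weight in the information matrix
            g0 += yi - p             # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise SingularMatrix("Singular matrix")
        # Newton step using the explicit inverse of the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up sparse genotype and binary phenotype (not perfectly separable)
x = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]

if __name__ == "__main__":
    print("default start:", newton_logit(x, y))
    try:
        newton_logit(x, y, start=(40.0, 0.0))
    except SingularMatrix as err:
        print("saturating start:", err)
```

With the default start the fit converges in a few iterations. With an intercept of 40 every predicted probability rounds to exactly 1.0, all weights become zero, and the very first Newton step fails, even though nothing is wrong with the data.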