`LinAlgError: Singular matrix` in fitting SEER model only occurs with LRR as intercept in `start_vec`

julibeg commented 3 years ago

I ran a SEER model and got 'matrix-inversion-error' for some variants that were actually the top hits when using other methods. Also, the phenotype and their respective genotype vectors were fairly dissimilar (they were all quite sparse though). I did not have the time to look deeper into it, but weirdly mod.fit() in https://github.com/mgalardini/pyseer/blob/21ad8f2f00eb8b17e4de55a1a7f4954be7f3ddf6/pyseer/model.py#L303

only throws LinAlgError: Singular matrix when providing the LLR of the phenotype calculated in https://github.com/mgalardini/pyseer/blob/21ad8f2f00eb8b17e4de55a1a7f4954be7f3ddf6/pyseer/model.py#L299

as initial guess. When choosing another initial guess (or leaving start_params=None) the fit works just fine.

For now, I have simply commented out line #299 as a workaround, which comes with a performance penalty, but gives essentially the same results for all other variants and no matrix-inversion-errors for the 'weird' ones. You could re-arrange the try-block starting at line #283 and run the fit in the except statement with start_params=None to achieve close to the original performance (or use Firth regression -- from the comments in the code it seems like this was the original intention to do for 'nearly separable values' anyway).

johnlees commented 3 years ago

Thanks for the detailed report. I think when we first wrote these models we played with a few different start points: it was difficult to pick something which always worked and was fast for all the possible data points. I also think that our activation of Firth regression remains imperfect.

I'll have a look into adding a try/except to fit again with no intercept as the starting point, or Firth regression (which is slow, but seems very reliable)

julibeg commented 3 years ago

I deleted my previous comment about the runtime when using Firth for all cases in question because my benchmarking was flawed.

julibeg commented 3 years ago

I had another (admittedly cursory) look at this and couldn't really figure out what sets the variants that triggered the matrix-inversion-error apart from the others. However, I tried two simple fixes:

wrap res = mod.fit(start_params=start_vec, ...) in (yet another) try block and call res = mod.fit(start_params=None, ...) in except np.linalg.linalg.LinAlgError (i.e. try the fit again but with zeros as default starting params).
move except np.linalg.linalg.LinAlgError further up to the 'inner' try (i.e. right after except statsmodels.tools.sm_exceptions.PerfectSeparationError in line 324) and set bad_chisq = True in the except block so that Firth regression is performed (just like for perfectly separable data).

Performance-wise there is not much of a difference (since the weird variants are rare anyway). Interestingly, Firth regression actually seems to be slightly faster than mod.fit() with zeros as initial guess. However, the resulting p-values and betas are slightly different.

You can have a look at the code for both variants here (option 1) and here (option 2). If you'd like to go for one, let me know and I make a PR.

johnlees commented 3 years ago

It probably makes sense to have Firth regression as our general fall-back, so I'd be happy to merge option 2 in

mgalardini commented 3 years ago

Thanks a lot for all these contributions, very much appreciated

julibeg commented 3 years ago

I'm a big fan of pyseer, so happy to help!

mgalardini / pyseer

`LinAlgError: Singular matrix` in fitting SEER model only occurs with LRR as intercept in `start_vec` #137