mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

ValueError when input has variants that don't occur in a single sample #153

Open julibeg opened 3 years ago

julibeg commented 3 years ago

When fitting an elastic net on .Rtab files that have columns with only zeros, pyseer gives the warning No observations of [variant] in selected samples, but the program completes. I therefore assumed it would simply ignore those variants. However, when running SEER, scipy.stats.chi2_contingency() throws ValueError: The internally computed table of expected frequencies has a zero element at (0, 0).

Although it seems obvious in hindsight, at first glance it was not clear to me that those variants caused the error and that I should remove them from the input. If this behaviour (i.e. throwing an error rather than filtering those variants internally) is intended, could the warning be changed to state more explicitly that such variants are not allowed?
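For readers hitting the same error, here is a minimal sketch of how an all-zero variant produces it. It does not reproduce pyseer's internals: the sample data and the contingency_table helper are hypothetical, and only scipy.stats.chi2_contingency is the real call named above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical data for six samples: 1 = present, 0 = absent.
phenotype = np.array([1, 1, 1, 0, 0, 0])
variant = np.array([0, 0, 0, 0, 0, 0])  # variant observed in no sample


def contingency_table(variant, phenotype):
    """Build a 2x2 table: rows = variant present/absent, columns = phenotype 1/0.

    Hypothetical helper for illustration; not pyseer's own code.
    """
    return np.array([
        [np.sum((variant == 1) & (phenotype == 1)),
         np.sum((variant == 1) & (phenotype == 0))],
        [np.sum((variant == 0) & (phenotype == 1)),
         np.sum((variant == 0) & (phenotype == 0))],
    ])


table = contingency_table(variant, phenotype)  # first row is all zeros

try:
    chi2_contingency(table)
    error_message = None
except ValueError as err:
    # An empty row makes the expected frequency at (0, 0) zero,
    # which scipy rejects.
    error_message = str(err)

print(error_message)
```

An all-one variant fails the same way, except that the empty row is the "variant absent" one.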

johnlees commented 3 years ago

Right, I guess this can happen if the frequency filtering is turned off. I added the 'no observations' warning because when the sample labels are mismatched, every variant triggers it and it's obvious something is wrong. Without the warning, and with typical MAF filters, pyseer would just run through and ignore every variant, leaving you with an empty output.

We should probably skip these all 0/all 1 variants explicitly rather than leaving the allele filtering to sort them out, though. I think we already do that in the elastic net anyway.
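As a rough illustration of the skip suggested above (a sketch only, not pyseer's implementation; the is_invariant function and the example variant names are hypothetical), invariant presence/absence vectors could be dropped before any test is run:

```python
import numpy as np


def is_invariant(variant):
    """Return True if a presence/absence vector is all 0 or all 1.

    Hypothetical helper for illustration; not part of pyseer.
    """
    variant = np.asarray(variant)
    return bool(np.all(variant == 0) or np.all(variant == 1))


# Hypothetical presence/absence vectors keyed by variant name.
variants = {
    "gene_a": [0, 0, 0, 0],  # never observed: skip
    "gene_b": [1, 0, 1, 0],  # informative: keep
    "gene_c": [1, 1, 1, 1],  # observed in every sample: skip
}

kept = {name: calls for name, calls in variants.items()
        if not is_invariant(calls)}
print(list(kept))
```

Filtering up front like this avoids building a contingency table with an empty row, independently of whether a MAF filter is enabled.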

julibeg commented 3 years ago

Ah, I didn't realise that this usually wouldn't happen due to the MAF filters. Thanks for clarifying!