Open ktmeaton opened 1 month ago
Thanks @ktmeaton for the detailed explanation, reproducible example, and likely fix -- this is all really really helpful!
Strictly, p-values have different interpretations with different N, so dropping some values could be misleading. But this is commonly done in GWAS so I think your solution is sensible. Alternatives are setting all missing data to:
Major (i.e. 1 if AF > 0.5, 0 otherwise) is my usual preference due to simplicity and accuracy over the first two.
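For reference, major-allele imputation as described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `impute_major` and the encoding of missing calls as NaN are hypothetical and not taken from pyseer's code.

```python
import numpy as np

def impute_major(calls):
    """Replace missing calls (NaN) with the major allele: 1 if AF > 0.5, else 0.

    Hypothetical helper for illustration only; not part of pyseer.
    """
    calls = np.asarray(calls, dtype=float)
    observed = ~np.isnan(calls)
    # Allele frequency is estimated from the observed samples only.
    af = calls[observed].mean() if observed.any() else 0.0
    major = 1.0 if af > 0.5 else 0.0
    # Keep observed calls, fill the missing ones with the major allele.
    return np.where(observed, calls, major)

variant = [1, 1, np.nan, 0, 1, np.nan]  # AF among observed samples = 3/4
imputed = impute_major(variant)
print(imputed)
```

Unlike dropping, this keeps N the same for every variant, at the cost of adding imputed values to the data.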
@mgalardini what do you think? We could also provide an option for this behaviour. But my feeling is we should just choose between dropping missing values as suggested, or imputing the major allele (which I think is what the fixed effects model does?)
If given the option, I would lean towards dropping missing values as the default. Simply because that will mean all missing data is treated the same (dropped) as opposed to adding in the imputation as another source of variability that might complicate interpretations. I think dropping missing values is also what scoary does?
When running the `lmm` model to detect lineage effects, I sometimes get a `statsmodels` error because the `endog` variable contains `nan` values:
It seems to happen when the Rtab observations include missing data for certain variants ("."). I can fix the error by changing the following line from:
https://github.com/mgalardini/pyseer/blob/4b8d22f43bc5943483d9a54df1e22c6a35cd0121/pyseer/model.py#L181
to:
To Reproduce
I'm running `pyseer v1.3.12` from conda. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial, and I'm looking for both locus effects and lineage effects.
And here is the output and traceback:
Once I add the `missing='drop'` parameter to the `Logit` model, it finishes successfully without errors:
Is this error reproducible for you, and does the suggested fix make sense?
Thanks, Katherine