mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
111 stars 27 forks source link

ValueError: endog must be in the unit interval. #280

Open ktmeaton opened 1 month ago

ktmeaton commented 1 month ago

When running the lmm model to detect lineage effects, I sometimes get a statsmodels error because the endog variable contains nan values:

ValueError: endog must be in the unit interval.

It seems to happen when the Rtab observations include missing data for certain variants ("."). I can fix the error by changing the following line from:

https://github.com/mgalardini/pyseer/blob/4b8d22f43bc5943483d9a54df1e22c6a35cd0121/pyseer/model.py#L181

to:

 lineage_mod = smf.Logit(k, X, missing='drop') 

To Reproduce

I'm running pyseer v1.3.12 from conda. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial. And I'm looking for both locus effects and lineage effects.

pyseer \
  --lmm \
  --pres variants.txt \
  --phenotypes phenotypes.txt \
  --phenotype-column penicillin \
  --cpu 1 \
  --similarity similarities.txt \
  --output-patterns penicillin.locus_effects.patterns.txt \
  --lineage \
  --lineage-clusters lineages.txt \
  --lineage-file penicillin.lineage_effects.tsv \
  --distances distances.txt

And here is the output and traceback:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/pyseer-1.3.12/bin/pyseer", line 10, in <module>
    sys.exit(main())
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/__main__.py", line 567, in main
    ret = fit_lmm(*data)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/lmm.py", line 210, in fit_lmm
    max_lineage = fit_lineage_effect(lineage_clusters,
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/model.py", line 181, in fit_lineage_effect
    lineage_mod = smf.Logit(k, X)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/statsmodels/discrete/discrete_model.py", line 479, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.

Once I add the missing='drop' parameter to the Logit model, it finishes successfully without errors:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
variant_0       6.67E-02        2.05E-01        3.30E-01        3.50E-01        3.46E-01        2.70E-01        3      bad-chisq
variant_1       2.00E-01        1.77E-02        5.86E-02        5.01E-01        2.42E-01        4.98E-01        3      bad-chisq
variant_2       3.33E-01        2.53E-02        3.93E-01        3.46E-01        3.92E-01        2.38E-01        3      bad-chisq
variant_3       4.00E-01        5.16E-03        7.87E-02        6.20E-01        3.25E-01        4.68E-01        3      bad-chisq
variant_4       1.33E-01        2.15E-01        5.57E-01        -1.67E-01       2.77E-01        1.65E-01        3      bad-chisq
variant_5       1.33E-01        2.15E-01        8.33E-01        -5.80E-02       2.69E-01        5.96E-02        3      bad-chisq
variant_6       2.00E-01        1.14E-01        7.23E-01        -9.05E-02       2.50E-01        1.00E-01        3      bad-chisq
variant_8       8.67E-01        6.28E-02        3.37E-01        -5.06E-01       5.07E-01        2.67E-01        3      bad-chisq
10 loaded variants
2 pre-filtered variants
8 tested variants
8 printed variants

Is this error reproducible for you, and does the suggested fix make sense?

Thanks, Katherine

johnlees commented 4 weeks ago

Thanks @ktmeaton for the detailed explanation, reproducible example, and likely fix -- this is all really really helpful!

Strictly, p-values have different interpretations with different N, so dropping some values could be misleading. But this is commonly done in GWAS so I think your solution is sensible. Alternatives are setting all missing data to:

Major (i.e. 1 if AF > 0.5, 0 otherwise) is my usual preference due to simplicity and accuracy over the first two.

@mgalardini what do you think? We could also provide an option for this behaviour. But my feeling is we should just choose between dropping missing values as suggested, or imputing the major allele (which I think is what the fixed effects model does?)

ktmeaton commented 1 week ago

If given the option, I would lean towards dropping missing values as the default. Simply because that will mean all missing data is treated the same (dropped) as opposed to adding in the imputation as another source of variability that might complicate interpretations. I think dropping missing values is also what scoary does?