mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Number of significant genes different with several runs of Pyseer #264

Open Samriddhi0906 opened 4 months ago

Samriddhi0906 commented 4 months ago

After running Pyseer using

pyseer --phenotypes phenotypes.tsv --pres gene_presence_absence.Rtab --similarity phylogeny_similarity.tsv --lmm --covariates covariates.tsv --use-covariates 2 --cpu 8 > $1

and then filtering for significant genes using lrt-pvalue < 0.05, the number of significant genes varies between pyseer runs even though none of the input files have changed.

In total, 7 runs with covariates were performed. Across these, the number of significant genes ranges from 1245 to 1395, and no two runs give the same number.

I would expect each run to give the same number of significant genes. When filtering on filter-pvalue < 0.05 instead, the number of significant genes is constant across runs.
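For anyone reproducing the comparison, here is a minimal sketch of the counting step described above. It assumes the pyseer output is a tab-separated file with a header row containing `lrt-pvalue` and `filter-pvalue` columns; the function name `count_significant` and the tiny table are made up for illustration and are not part of pyseer.

```python
import csv
import io


def count_significant(handle, column="lrt-pvalue", threshold=0.05):
    """Count rows whose p-value in `column` is strictly below `threshold`."""
    reader = csv.DictReader(handle, delimiter="\t")
    count = 0
    for row in reader:
        try:
            p = float(row.get(column, ""))
        except (TypeError, ValueError):
            continue  # skip empty, missing, or non-numeric cells
        if p < threshold:
            count += 1
    return count


# Made-up table, only to show the call; not real pyseer output.
example = (
    "variant\tfilter-pvalue\tlrt-pvalue\n"
    "geneA\t0.01\t0.20\n"
    "geneB\t0.03\t0.04\n"
    "geneC\t0.50\t0.001\n"
)
print(count_significant(io.StringIO(example), "lrt-pvalue"))
print(count_significant(io.StringIO(example), "filter-pvalue"))
```

For a real result file, pass an open file handle instead, for example `open("run1.tsv", newline="")`, where `run1.tsv` is a placeholder name.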

Additionally, the number of significant genes after using covariates is about twice the number of significant genes without covariates (based on lrt-pvalue, however, they are the same when filtering using filter-pvalue).

Could you help me understand whether this behaviour is expected when running pyseer? Thanks in advance.

mgalardini commented 4 months ago

That comes as a bit of a surprise, and it is not what we see in our unit tests, which return the same results every time. One thing I can think of is some stochasticity introduced when using multiple cores. Do you see the same variability when using a single core?

As an aside, a p-value threshold of 0.05 is likely too high; please refer to the docs for suggestions on setting such a threshold.
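As a rough illustration of why 0.05 is lenient when thousands of genes are tested, the sketch below computes a Bonferroni-style threshold. This is one common correction and is an assumption here, not necessarily the exact procedure the pyseer docs recommend; the number of tests (10000) is a made-up value.

```python
def bonferroni_threshold(alpha, n_tests):
    """Return the per-test threshold alpha / n_tests (Bonferroni correction)."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Hypothetical example: 10000 tests at a family-wise alpha of 0.05.
print(bonferroni_threshold(0.05, 10000))
```

With many tests the per-test threshold becomes several orders of magnitude smaller than 0.05, so far fewer genes pass it.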

Samriddhi0906 commented 3 months ago

Thanks for your response. I did run it three times with 1 CPU and I still get variable results:

wicovariates_cpu1_1.tsv: 6268
wicovariates_cpu1_2.tsv: 6345
wicovariates_cpu1_3.tsv: 6357
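To see which genes move across the threshold between runs, rather than only comparing counts, something like the following sketch may help. It assumes tab-separated output with `variant` and `lrt-pvalue` columns; the helper names `significant_variants` and `unstable_variants` and the two small tables are hypothetical.

```python
import csv
import io


def significant_variants(handle, column="lrt-pvalue", threshold=0.05, key="variant"):
    """Return the set of variant names whose p-value in `column` is below `threshold`."""
    selected = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            p = float(row.get(column, ""))
        except (TypeError, ValueError):
            continue  # skip empty, missing, or non-numeric cells
        if p < threshold:
            selected.add(row[key])
    return selected


def unstable_variants(sets):
    """Variants that are significant in at least one run but not in every run."""
    if not sets:
        return set()
    return set().union(*sets) - set.intersection(*sets)


# Two made-up runs, only to show the call; not real pyseer output.
run1 = io.StringIO("variant\tlrt-pvalue\ngeneA\t0.010\ngeneB\t0.049\n")
run2 = io.StringIO("variant\tlrt-pvalue\ngeneA\t0.012\ngeneB\t0.051\n")
per_run = [significant_variants(run1), significant_variants(run2)]
print(sorted(unstable_variants(per_run)))
```

If the unstable genes all have p-values close to the threshold, the variation is probably small numerical noise; large jumps in p-value for the same gene would point to something else.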

As for the p-value threshold, this is just for filtering and comparison, to see whether I am getting variable results between runs. For my analysis, I correct for multiple testing before taking any further steps.