mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
111 stars 27 forks source link

Number of significant genes different with several runs of Pyseer #264

Open Samriddhi0906 opened 8 months ago

Samriddhi0906 commented 8 months ago

After running Pyseer using

pyseer --phenotypes phenotypes.tsv --pres gene_presence_absence.Rtab --similarity phylogeny_similarity.tsv --lmm --covariates covariates.tsv --use-covariates 2 --cpu 8 > $1

and then filtering for significant genes using lrt-pvalue < 0.05 the number of significant genes varies between pyseer runs even though none of the input files have any changes.

In total 7 runs with covariates were run. Within these the lowest number of significant genes is 1245, the highest is 1395. Also, each run has a different number of significant genes.

The expectation would be that each run has the same number of significant genes. When filtering for filter-pvalue <0.05 the number of significant genes is constant.

Additionally, the number of significant genes after using covariates is about twice the number of significant genes without covariates (based on lrt-pvalue, however, they are the same when filtering using filter-pvalue).

Could you help me understand whether this behaviour is expected when running pyseer? Thanks in advance.

mgalardini commented 8 months ago

That comes a bit of a surprise, and this is not what we see in our unit tests, which return the same results every time. One thing I can think of is some stochasticity introduced when using multiple cores. Do you see the same variability when using a single core?

As an aside, a p-value threshold of 0.05 is likely too high, please refer to the docs for suggestions about setting such threshold.

Samriddhi0906 commented 8 months ago

Thanks for your response. I did run it three times with 1 CPU and I still get variable results. wicovariates_cpu1_1.tsv: 6268 wicovariates_cpu1_2.tsv: 6345 wicovariates_cpu1_3.tsv: 6357

As for the p-value threshold, this is just for filtering and comparison to see whether I am getting variable results between runs. For my analysis, I correct it for multiple testing before taking any further steps.

mgalardini commented 4 months ago

Hi, just checking in to see if you are still observing this behavior? I might be able to run some tests if so

Samriddhi0906 commented 4 months ago

Hi,

Yes, I just ran it again and got similar results. If I do not use covariates, the number of significant variants is consistent, but when I do, they vary between runs. Also, the number of significant variants found is higher when using covariates than without.

File wicovariates1.txt: 1487 significant variants File wicovariates2.txt: 1401 significant variants File wicovariates3.txt: 1401 significant variants File wicovariates4.txt: 1468 significant variants File wicovariates5.txt: 1495 significant variants File wocovariates1.txt: 1148 significant variants File wocovariates2.txt: 1148 significant variants File wocovariates3.txt: 1148 significant variants

My script:

When using covariates:

pyseer --phenotypes phenotype.tsv \ --pres gene_presence_absence.Rtab \ --similarity phylogeny_similarity.tsv \ --lmm \ --covariates covariates.tsv \ --use-covariates 2 \ --cpu 1 > $1

Without covariates:

pyseer --phenotypes phenotype.tsv \ --pres gene_presence_absence.Rtab \ --similarity phylogeny_similarity.tsv \ --lmm \ --cpu 8 > $1

mgalardini commented 4 months ago

Could you please share your inputs please? This way I'm sure to be able to reproduce this behavior

Samriddhi0906 commented 4 months ago

Sure!

covariates.tsv https://drive.google.com/file/d/1QWxjEMPi7miMkinCX0mo9X2gmpyCyWal/view?usp=drive_web gene_presence_absence.Rtab https://drive.google.com/file/d/1DPcipJXydIfYLHVv0VNesW0sh2Vpa6wk/view?usp=drive_web phenotype_health_1.tsv https://drive.google.com/file/d/1gqz4MUD1gXNUwN-Ds6ZDjdS_J_2-uf44/view?usp=drive_web phylogeny_similarity.tsv https://drive.google.com/file/d/10ysZUFq1e8pNR3ALB-8901q5mTGF2O3S/view?usp=drive_web

On Thu, 18 Jul 2024 at 14:47, Marco Galardini @.***> wrote:

Could you please share your inputs please? This way I'm sure to be able to reproduce this behavior

— Reply to this email directly, view it on GitHub https://github.com/mgalardini/pyseer/issues/264#issuecomment-2236410926, or unsubscribe https://github.com/notifications/unsubscribe-auth/A7AHEG7AU66S4SH5YZ3VDB3ZM62OHAVCNFSM6AAAAABLADYKG2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZWGQYTAOJSGY . You are receiving this because you authored the thread.Message ID: @.***>