mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
104 stars 25 forks source link

many lrt-filtering-failed with very low filter-pvalues #249

Closed waalkes closed 8 months ago

waalkes commented 9 months ago

I am doing burden testing. here is my latest commandline:

pyseer --lmm --phenotypes all.pheno --similarity 230925_panaroo_core_aln_iqtree_phylogeny_dists.tsv --min-af .0000000000001 --max-missing .9999999 --print-filtered --max-af .9999999 --burden all_HIGH_pyseer_burden_file.txt --vcf all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz --cpu 32 > 230926_all_HIGH_burden_AF_noambig.txt 2>nohup_HIGH_AF_noambig.out

I have 4087 samples. I am using this version of pyseer: pyseer 1.3.11

Can you explain what might exclude a variant based on its lrt value when it has a very low filter-pvalue? The docs say the --lrt-pvalue default is 1 which seems like it should keep everything.

--lrt-pvalue LRT_PVALUE Likelihood ratio test pvalue threshold [Default: 1]

I have filter-pvalues as low as 2.95E-73 that are still excluded based on lrt-value.

I have done other analyses that show some of these genes as good possible candidates.

Any help would be appreciated. Thanks for your efforts on pyseer.

mgalardini commented 9 months ago

What is the lrt-pvalue of those genes that you say are being excluded? And if they are being excluded, how could you know their filter-pvalue (unless you computed it yourself)? Perhaps they fail the allele frequency filtering?

Some examples would help, thanks

waalkes commented 9 months ago

There is no lrt-pvalue (maybe the problem?) Here is part of the output:

variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes cluster-503 0.00E+00 af-filter (37 other like these here) cluster-7757 0.00E+00 af-filter cluster-6 7.34E-04 1.96E-01 lrt-filtering-failed,bad-chisq cluster-12 7.83E-03 9.32E-06 9.54E-01 -6.17E-03 1.06E-01 9.08E-04 cluster-20 3.18E-03 9.66E-04 8.22E-01 -1.01E-01 4.51E-01 3.51E-03 bad-chisq cluster-25 3.04E-01 5.03E-60 6.72E-01 -1.08E-02 2.54E-02 6.63E-03 removed 10 for simplification cluster-50 9.91E-02 1.80E-03 9.89E-01 1.18E-03 8.90E-02 2.08E-04 cluster-51 1.17E-01 1.27E-02 lrt-filtering-failed cluster-53 1.15E-01 6.61E-03 5.94E-01 3.44E-02 6.44E-02 8.35E-03 cluster-87 4.94E-02 5.36E-32 lrt-filtering-failed

mgalardini commented 9 months ago

can you send a tsv? it is rather difficult to read your table. at any rate when you see af-filter is because the gene did not pass the allele frequency threshold

waalkes commented 9 months ago

230926_all_HIGH_burden_AF_noambig.txt

attached. here was the output:

Read 4087 phenotypes Detected binary phenotype Setting up LMM Similarity matrix has dimension (5006, 5006) Analysing 4087 samples found in both phenotype and similarity matrix h^2 = 1.00 No observations of cluster-503 in selected samples No observations of cluster-505 in selected samples No observations of cluster-1165 in selected samples No observations of cluster-1631 in selected samples No observations of cluster-1632 in selected samples No observations of cluster-1922 in selected samples No observations of cluster-2013 in selected samples No observations of cluster-2446 in selected samples No observations of cluster-2643 in selected samples No observations of cluster-2762 in selected samples No observations of cluster-2917 in selected samples No observations of cluster-3004 in selected samples No observations of cluster-3034 in selected samples No observations of cluster-3105 in selected samples No observations of cluster-3114 in selected samples No observations of cluster-3138 in selected samples No observations of cluster-3249 in selected samples No observations of cluster-3378 in selected samples No observations of cluster-3720 in selected samples No observations of cluster-3996 in selected samples No observations of cluster-4088 in selected samples No observations of cluster-4298 in selected samples No observations of cluster-4300 in selected samples No observations of cluster-4392 in selected samples No observations of cluster-4414 in selected samples No observations of cluster-5143 in selected samples No observations of cluster-5319 in selected samples No observations of cluster-6258 in selected samples No observations of cluster-6259 in selected samples No observations of cluster-6524 in selected samples No observations of cluster-6713 in selected samples No observations of cluster-7092 in selected samples No observations of cluster-7531 in selected samples No observations of cluster-7571 in selected samples No observations of cluster-7719 in selected samples No observations of cluster-7757 in selected samples 1597 loaded variants 36 pre-filtered variants 1561 tested variants 1597 printed variants

waalkes commented 8 months ago

With the tsv, any idea what it going on here?

waalkes commented 8 months ago

Let me know if you want me to transfer the files to you to see if you can run this yourself. The similarity file is huge but you could recreate from the tree file like this:

python ~/pyseer/scripts/phylogeny_distance.py --lmm 230922_panaroo_core_aln_iqtree.nwk > 230925_panaroo_core_aln_iqtree_phylogeny_dists.tsv

Here are the file sizes: -rw-rw-r-- 1 waalkes domain^users 227222 Sep 25 06:47 230922_panaroo_core_aln_iqtree.nwk -rw-rw-r-- 1 waalkes domain^users 38821 Sep 25 07:07 all_HIGH_pyseer_burden_file.txt -rw-rw-r-- 1 waalkes domain^users 1663140 Sep 27 12:05 all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz -rw-rw-r-- 1 waalkes domain^users 113 Sep 27 12:06 all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz.tbi

waalkes commented 8 months ago

I figured this out. There were a number of ambiguous genotypes and that was what was causes the error. I changed all .:'s to 0:'s in my vcf and the lrt errors went away. I don't know if there is a bug here or not but at least the error message could be improved? You can close this. Thanks

johnlees commented 8 months ago

Thanks for detailing your solution. I suspect the many different formats of VCF make it difficult for us to give good error messages in every case, but we'll keep an eye out for similar problems and implement a better error if this occurs again.