Closed waalkes closed 8 months ago
What is the lrt-pvalue of those genes that you say are being excluded? And if they are being excluded, how could you know their filter-pvalue (unless you computed it yourself)? Perhaps they fail the allele frequency filtering?
Some examples would help, thanks
There is no lrt-pvalue (maybe the problem?) Here is part of the output:
variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes cluster-503 0.00E+00 af-filter (37 other like these here) cluster-7757 0.00E+00 af-filter cluster-6 7.34E-04 1.96E-01 lrt-filtering-failed,bad-chisq cluster-12 7.83E-03 9.32E-06 9.54E-01 -6.17E-03 1.06E-01 9.08E-04 cluster-20 3.18E-03 9.66E-04 8.22E-01 -1.01E-01 4.51E-01 3.51E-03 bad-chisq cluster-25 3.04E-01 5.03E-60 6.72E-01 -1.08E-02 2.54E-02 6.63E-03 removed 10 for simplification cluster-50 9.91E-02 1.80E-03 9.89E-01 1.18E-03 8.90E-02 2.08E-04 cluster-51 1.17E-01 1.27E-02 lrt-filtering-failed cluster-53 1.15E-01 6.61E-03 5.94E-01 3.44E-02 6.44E-02 8.35E-03 cluster-87 4.94E-02 5.36E-32 lrt-filtering-failed
can you send a tsv? it is rather difficult to read your table. at any rate when you see af-filter
is because the gene did not pass the allele frequency threshold
230926_all_HIGH_burden_AF_noambig.txt
attached. here was the output:
Read 4087 phenotypes Detected binary phenotype Setting up LMM Similarity matrix has dimension (5006, 5006) Analysing 4087 samples found in both phenotype and similarity matrix h^2 = 1.00 No observations of cluster-503 in selected samples No observations of cluster-505 in selected samples No observations of cluster-1165 in selected samples No observations of cluster-1631 in selected samples No observations of cluster-1632 in selected samples No observations of cluster-1922 in selected samples No observations of cluster-2013 in selected samples No observations of cluster-2446 in selected samples No observations of cluster-2643 in selected samples No observations of cluster-2762 in selected samples No observations of cluster-2917 in selected samples No observations of cluster-3004 in selected samples No observations of cluster-3034 in selected samples No observations of cluster-3105 in selected samples No observations of cluster-3114 in selected samples No observations of cluster-3138 in selected samples No observations of cluster-3249 in selected samples No observations of cluster-3378 in selected samples No observations of cluster-3720 in selected samples No observations of cluster-3996 in selected samples No observations of cluster-4088 in selected samples No observations of cluster-4298 in selected samples No observations of cluster-4300 in selected samples No observations of cluster-4392 in selected samples No observations of cluster-4414 in selected samples No observations of cluster-5143 in selected samples No observations of cluster-5319 in selected samples No observations of cluster-6258 in selected samples No observations of cluster-6259 in selected samples No observations of cluster-6524 in selected samples No observations of cluster-6713 in selected samples No observations of cluster-7092 in selected samples No observations of cluster-7531 in selected samples No observations of cluster-7571 in selected samples No observations of cluster-7719 in selected samples No observations of cluster-7757 in selected samples 1597 loaded variants 36 pre-filtered variants 1561 tested variants 1597 printed variants
With the tsv, any idea what it going on here?
Let me know if you want me to transfer the files to you to see if you can run this yourself. The similarity file is huge but you could recreate from the tree file like this:
python ~/pyseer/scripts/phylogeny_distance.py --lmm 230922_panaroo_core_aln_iqtree.nwk > 230925_panaroo_core_aln_iqtree_phylogeny_dists.tsv
Here are the file sizes: -rw-rw-r-- 1 waalkes domain^users 227222 Sep 25 06:47 230922_panaroo_core_aln_iqtree.nwk -rw-rw-r-- 1 waalkes domain^users 38821 Sep 25 07:07 all_HIGH_pyseer_burden_file.txt -rw-rw-r-- 1 waalkes domain^users 1663140 Sep 27 12:05 all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz -rw-rw-r-- 1 waalkes domain^users 113 Sep 27 12:06 all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz.tbi
I figured this out. There were a number of ambiguous genotypes and that was what was causes the error. I changed all .:'s to 0:'s in my vcf and the lrt errors went away. I don't know if there is a bug here or not but at least the error message could be improved? You can close this. Thanks
Thanks for detailing your solution. I suspect the many different formats of VCF make it difficult for us to give good error messages in every case, but we'll keep an eye out for similar problems and implement a better error if this occurs again.
I am doing burden testing. here is my latest commandline:
pyseer --lmm --phenotypes all.pheno --similarity 230925_panaroo_core_aln_iqtree_phylogeny_dists.tsv --min-af .0000000000001 --max-missing .9999999 --print-filtered --max-af .9999999 --burden all_HIGH_pyseer_burden_file.txt --vcf all_cluster_HIGH_sorted_AAA_normed_noambig.vcf.gz --cpu 32 > 230926_all_HIGH_burden_AF_noambig.txt 2>nohup_HIGH_AF_noambig.out
I have 4087 samples. I am using this version of pyseer: pyseer 1.3.11
Can you explain what might exclude a variant based on its lrt value when it has a very low filter-pvalue? The docs say the --lrt-pvalue default is 1 which seems like it should keep everything.
--lrt-pvalue LRT_PVALUE Likelihood ratio test pvalue threshold [Default: 1]
I have filter-pvalues as low as 2.95E-73 that are still excluded based on lrt-value.
I have done other analyses that show some of these genes as good possible candidates.
Any help would be appreciated. Thanks for your efforts on pyseer.