vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0
31 stars 4 forks source link

Fisher exact test pvalue and multiple testing #139

Open freedomq8 opened 4 years ago

freedomq8 commented 4 years ago

Hi there,

I am using the fisher exact test following the instructions on http://varianttools.sourceforge.net/Association/ExactTest . I am using it for autosomal and mitochondrial variants and wanted to know if the p-value under prop_pval column is it adjusted for multiple testing and/or Bonferroni corrected ? Thanks

gaow commented 4 years ago

@freedomq8 all association tests are nominal p-values. You can take them and do your own multiple testing control (Bonferroni or FDR)

freedomq8 commented 4 years ago

Thanks for the quick response. The way I correct it is dividing 0.05/total number of variants and that would be the threshold. A follow up question; I want to do fisher exact test on mitochondrial variants (mitochondria is haploid although heteroplasmy do exists) and wanted to know if the current fisher test in vtools can be used for that or I need to change amend it ?

thanks

gaow commented 4 years ago

Current Fisher test assumes additive model so it essentially just counts minor alleles. As long as for a variant under the null your samples are homogeneous in terms of ploidy (ie, for the same variant they are all haploid) it should be fine. For case with heteroplasmy I'm thinking maybe those are the events you are interested in detecting -- I'm curious what'd the numeric coding for genotypes be like for a sample with different mtDNA allele for a variant, due to heteroplasmy. what does the data look like?

freedomq8 commented 4 years ago

thanks for getting back to me. looking at the equation of the fisher test: Fisher_exact(num_var_alleles_case, num_var_alleles_ctrl, 2num_gt_case, 2num_gt_ctrl). its counting the number of variant alleles for both cases and control. so the question is since mitochondrial variants are majority homozygous as its haploid wouldn't that effect the p value? and yes all the samples should be homogeneous so doesn't that mean it should be fine ? heteroplasmpy variants are not of interest of my fields, these can be not true due to mapping/caller tool issues or true.

`

chr pos ref alt maf refGene_name2 het hom mut_type num_var_alleles_case num_gt_case num_var_alleles_ctrl num_gt_ctrl prop_pval hom_case het_case hom_ctrl het_ctrl total
MT 6419 A C 0.035484 . 22 0 . 14 532 8 88 0.007624 0 14 0 8 620
MT 7521 G A 0.107492 . 12 27 . 57 527 9 87 1 24 9 3 3 614
MT 146 T C 0.291935 . 3 89 . 141 532 40 88 0.001735 70 1 19 2 620
MT 16183 - C 0.038596 . 10 6 . 16 488 6 82 0.11482 4 8 2 2 570

` data sorted by heteroplasmy so you can have an idea but they are not the common but they do exists. On igv they would look like regular autosomal heterozgyous some reads wildtype some they have snp. heteroplasmy can be inherited but with different percentages in mitochondrial population or acquired with age.

freedomq8 commented 4 years ago

it would be a nice feature do include for mitochondria a multivariate regression for association test. Our station is doing this tests on R but i have to change the VCF genotype and none of the tests results in signficant results including fisher test. But i do get significant results with vtools some am not sure where causing this; is it the way he is doing analysis or is it fisher test in vtools not optimised for mitochondrial variants

gaow commented 4 years ago

and yes all the samples should be homogeneous so doesn't that mean it should be fine ?

You mean the genotype coding is either 0 or 2 for haploid, and is 0, 1 or 2 for heteroplasmpy? then that's problematic for haploid, because you count each haplotype twice. It in effect doubled your sample size in a biased way. For example:

> fisher.test(matrix(c(0,10,10,10),2,2))$p.value
[1] 0.01099374
> fisher.test(matrix(c(0,5,5,5),2,2))$p.value
[1] 0.1008991

You see p-values are different. I think you need to do something like hom_count/2 to use with your Fisher's test. For heteroplasmpy it should work as is although you are not interested in them. So maybe even filter those variants before you apply /2 or ignore them when you interpret the results?

include for mitochondria a multivariate regression for association test.

what do you mean by multivariate regression -- you mean multiple regression? For regression analysis allele coding will not matter to p-value although your regression coefficient will have different interpretation.

freedomq8 commented 4 years ago

This is the vcf for the mitochondria variants called with varscan, gatk haplocaller, and strelka the have the same format of vcf. below a snp where some samples are wildtype, missing,homoplasmy,heterplasmy Example 1: `

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SRR2125286 | SRR2125287 | SRR2130735 | SRR2130776 | SRR2130779 | SRR2130862

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- MT | 57 | . | T | TC,C | 115249.7 | . | AC=85,4;AF=0.050,2.331e-03;AN=1716;BaseQRankSum=-1.479e+00;ClippingRankSum=0.00;DP=32708;ExcessHet=0.0000;FS=1.413;InbreedingCoeff=0.9862;MLEAC=86,4;MLEAF=0.050,2.331e-03;MQ=55.62;MQRankSum=0.831;NDA=2;QD=30.83;ReadPosRankSum=0.466;SOR=1.077 | GT:AD:DP:GQ:PGT:PID:PL | ./.:24,0,0:24:.:.:.:0,0,0,0,0,0 | 0/0:47,0,0:47:99:.:.:0,106,1440,106,1440,1440 | ./.:44,0,0:44:.:.:.:0,0,0,0,0,0 | 1/1:0,65,0:65:99:1|1:57_T_TC:2908,202,0,2908,202,2908 | 2/2:0,0,115:115:99:1|1:55_T_C:5175,5175,5175,346,346,0 | 0/1:9,17,0:26:99:0|1:57_T_TC:527,0,243,554,294,849

` homoplasmy 1/1 , heteroplasmy 0/1 (or 0/2 ,1/2 for allele with two alternate alleles for more the one sample) and wildtype 0/0 and missing ./.

Example 2: another caller tool gatk mutect2 used but use its with vtools yet has a different VCF format: homoplasmy and heteroplasmy have the same genotype code 0/1 or 0|1 and wild type 0/0 and missing ./.:

so will example 1 is compatible with current fisher test or I need to adjust for it. and any feedback on example 2 with vtools?

freedomq8 commented 4 years ago

for example 1: if i need to change the forumla for haploid, what shall i change in the codes below ?

`vtools update variant --from_stat 'total=#(GT)' 'num=#(alt)' 'het=#(het)' 'hom=#(hom)' 'other=#(other)' 'minDP=min(DP_geno)' 'maxDP=max(DP_geno)' 'meanDP=avg(DP_geno)' 'maf=maf()' 'missing=#(missing)'

vtools update variant --from_stat 'num_gt_case=#(GT)' 'num_var_alleles_case=#(alt)' --samples phenotype=1

vtools update variant --from_stat "num_gt_ctrl=#(GT)" "num_var_alleles_ctrl=#(alt)" --samples phenotype=0

vtools update variant --from_stat 'hom_case=#(hom)' 'het_case=#(het)' --samples phenotype=1

vtools update variant --from_stat "hom_ctrl=#(hom)" "het_ctrl=#(het)" --samples phenotype=0

vtools update variant --set "prop_pval=Fisher_exact(num_var_alleles_case, num_var_alleles_ctrl, 2num_gt_case, 2num_gt_ctrl)"`

Also here is the top p-value with their genotype and size number: `

chr pos ref alt maf refGene_name2 het hom mut_type num_var_alleles_case num_gt_case num_var_alleles_ctrl num_gt_ctrl prop_pval hom_case het_case hom_ctrl het_ctrl total
MT 279 T C 0.009649 . 1 5 . 11 377 0 763 4.93E-06 5 1 0 0 1140
MT 15169 A G 0.020833 . 0 12 . 18 379 6 773 2.67E-05 9 0 3 0 1152
MT 5501 A G 0.034662 . 0 20 . 26 379 14 775 4.88E-05 13 0 7 0 1154

` the size of samples is correct Appreciate your help and feedback

freedomq8 commented 4 years ago

Regardin the regression test its Multivariate (adjusting for covariate i.e sex, age and haplogroups) logistic regression analysis. used in this paper (4th figure)

https://www.researchgate.net/publication/230666692_Mitochondrial_DNA_Coding_and_Control_Region_Variants_as_Genetic_Risk_Factors_for_Type_2_Diabetes/figures?lo=1