SnpSift CaseControl - statistics clarity

You can use the "debug" command line option to have a slightly more detailed explanation on how this is calculated:

$ java -jar SnpSift.jar \
        caseControl \
        -d \
        "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++0000000000000000000000000000-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" \
        1_55102593_G_A_example.vcf \
        2>&1 | tee 1_55102593_G_A_example.out

How to read the debug output

Looking at the output file (1_55102593_G_A_example.out) you'll see:

# Note: Edited for readability
...
# VCF header (removed)
...

# This section shows whether a sample is 'case' 'control' or ignored:
    Sample  Case
    SEV_39795_PN124_PCSK9-4002  true
    SEV_39796_PN126_PCSK9-4001  true
    SEV_39798_PN128_PCSK9-4004  true
    SEV_39799_PN129_PCSK9-4003  true
    SEV_39800_PN130_PCSK9-4504  true
    SEV_39801_PN132_PCSK9-4501  true
    SEV_39803_PN135_PCSK9-4503  true
    SEV_39813_PN143_APOcIII-42  true
...

The next part, shows the input VCF line, followed by genotype (GT) counts for 'cases' and 'controls'. Note that the counts are summarized as [Hom_Ref , Het, Hom_Alt]

SnpSiftCmdCaseControl.annotate(66): chr1    196830683   1:55102593:G:A;rs140799744;915864   G   A   315903.0    PASS    AC=185;AF=0.336;AN=550;BaseQRankSum=-0.161;CSQ=A|intron_variant|MODIFIER||ENSMFAG00000039896|Transcript|ENSMFAT0000001
    Sample:   0 Type: Case      GT code: +0 nCases: [  1,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:46,0:46:48:.:.:0,48,720
    Sample:   1 Type: Case      GT code: +0 nCases: [  2,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:90,0:90:99:.:.:0,120,1800
    Sample:   2 Type: Case      GT code: +0 nCases: [  3,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:33,0:33:99:.:.:0,99,1302
    Sample:   3 Type: Case      GT code: +0 nCases: [  4,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:75,0:75:99:.:.:0,120,1800
    Sample:   4 Type: Case      GT code: +0 nCases: [  5,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:75,0:75:99:.:.:0,120,1800
    Sample:   5 Type: Case      GT code: +0 nCases: [  6,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:53,0:53:99:.:.:0,118,1800
    Sample:   6 Type: Case      GT code: +0 nCases: [  7,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:56,0:56:99:.:.:0,111,1665
    Sample:   7 Type: Case      GT code: +0 nCases: [  8,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:11,0:11:12:.:.:0,12,180
    Sample:   8 Type: Case      GT code: +0 nCases: [  9,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:28,0:28:30:.:.:0,30,450
...

Immediately after that, the detailed calculations are shown for Cochran-Armitage

SnpSiftCmdCaseControl.pTrend(393):  CochranArmitageTest.p([ 46, 144, 0 ], [ 41, 16, 0 ], [  0        ,  1.000    ,  2.000     ]) = 1.8563650616698624E-11
SnpSiftCmdCaseControl.pGenotypic(299):  pGenotypic: NaN ChiSquaredDistribution(2).cumulativeProbability(chi2): NaN

This is followed by Fisher exact tests for Allelic, Dominant, and Recessive modalities :

SnpSiftCmdCaseControl.pAllelic(172):    pAllelic: 4.7409354467764136E-7 FisherExactTest.pValueDown(16, 494, 114, 160, 1.0): 4.7409354467764136E-7   R: phyper( 16, 114, 380, 160, lower.tail = TRUE )   FisherExactTest.pValueUp(16, 494, 114, 160, 1.0): 0.999999878657337 R: phyper( 15, 114, 380, 160, lower.tail = FALSE )
SnpSiftCmdCaseControl.pDominant(244):   pDominant: 9.886717315139457E-11    FisherExactTest.pValueDown(16, 247, 57, 160, 1.0): 9.886717315139457E-11    R: phyper( 16, 57, 190, 160, lower.tail = TRUE )    FisherExactTest.pValueUp(16, 247, 57, 160, 1.0): 0.9999999999879926 R: phyper( 16, 57, 190, 160, lower.tail = FALSE )
SnpSiftCmdCaseControl.pRecessive(321):  pRecessive: 1.0 FisherExactTest.pValueDown(0, 247, 57, 0, 1.0): 1.0 R: phyper( 0, 57, 190, 0, lower.tail = TRUE )   FisherExactTest.pValueUp(0, 247, 57, 0, 1.0): 1.0   R: phyper( 0, 57, 190, 0, lower.tail = FALSE )

Explanation of one of the lines (the one you were interested in):

SnpSiftCmdCaseControl.pDominant(244):   # This is the exact location of the line of code, in case you want to read the source code for more details.
    pDominant: 9.886717315139457E-11                                            # p-value from Fisher exact test. This is the minimum p-value from Lower and Upper tail
    FisherExactTest.pValueDown(16, 247, 57, 160, 1.0): 9.886717315139457E-11    # Arguments and function used to calculate the "lower tail" p-value
    R: phyper( 16, 57, 190, 160, lower.tail = TRUE )                            # This is an equivalent R function to calculate the same (so you can check the calculation using R)
    FisherExactTest.pValueUp(16, 247, 57, 160, 1.0): 0.9999999999879926         # Same for the "upper tail"
    R: phyper( 16, 57, 190, 160, lower.tail = FALSE )                           # R function to calculate the upper tail p-value```

Interpretation of the results

The result of the lower tail p-value is 9.8e-11, let's check using R

R
> phyper( 16, 57, 190, 160, lower.tail = TRUE )
[1] 9.886717e-11

well, that looks the same to me, so it seems that the value is correct.

Explanation with the "urn analogy" (I'm following the analogy of black/white balls is from R's phyper documentation): phyper( 16, 57, 190, 160, lower.tail = TRUE ):

There are 57 cases ("the number of white balls in the urn")
There are 190 controls ("the number of black balls in the urn")
There are 160 mutations ("the number of balls drawn from the urn")
Out of those, only 16 are cases ("number of white balls drawn from the urn")

Under normal circumstances the number of "white balls drawn form the urn" (cases) you'd expect would be:

    57
---------  * 160 = 36.7
57 + 190

so, you'd expect to retrieve 37 mutations in the cases, but there are only 16, this is extremely surprising (i.e. very low p-value).

For instance, if "cases" denote a disease, you could interpret that the mutation is conferring "resistance to the disease", i.e. people having the mutation are less prone to have the disease, and that's why it would be found in many "controls" and only in a few "cases".

I hope this helps.

Pablo

pcingola / SnpSift

SnpSift CaseControl - statistics clarity #57

How to read the debug output

Interpretation of the results