pcingola / SnpSift

Other
35 stars 20 forks source link

SnpSift CaseControl - statistics clarity #57

Closed dtlyfoung closed 3 years ago

dtlyfoung commented 3 years ago

When running SnpSift CaseControl, we observe variants that have high statistical significance, yet, when we inspect the variant in IGV, we often see that the samples we designated as our 'controls' that contain the variant outnumbers the number of samples we designated as 'cases' that have the variant.

Attached is 1) a table of variants generated from SnpSift extractFields where we pull in the CaseControl statistics among other fields we care about, 2) a shortened VCF (in .txt format in order to upload here) for the one variant example I am highlighting and 3) a screenshot of the "Variant Attributes" of this variant in IGV.

For this particular example, I want to highlight the variant in the variant table that is highlighted in yellow (gene CFH). When sorting ascending under the CC_DOM model, it has statistical significance to the negative 11th power. We have a total of 276 samples, 57 of which are "cases" and 191 which are "controls" (the other 28 are neutral/ignored). When we look in IGV, we see that there are 16 cases and 144 controls that show this heterozygous variant. I would like to understand the statistics calculations more for the models. One would assume that if more "control" samples have the variant than in the "case" samples, the variant would not have statistical significance to the negative 11th power so I was wondering if I could get a little more clarity so I can understand CaseControl more.

If there is anything else I can provide to help with the clarification, please let me know.

casecontrol_sorted_on_DOM.xlsx

1_55102593_G_A_example.txt

igv_screenshot_cfh

pcingola commented 3 years ago

You can use the "debug" command line option to have a slightly more detailed explanation on how this is calculated:

$ java -jar SnpSift.jar \
        caseControl \
        -d \
        "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++0000000000000000000000000000-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" \
        1_55102593_G_A_example.vcf \
        2>&1 | tee 1_55102593_G_A_example.out

How to read the debug output

Looking at the output file (1_55102593_G_A_example.out) you'll see:

# Note: Edited for readability
...
# VCF header (removed)
...

# This section shows whether a sample is 'case' 'control' or ignored:
    Sample  Case
    SEV_39795_PN124_PCSK9-4002  true
    SEV_39796_PN126_PCSK9-4001  true
    SEV_39798_PN128_PCSK9-4004  true
    SEV_39799_PN129_PCSK9-4003  true
    SEV_39800_PN130_PCSK9-4504  true
    SEV_39801_PN132_PCSK9-4501  true
    SEV_39803_PN135_PCSK9-4503  true
    SEV_39813_PN143_APOcIII-42  true
...

The next part, shows the input VCF line, followed by genotype (GT) counts for 'cases' and 'controls'. Note that the counts are summarized as [Hom_Ref , Het, Hom_Alt]

SnpSiftCmdCaseControl.annotate(66): chr1    196830683   1:55102593:G:A;rs140799744;915864   G   A   315903.0    PASS    AC=185;AF=0.336;AN=550;BaseQRankSum=-0.161;CSQ=A|intron_variant|MODIFIER||ENSMFAG00000039896|Transcript|ENSMFAT0000001
    Sample:   0 Type: Case      GT code: +0 nCases: [  1,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:46,0:46:48:.:.:0,48,720
    Sample:   1 Type: Case      GT code: +0 nCases: [  2,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:90,0:90:99:.:.:0,120,1800
    Sample:   2 Type: Case      GT code: +0 nCases: [  3,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:33,0:33:99:.:.:0,99,1302
    Sample:   3 Type: Case      GT code: +0 nCases: [  4,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:75,0:75:99:.:.:0,120,1800
    Sample:   4 Type: Case      GT code: +0 nCases: [  5,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:75,0:75:99:.:.:0,120,1800
    Sample:   5 Type: Case      GT code: +0 nCases: [  6,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:53,0:53:99:.:.:0,118,1800
    Sample:   6 Type: Case      GT code: +0 nCases: [  7,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:56,0:56:99:.:.:0,111,1665
    Sample:   7 Type: Case      GT code: +0 nCases: [  8,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:11,0:11:12:.:.:0,12,180
    Sample:   8 Type: Case      GT code: +0 nCases: [  9,   0,   0] nControls: [  0,   0,   0]  GT: 0/0:28,0:28:30:.:.:0,30,450
...

Immediately after that, the detailed calculations are shown for Cochran-Armitage

SnpSiftCmdCaseControl.pTrend(393):  CochranArmitageTest.p([ 46, 144, 0 ], [ 41, 16, 0 ], [  0        ,  1.000    ,  2.000     ]) = 1.8563650616698624E-11
SnpSiftCmdCaseControl.pGenotypic(299):  pGenotypic: NaN ChiSquaredDistribution(2).cumulativeProbability(chi2): NaN

This is followed by Fisher exact tests for Allelic, Dominant, and Recessive modalities :

SnpSiftCmdCaseControl.pAllelic(172):    pAllelic: 4.7409354467764136E-7 FisherExactTest.pValueDown(16, 494, 114, 160, 1.0): 4.7409354467764136E-7   R: phyper( 16, 114, 380, 160, lower.tail = TRUE )   FisherExactTest.pValueUp(16, 494, 114, 160, 1.0): 0.999999878657337 R: phyper( 15, 114, 380, 160, lower.tail = FALSE )
SnpSiftCmdCaseControl.pDominant(244):   pDominant: 9.886717315139457E-11    FisherExactTest.pValueDown(16, 247, 57, 160, 1.0): 9.886717315139457E-11    R: phyper( 16, 57, 190, 160, lower.tail = TRUE )    FisherExactTest.pValueUp(16, 247, 57, 160, 1.0): 0.9999999999879926 R: phyper( 16, 57, 190, 160, lower.tail = FALSE )
SnpSiftCmdCaseControl.pRecessive(321):  pRecessive: 1.0 FisherExactTest.pValueDown(0, 247, 57, 0, 1.0): 1.0 R: phyper( 0, 57, 190, 0, lower.tail = TRUE )   FisherExactTest.pValueUp(0, 247, 57, 0, 1.0): 1.0   R: phyper( 0, 57, 190, 0, lower.tail = FALSE )

Explanation of one of the lines (the one you were interested in):

SnpSiftCmdCaseControl.pDominant(244):   # This is the exact location of the line of code, in case you want to read the source code for more details.
    pDominant: 9.886717315139457E-11                                            # p-value from Fisher exact test. This is the minimum p-value from Lower and Upper tail
    FisherExactTest.pValueDown(16, 247, 57, 160, 1.0): 9.886717315139457E-11    # Arguments and function used to calculate the "lower tail" p-value
    R: phyper( 16, 57, 190, 160, lower.tail = TRUE )                            # This is an equivalent R function to calculate the same (so you can check the calculation using R)
    FisherExactTest.pValueUp(16, 247, 57, 160, 1.0): 0.9999999999879926         # Same for the "upper tail"
    R: phyper( 16, 57, 190, 160, lower.tail = FALSE )                           # R function to calculate the upper tail p-value```

Interpretation of the results

The result of the lower tail p-value is 9.8e-11, let's check using R

R
> phyper( 16, 57, 190, 160, lower.tail = TRUE )
[1] 9.886717e-11

well, that looks the same to me, so it seems that the value is correct.

Explanation with the "urn analogy" (I'm following the analogy of black/white balls is from R's phyper documentation): phyper( 16, 57, 190, 160, lower.tail = TRUE ):

Under normal circumstances the number of "white balls drawn form the urn" (cases) you'd expect would be:

    57
---------  * 160 = 36.7
57 + 190

so, you'd expect to retrieve 37 mutations in the cases, but there are only 16, this is extremely surprising (i.e. very low p-value).

For instance, if "cases" denote a disease, you could interpret that the mutation is conferring "resistance to the disease", i.e. people having the mutation are less prone to have the disease, and that's why it would be found in many "controls" and only in a few "cases".

I hope this helps.

Pablo