nygenome / lancet

Microassembly based somatic variant caller for NGS data

format of Multi-allelic SNPs in VCF #27

Closed ahwanpandey closed 6 years ago

ahwanpandey commented 6 years ago

Hello,

I have attached variant calls for the same variant from Mutect2, Varscan2 and Lancet, which were fed into GATK's CombineVariants to generate consensus calls. I noticed that only Lancet outputs "multi-allelic" SNPs on separate lines: the same position appears with different alt alleles, one record PASS and the other filtered. Is this a known behavior? Is there a way to output just the passing variant at a multi-allelic site, or a single combined record per position? As you can see in the combinations found by CombineVariants, things start getting messy in the categories that contain both Lancet and filterInLancet, i.e. (Lancet-filterInLancet) and (filterInLancet-Lancet).

The following is an example for the category: set=Varscan2-Mutect2-filterInLancet-Lancet

# Mutect2
11      37085873        .       C       T       .       PASS    DP=112;ECNT=1;NLOD=11.68;N_ART_LOD=-1.311e+00;POP_AF=3.125e-05;P_GERMLINE=-9.889e+00;TLOD=61.90 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:46,24:0.349:30,12:16,12:30:318,386:60:20:true:false:0.500:2.890e-11:49.00:100.00:0.323,0.313,0.343:0.014,0.024,0.963        0/0:39,0:3.139e-03:23,0:16,0:0:303,0:0:0:false:false

#  Varscan2
11      37085873        .       C       T       .       PASS    DP=114;SOMATIC;SS=2;SSC=58;GPV=1E0;SPV=1.3654E-6        GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:40:40:0:0%:30,10,0,0      0/1:.:74:46:26:36.11%:26,20,11,15

# Lancet
11      37085873        .       C       A       2.57564 LowFisherScore;LowVafTumor;LowAltCntTumor;StrandBias    SOMATIC;FETS=2.57564;TYPE=snv;LEN=1;KMERSIZE=13;SB=2.80827      GT:AD:SR:SA:DP  0/0:34,0:24,10:0,0:34     0/1:41,1:21,20:1,0:42
11      37085873        .       C       T       46.7969 PASS    SOMATIC;FETS=46.7969;TYPE=snv;LEN=1;KMERSIZE=13;SB=6.76323      GT:AD:SR:SA:DP  0/0:40,0:27,13:0,0:40   0/1:46,20:23,23:10,10:66

# GATK CombineVariants
11      37085873        .       C       T,A     2.58    PASS    AC=2,1;AF=0.167,0.083;AN=12;DP=226;SOMATIC;set=Varscan2-Mutect2-filterInLancet-Lancet   GT:AF:DP:DP4:F1R2:F2R1:FREQ:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:RD:SA:SA_MAP_AF:SA_POST_PROB:SR       0/2:.,.:42:.:.,.,.:.,.,.:.:.,.:.,.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.:1,0:.,.,.:.,.,.:21,20    0/1:0.349:.:.:30,12:16,12:.:30:318,386:60:20:true:false:0.500:2.890e-11:49.00:100.00:.:.:0.323,0.313,0.343:0.014,0.024,0.963      0/1:.,.:74:26,20,11,15:.,.,.:.,.,.:36.11%:.,.:.,.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:46  0/0:.,.:34:.:.,.,.:.,.,.:.:.,.:.,.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.:0,0:.,.,.:.,.,.:24,10      0/0:3.139e-03:.:.:23,0:16,0:.:0:303,0:0:0:false:false   0/0:.,.:40:30,10,0,0:.,.,.:.,.,.:0%:.,.:.,.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:.,.:40

And all the categories formed in the full data:

2298000 set=FilteredInAll
      2 set=filterInLancet-Lancet
      1 set=filterInMutect2-filterInLancet-Lancet
   1183 set=filterInMutect2-Lancet
     81 set=filterInVarscan2-filterInMutect2-Lancet
     84 set=filterInVarscan2-Lancet
    152 set=filterInVarscan2-Mutect2
     54 set=filterInVarscan2-Mutect2-filterInLancet
    478 set=filterInVarscan2-Mutect2-Lancet
  12325 set=Intersection
   1885 set=Lancet
      4 set=Lancet-filterInLancet
   1176 set=Mutect2
    558 set=Mutect2-filterInLancet
      2 set=Mutect2-filterInLancet-Lancet
    888 set=Mutect2-Lancet
      1 set=Mutect2-Lancet-filterInLancet
   3572 set=Varscan2
    301 set=Varscan2-filterInLancet
    772 set=Varscan2-filterInMutect2
    180 set=Varscan2-filterInMutect2-filterInLancet
    242 set=Varscan2-filterInMutect2-Lancet
    146 set=Varscan2-Lancet
      1 set=Varscan2-Lancet-filterInLancet
    714 set=Varscan2-Mutect2
    221 set=Varscan2-Mutect2-filterInLancet
      4 set=Varscan2-Mutect2-filterInLancet-Lancet
      3 set=Varscan2-Mutect2-Lancet-filterInLancet

And if I run CombineVariants on just Varscan2 and Mutect2:

 186162 set=FilteredInAll
   1195 set=filterInMutect2-Varscan2
  13267 set=Intersection
   2625 set=Mutect2
    684 set=Mutect2-filterInVarscan2
   4019 set=Varscan2

gnarzisi commented 6 years ago

Yes, the expected behavior for Lancet is to report multiple alleles at the same position on separate lines. Can you elaborate in more detail on why this is a problem for your analysis?
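
For readers who want only the passing record at such sites before running a consensus tool, a minimal pre-filter sketch in Python follows. It is not part of Lancet or GATK: the function name `keep_pass_at_multiallelic` is hypothetical, and it assumes a standard tab-delimited VCF with CHROM, POS and FILTER in columns 1, 2 and 7. Filtered records at positions without any PASS record are deliberately kept, so that categories such as filterInLancet are still formed downstream.

```python
def keep_pass_at_multiallelic(vcf_lines):
    """Drop filtered records at positions that also carry a PASS record.

    Hypothetical helper, not a Lancet or GATK feature. Header lines and
    records at positions with no PASS record are passed through unchanged.
    """
    lines = [ln.rstrip("\n") for ln in vcf_lines]

    # First pass: collect every (CHROM, POS) that has at least one PASS record.
    pass_sites = set()
    for ln in lines:
        if not ln or ln.startswith("#"):
            continue
        fields = ln.split("\t", 7)  # only the first 7 columns are needed
        if fields[6] == "PASS":
            pass_sites.add((fields[0], fields[1]))

    # Second pass: keep a record unless it is filtered at a PASS site.
    kept = []
    for ln in lines:
        if not ln or ln.startswith("#"):
            kept.append(ln)
            continue
        fields = ln.split("\t", 7)
        site = (fields[0], fields[1])
        if fields[6] != "PASS" and site in pass_sites:
            continue
        kept.append(ln)
    return kept
```

Applied to the two Lancet records at 11:37085873 in the example above, this would keep the C>T PASS record and drop the filtered C>A record.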

ahwanpandey commented 6 years ago

Thanks for your reply. It is not so much a problem as an observation: it introduced some complexity during downstream analysis when using a consensus calling tool such as CombineVariants. Nothing that is a huge issue though. I just wasn't aware that variants at multi-allelic sites could be represented on multiple lines. The other tools I had tested (specifically VarScan2, VarDict and Mutect2) were either outputting only the passing variant at that site or representing the alleles as comma-separated variants. Also, only a very small number of variants were actually reported this way by Lancet (having both a passing and a failing variant at the same position), i.e. 18 out of 2,323,030 total somatic variants:

      2 set=filterInLancet-Lancet
      1 set=filterInMutect2-filterInLancet-Lancet
      4 set=Lancet-filterInLancet
      2 set=Mutect2-filterInLancet-Lancet
      1 set=Mutect2-Lancet-filterInLancet
      1 set=Varscan2-Lancet-filterInLancet
      4 set=Varscan2-Mutect2-filterInLancet-Lancet
      3 set=Varscan2-Mutect2-Lancet-filterInLancet
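
The figure of 18 can be reproduced from the category tally with a short Python sketch. The function name `count_mixed_lancet` is hypothetical, and the sketch assumes each tally line has the form `<count> set=<tag>-<tag>-...` as in the listings in this thread; it sums the categories whose `set` value contains both the `Lancet` and `filterInLancet` tags.

```python
def count_mixed_lancet(tally_lines):
    """Sum counts of categories tagged with both Lancet and filterInLancet.

    Hypothetical helper; expects lines like '   4 set=Lancet-filterInLancet'.
    """
    total = 0
    for ln in tally_lines:
        parts = ln.split()
        if len(parts) != 2 or not parts[1].startswith("set="):
            continue  # skip anything that is not a '<count> set=...' line
        count = int(parts[0])
        tags = parts[1][len("set="):].split("-")
        if "Lancet" in tags and "filterInLancet" in tags:
            total += count
    return total


tally = """
      2 set=filterInLancet-Lancet
      1 set=filterInMutect2-filterInLancet-Lancet
      4 set=Lancet-filterInLancet
      2 set=Mutect2-filterInLancet-Lancet
      1 set=Mutect2-Lancet-filterInLancet
      1 set=Varscan2-Lancet-filterInLancet
      4 set=Varscan2-Mutect2-filterInLancet-Lancet
      3 set=Varscan2-Mutect2-Lancet-filterInLancet
""".splitlines()

print(count_mixed_lancet(tally))  # 18
```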

I guess the main reason I posted this here is because Lancet was doing something slightly different than the 3 other tools when reporting variants in the VCF and wanted to confirm with the developer himself :)

gnarzisi commented 6 years ago

Glad everything is fine. I'll close the ticket then. Feel free to post any other issue or bug that you may face.