hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
187 stars 58 forks source link

SAGE: repeated variants with different read context in output vcf #373

Closed rebber closed 1 year ago

rebber commented 1 year ago

Hi,

I'm running SAGE v 3.2.3 and have discovered that some variants appear multiple times in the same vcf. CHROM, POS, REF, ALT is the same, but the read context (INFO tag RC) differs slightly and thus also other tags related to read context.

Examples:

6       138818795       .       A       C       262     strandBias      RC=TCCAG;RC_IDX=2;RC_LF=AGTACTTAGC;RC_NM=1;RC_REPC=2;RC_REPS=C;RC_RF=AAATAACTCA;REP_C=2;REP_S=A;TIER=PANEL;TNC=CAA      GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:0:829,0:0:829:37097,0:908,0:0,0,0,0,0,829,829:0:0,0,0:0,0,0,0,0,19217,19217:908:0.5 ./.:34:5787,13:0.002241:5800:313340,442:6402,13:7,1,0,0,5,5787,5800:0:0,0,0:241,21,0,0,162,138985,139409:6415:1
6       138818795       .       A       C       162     minTumorQual;strandBias RC=TCCNG;RC_IDX=2;RC_LF=AGTACTTAGC;RC_NM=2;RC_REPC=2;RC_REPS=C;RC_RF=AAATAACTCA;REP_C=2;REP_S=A;TIER=PANEL;TNC=CAA      GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:0:829,0:0:829:37097,0:908,0:0,0,0,0,0,829,829:0:0,0,0:0,0,0,0,0,18721,18721:908:0.5 ./.:34:5787,13:0.002241:5800:313340,442:6402,13:5,0,0,0,8,5787,5800:0:0,0,0:162,0,0,0,262,128136,128560:6415:1
7       13988790        .       T       A       272     PASS    RC=ATTAAAAAAAAAAAAT;RC_IDX=3;RC_LF=TATAGCAGTG;RC_NM=1;RC_REPC=12;RC_REPS=A;RC_RF=CTCTGTAAAA;REP_C=11;REP_S=A;TIER=PANEL;TNC=TTA GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:0:368,2:0.003854:519:15821,68:401,2:0,0,0,0,2,368,519:0:0,0,0:0,0,0,0,32,8216,11048:553:0.5 ./.:52:2991,28:0.007156:3913:164854,1964:3258,43:8,2,0,0,18,2991,3913:0:8,4,26:271,27,0,0,243,71012,92254:4225:0.679
7       13988790        .       T       A       121     minTumorQual    RC=ATTAAAAAAAAAAAAAT;RC_IDX=4;RC_LF=TATAGCAGTG;RC_NM=2;RC_REPC=13;RC_REPS=A;RC_RF=CTCTGTAAAA;REP_C=11;REP_S=A;TIER=PANEL;TNC=TTA        GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:0:349,2:0.004065:492:15821,68:401,2:0,0,0,0,2,349,492:0:0,0,0:0,0,0,0,32,8057,10838:553:0.5 ./.:37:2862,28:0.007511:3728:164854,1964:3258,43:6,0,0,0,22,2862,3728:0:7,0,15:136,0,0,0,402,69776,90734:4225:0.679
10      89648905        .       A       T       382     maxGermlineRelRawBaseQual       RC=CTTTTTTTTTTTAAAT;RC_IDX=11;RC_LF=CTGTCTTCTG;RC_NM=1;RC_REPC=11;RC_REPS=T;RC_RF=GAATGGATAT;REP_C=10;REP_S=T;TIER=PANEL;TNC=TAA        GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:39:436,3:0.006834:439:18489,113:479,3:2,0,0,0,1,436,439:1:0,0,0:24,0,0,0,17,9720,9761:482:0.333     ./.:47:2801,24:0.008475:2832:148048,1740:3043,40:11,4,0,0,9,2801,2832:0:3,4,14:372,24,0,0,211,66239,67008:3083:0.375
10      89648905        .       A       T       124     maxGermlineRelRawBaseQual;minTumorQual  RC=TGCTTTTTTTTTTTTAAAT;RC_IDX=14;RC_LF=TGCTGTCTTC;RC_NM=2;RC_REPC=12;RC_REPS=T;RC_RF=GAATGGATAT;REP_C=10;REP_S=T;TIER=PANEL;TNC=TAA     GT:ABQ:AD:AF:DP:RABQ:RAD:RC_CNT:RC_IPC:RC_JIT:RC_QUAL:RDP:SB    ./.:34:410,3:0.007264:413:18489,113:479,3:1,0,0,0,2,410,413:0:2,0,4:17,0,0,0,24,9122,9163:482:0.333     ./.:39:2635,23:0.00863:2665:148048,1740:3043,40:5,1,0,0,17,2635,2665:1:10,0,20:117,27,0,0,433,62217,62947:3083:0.348

Why is it that they're output separately like this? How should I think about the quality for such variants and how to interpret them? E.g. the chr 7 case, where one entry says FILTER =PASS and the other is filtered out? Do you have any recommendations for merging the duplicate variants, or how to prioritize between them?

I have run with the -panel_only argument because I have targeted data, and have adjusted options according to https://github.com/hartwigmedical/hmftools/blob/master/README_TARGETED.md but with a lower min VAF. This is the exact command I have run:

minAF=0.0002
bqThres=30  # base quality threshold
mqThres=20  # mapping quality threshold
hpMinTQ=150  # hotspot min tumor qual
panelMinTQ=250  # panel min tumor qual
threads=6
java -Xms4G -Xmx32G -cp /path/to/sage_v3.2.3.jar com.hartwig.hmftools.sage.SageApplication \
    -threads $threads -reference $nameN -reference_bam $bamN \
    -tumor $nameT -tumor_bam $bam -ref_genome_version 37 -ref_genome $ref \
    -hotspots $hotspots -panel_bed $targets -ensembl_data_dir $ensembl \
    -out $outVcf -hard_min_tumor_vaf $minAF -hotspot_min_tumor_qual $hpMinTQ \
    -hotspot_min_tumor_vaf $minAF -min_map_quality $mqThres -min_avg_base_qual $bqThres \
    -panel_min_tumor_qual $panelMinTQ -panel_min_tumor_vaf $minAF \
    -panel_only -write_bqr_data

Best regards Rebecka

p-priestley commented 1 year ago

Hi Rebecka,

SAGE calls a candidate variant for each unique chr,pos,ref,alt AND readContext. The readContext includes 2 flanking bases on each side and sometimes more if there is a repeat which makes the variant ambiguous, so it is possible and expected that we may have multiple records with the same variant but different context. The typical case where this may occur is when you have 2 somatic variants within 2 bases of each other which is partially phased (ie all reads of variant B contain variant A, but only some reads with variant A contain variant B). The most common case for this is a mixed somatic germline MNV. Another case is when you call the same variant, but with repeat contexts of different length (such as your example above at 8:13988790).

Later when we phase, we use the ref context to help with the phasing, including for identifying partially phased variants. Only PASS variants are considered for this phasing. If variants are partially phased we will normally call a phased MNV and an unphased SNV, with the duplicate SNV deduped by the MNV.

In practice, I think it is very unlikely that we call 2 identical variants with different read contexts where both variants PASS, but I cannot rule this out.

So in summary: this is expected behaviour, but should be exceptionally rare to have 2 duplicate variants that both PASS.

Peter