kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
95 stars 12 forks source link

Potentially inconsistent counting for reads supporting call #114

Open aroon-color opened 1 day ago

aroon-color commented 1 day ago

Hi @kcleal,

I'm seeing some weird behavior from Dysgu around read counting and was curious if you could help. I'll preface this by saying the sequencing data I'm working with is not WGS/WES but instead target enrichment data, which might be a confounding variable. Not entirely sure how Dysgu's modeling deals with more peaky data.

Attached is a BAM subset to the region I'm curious about in MUTYH and the corresponding Dysgu VCF. Dysgu command: .venv/bin/dysgu run b37/GRCh37.p12.fa temp sample_roi.bam --max-cov -1 -p 6 --min-support 3 --svs-out sample_roi.vcf -x --min-size 50 --metrics. Note that this BAM was created by taking the alignment file from dysgu fetch and subsetting it to 1:45790000-45800000.

sample_roi_bam_vcf.zip

I'm seeing Dysgu call a 67bp deletion starting at chr1:45797418 (\t -> \n to make reading easier):

1
45797418
132
T
<DEL>
.
PASS
SVMETHOD=DYSGUv1.6.7;SVTYPE=DEL;END=45797485;CHR2=1;GRP=131;NGRP=3;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=67;KIND=intra_regional;REP=0.163;REPSC=0.000;GC=63.33;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=0;OL=0;SU=1384;WR=1;PE=838;SR=0;SC=1199;BND=544;LPREC=1;RT=pe
GT:GQ:NMP:NMS:NMB:MAPQP:MAPQS:NP:MAS:SU:WR:PE:SR:SC:BND:SQC:SCW:SQR:BE:COV:MCOV:LNK:NEIGH:NEIGH10:RB:PS:MS:SBT:NG:NSA:NXA:NMU:NDC:RMS:RED:BCC:FCC:STL:RAS:FAS:ICN:OCN:CMP:RR:JIT:PROB
1/1:200:0.48:0.0:0.45:59.92:0.0:1085:0:1384:1:838:0:1199:544:-1.0:0.0:0.993:0:16.88:253.0:0:0:0:150:625:758:0.0:0.01:0.84:0.0:0:1:0:0:0:1.634:2:0:0:2.892:1.77:50.67:0.154:0.0:0.487

The part that's confusing to me is the read metrics Dysgu says support the call, they are: SU=1384;WR=1;PE=838;SR=0;SC=1199;BND=544. But if I manually inspect that region in IGV, I see no evidence of the hundreds of SU, PE, SC, or BND alignments that would support this call? Average coverage across this region is ~110x. I see a similar pattern with short DEL (\~<80bp) and INV (\~<400bp) across several of the samples I've tested so far, where Dysgu seems to be producing a very confident call given the read support indicated but then inspection of the alignment doesn't back up these calls.

IGV 2024-11-05 - 08 53 16

Happy to provide more details if that would help!

@kkchau

kcleal commented 21 hours ago

Hi @aroon-color,

Thanks for sending over the example, it makes it much easier to investigate. I agree it looks like the support is very far off. I will need to run some tests to get to the bottom of this, but my guess is that it has something to do with the very high level of soft-clipping in the reads. Perhaps there are primers in the reads that need to be trimmed? It looks like almost all the reads in the area are soft-clipped. There are also lots of inversion-pattern reads (green colour below) - is this to be expected?

Screenshot 2024-11-05 at 20 32 54
aroon-color commented 21 hours ago

We do expect a slightly higher than usual number of reads to look like they have inversion signals but they're really just R1/R2 being assigned incorrectly during demux. The softclips are generally just PolyA-like sequences that result from the enrichment library prep, but I can also take a closer look at just these reads

kcleal commented 21 hours ago

I see. I will have a look in to as soon as possible.