google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.23k stars 725 forks source link

Variant calls after local realignment : is it the most accurate ? #763

Closed Fred-07 closed 5 months ago

Fred-07 commented 10 months ago

Dear all,

I would like to share an observation with you. Deepvariant reported a lot of variants in a region where no reads are mapped. It is the consequence of Deepvariant local realignment as expected. But the realigned reads do not seem to belong to this region : the orignal alignment shows less variants than the local realignment. The genome contains a big region repeated several times (see on the screenshots below). The repeats could be the reason why some reads are moved to the left. The given example is for X chromosome but another region in chromosome 1 (FLG2 gene region) showed a lot of variants.

Why the reads are moved to the left? Are the reads moved to the most possible left position?

Original alignment and detected variants (bwa-mem) image

Local realignment made by DeepVariant (X_140993145_140994144) image

Example of reads that were moved: Original alignment NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993994 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE XA:Z:X,+140993784,79M,2; PG:Z:MarkDuplicates AS:i:74 XS:i:69 MD:Z:17C61 NM:i:1 RG:Z:DM_23_2198 NB501857:464:HH7FWBGXV:2:23210:26812:14806 147 X 140994064 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA XA:Z:X,-140994589,50M3D29M,4; PG:Z:MarkDuplicates AS:i:79 XS:i:67 MD:Z:79 NM:i:0 RG:Z:DM_23_2198 Local realignment X:140993145-140994144/ X_140993145_140994144realigned_reads.bam X_140993145_140994144realigned_reads.bam.bai frmascla@frt:DeepV-TEST$ samtools view Local/X_140993145_140994144realigned_reads.bam | grep NB501857:464:HH7FWBGXV:2:23210:26812:14806 NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993784 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE NB501857:464:HH7FWBGXV:2:23210:26812:14806 19 X 140993854 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA

Original alignment, bam file https://www.dropbox.com/scl/fi/c9tc01sdtf2sroxj3u3bj/original_alignment.bam?rlkey=jgxnyhyse2ekcu6t1s3l3lnnl&dl=0 Local realignment, bam file https://www.dropbox.com/scl/fi/oqhny0s7h9hu3zcyrprig/X_140993145_140994144realigned_reads.bam?rlkey=zmbon72t19vjlcdht1zt6m5xg&dl=0

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md: Yes

Setup

lucasbrambrink commented 9 months ago

Hi,

Thanks for pointing this out! We are going to investigate this issue more closely and provide updates as we make progress. Thank you for providing links to the files, those are extremely helpful.

akolesnikov commented 5 months ago

Local realignment algorithm works as following:

To answer your question. Reads can move either direction as a result of the realignment. To investigate it further it would be helpful to see what haplotype was created as a result of the realignment. Would you be able to provide the reference file that you used?

Fred-07 commented 5 months ago

Thank you for these explanations. Please find below the link for the reference genome (hg19, slightly modified) https://www.dropbox.com/scl/fi/e9imci4i4rkv5vcpgujus/genome_PAR_masked.fasta.gz?rlkey=5fxpz3fbzy70glish6xwhckc9&dl=0

akolesnikov commented 5 months ago

Thank you for providing the reference. We were able to reproduce the issue locally. Meanwhile one workaround I'd like to suggest to analyze the specific position or small region where local realignment doesn't work is to run DeepVariant without the local realignment. It can be done by adding --realign_reads=false to --make_examples_extra_args flag and specifying a region with --regions flag.

AndrewCarroll commented 5 months ago

Hi @Fred-07,

As @akolesnikov mentions, the reason that the reads are re-aligned in this way has to do with some fundamental aspects of how alignment penalties work. The alignment method, Smith Waterman, has alignment score penalties for gap open and gap extend. These penalties make the alignments you see as scoring higher than ones that open and extend a gap early in the repeat.

Those alignment penalties are determined from looking at mismatch, insertion, and deletion rates across large amounts of homologous sequences. They work very well for high sequence complexity regions which have standard types of DNA replication error. In this highly repetitive context, the ways that sequence (specifically repeats) can be extended or deleted is a bit different. It might, in theory, be possible to derive better gap extend scores for regions like this, but this is both very hard and a more fundamental problem. I don't think this case is one we can easily solve without very extensive work.

I hope that the @akolesnikov suggestion of turning off realigner in regions of interest will be satisfactory.

Thank you, Andrew

pichuan commented 5 months ago

Hi @Fred-07 , given that @akolesnikov and @AndrewCarroll have investigated and answered this issue, I'll close it now.