google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.23k stars 727 forks source link

Strange false positive call #280

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello, (me again sorry), I am eyeballing bam files for evaluation of candidates. Mostly it's fine (I open a lot of issues but I want to stress that most reported variants seem correct). But here is a very, very strange case to my eyes.

Here is the gvcf line

chromosome_1    17434065    .   C   T,<*>   29.1    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:29:64:38,26,0:0.40625,0:29,0,99,990,990,990

That looks all right to me, seems like a "solid" candidate. But now, let's look at the bam

CT the relevant position is the first C starting from the right. As you see, there is not a single T base there. The site seems perfectly homozygous for C:C.

However, a bit on the left, you can see that many of the mapped reads abruptly end at the same position with a T. It's not a variant but something a bit strange seems to happen in that region.

So far it's the only such case I have. Do you have any idea of what's going on?

Here are the lines from the gVCF before the site. The abrupt T position is at 17434056, so in a block with no candidate. But just right after, for a few bases, DeepVariant threw a few calls that are completely not supported by the mapping.

zgrep -w -C8 "17434065" H4A4.g.vcf.gz
chromosome_1    17434049    .   T   G,<*>   29.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:29:62:36,25,0:0.403226,0:29,0,99,990,990,990
chromosome_1    17434050    .   G   <*> 0   .   END=17434056    GT:GQ:MIN_DP:PL 0/0:50:59:0,186,1859
chromosome_1    17434057    .   G   T,<*>   25.7    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:59:33,26,0:0.440678,0:25,0,71,990,990,990
chromosome_1    17434058    .   T   <*> 0   .   END=17434058    GT:GQ:MIN_DP:PL 0/0:50:63:0,189,1889
chromosome_1    17434059    .   G   T,<*>   25  PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:25:63:37,26,0:0.412698,0:25,0,77,990,990,990
chromosome_1    17434060    .   A   <*> 0   .   END=17434062    GT:GQ:MIN_DP:PL 0/0:50:64:0,165,1889
chromosome_1    17434063    .   G   C,<*>   26.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:64:38,25,0:0.390625,0:26,0,99,990,990,990
chromosome_1    17434064    .   A   <*> 0   .   END=17434064    GT:GQ:MIN_DP:PL 0/0:50:64:0,192,1919
chromosome_1    17434065    .   C   T,<*>   29.1    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:29:64:38,26,0:0.40625,0:29,0,99,990,990,990

any idea of what's going on there? It's a bit annoying in the sense that I don't know how I could have caught that without eyeballing the alignment. In my experiment I plan to eyeball all my candidates anyway (because there are less than 100 of them).

ghost commented 4 years ago

Ok, I found some other cases, everytime in the vicinity of an abrupt mapping ending W1 W2

Is DeepVariant catching what are reference genome misassemblies?

gunjanbaid commented 4 years ago

@aderzelle DeepVariant performs local realignment by default (can be disabled with --norealign_reads for the make_examples step). Realignment may cause this region to look a bit different than what you see in the original BAM, explaining the candidates that are generated.

By default, the realigned reads do not get written out, but you can use the flags below to output the realigned reads to a file called realigned_reads.bam. These flags can be added to the make_examples step so that you can inspect the realigned data.

--emit_realigned_reads - enables writing out of realigned reads --realigner_diagnostics=/path/to/dir - path to which the reads will be written

ghost commented 4 years ago

Thank you,

How exactly does it perform the realignment? How does it differ from, let's say, bwa mem?

MariaNattestad commented 4 years ago

DeepVariant uses Smith-Waterman for realignment, specifically this library: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.

serge2016 commented 4 years ago

Hello! But what behavior is correct for the two examples in this issue? It seems that this "local realignment" breaks the correct one and produces incorrect calls!

MariaNattestad commented 4 years ago

We can help take a look if you send us the bam and realigned bam, and even the make_examples output, if you can share any of these. Otherwise, you may want to experiment with turning realignment off by using the --norealign_reads flag.

ghost commented 4 years ago

Hello, isn't it a case of ... "well, here we need wetlab Sanger sequencing"? I certainly can share, but I need to find a time slot to relaunch DeepVariant with the production of evidence bam. I will be on it because I am curious. @MariaNattestad where can I send you the bam?

MariaNattestad commented 4 years ago

Hi @aderzelle If the question is why a variant is getting called when you see no sign of it in the original bam, then that can only be answered by inspecting the realigned bam, not by additional sequencing. It seems likely given the split reads that you are seeing the signal of some kind of misassembly or structural variant (if the genomes producing the bam and the reference are not the same). If seeing the realigned bam does not answer your question, then you can send the files to marianattestad@google.com and I can take a look. I would need to see the original bam, the realigned bam, the reference genome, and the position of the variant you are asking about.

ghost commented 4 years ago

@aderzelle DeepVariant performs local realignment by default (can be disabled with --norealign_reads for the make_examples step). Realignment may cause this region to look a bit different than what you see in the original BAM, explaining the candidates that are generated.

By default, the realigned reads do not get written out, but you can use the flags below to output the realigned reads to a file called realigned_reads.bam. These flags can be added to the make_examples step so that you can inspect the realigned data.

--emit_realigned_reads - enables writing out of realigned reads --realigner_diagnostics=/path/to/dir - path to which the reads will be written

Hello, how are we supposed to pass those arguments with the one-step wrapper script?

I tried

docker run -v "${INPUT_DIR}":"/input" -v "${OUTPUT_DIR}:/output" google/deepvariant:"${BIN_VERSION}" /opt/deepvariant/bin/run_deepvariant --model_type=WGS --ref=/input/shasta_final.fa --reads="/input/${SAMPLE}.sorted.bam" --regions="/input/ARCcestor.bed" --output_vcf=/output/${SAMPLE}.vcf.gz --output_gvcf=/output/${SAMPLE}.g.vcf.gz --num_shards="${N_SHARDS}" --customized_model="/input/mosquito_model/model.ckpt-97700" --make_examples_extra_args=emit_realigned_reads,realigner_diagnostics=/media/urbe/MyADrive/12-03-2020_DeepVariant_NDPD/bam_diagnostics

but runs into

Traceback (most recent call last):                                                                                                                                                                          
  File "/opt/deepvariant/bin/run_deepvariant.py", line 317, in <module>                                                                                                                                     
    app.run(main)                                                                                                                                                                                           
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run                                                                                                                               
    _run_main(main, args)                                                                                                                                                                                   
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main                                                                                                                         
    sys.exit(main(argv))                                                                                                                                                                                    
  File "/opt/deepvariant/bin/run_deepvariant.py", line 304, in main                                                                                                                                         
    commands = create_all_commands()                                                                                                                                                                        
  File "/opt/deepvariant/bin/run_deepvariant.py", line 276, in create_all_commands                                                                                                                          
    sample_name=FLAGS.sample_name))                                                                                                                                                                         
  File "/opt/deepvariant/bin/run_deepvariant.py", line 182, in make_examples_command                                                                                                                        
    kwargs.update(_extra_args_to_dict(extra_args))                                                                                                                                                          
  File "/opt/deepvariant/bin/run_deepvariant.py", line 135, in _extra_args_to_dict                                                                                                                          
    (flag_name, flag_value) = extra_arg.split('=')                                                                                                                                                          
ValueError: need more than 1 value to unpack     

EDIT: ah ok, the flag must explicitly be set to "=true" ^^'