google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.24k stars 729 forks source link

Variant not called although obvious in IGV #892

Closed hagen-wende closed 1 month ago

hagen-wende commented 1 month ago

Hello,

I have the issue that two obvious variations are not called. This is WGS nanopore sequencing of a transgenic mouse. The region in question is alien sequence (human for the most part) that was introduced in the genome, so these reads connect to a certain region in the genome. There are three variations in the region and only one is called: image

This is my command (version 1.6.1 or 1.5.0 give the same result) :

docker run \
     -v "${INPUT_DIR}":"/input" \
     -v "${OUTPUT_DIR}":"/output" \
     google/deepvariant:"${BIN_VERSION}" \
     /opt/deepvariant/bin/run_deepvariant \
     --model_type=ONT_R104 \
     --ref=/input/$FASTA_FILE \
     --reads=/input/$BAM_FILE \
     --output_vcf=/output/output.vcf.gz \
     --output_gvcf=/output/output.g.vcf.gz \
     --regions "my_region" \
     --intermediate_results_dir /output/intermediate_results_dir \
     --num_shards=$(nproc) \
     --postprocess_variants_extra_args="debug_output_all_candidates=ALT" \
     --make_examples_extra_args="vsc_min_count_snps=2,vsc_min_fraction_snps=0.06,vsc_min_count_indels=2,vsc_min_fraction_indels=0.06"

I initially thought it might be this issue https://github.com/google/deepvariant/issues/366 but the borders are more than 110 bp away. So is there any way to tackle this issue?

kishwarshafin commented 1 month ago

Hi @hagen-wende ,

Do you have the VCF outputs of these variants? If the candidates are picked then the VCF should have RefCall or some other notation to the candidate. Can you please provide that?

hagen-wende commented 1 month ago

this is everything that is in the vcf, so the deviations are not picked up at all. It is just one line with the 5' deviation. If I include other genomic region, than I do see more variations there, but for this region it is not recognized at all. image

output.vcf.txt

However both positions (3948 and 3949) show up in the output.g.vcf output.g.vcf.txt

image

*update: I just tried the same data with clair3 and it also does not call these two positions.

I then aligned against only the transgen in the genomic context (~50 kb each side) and then the variants are called correctly, so it seemed to have something to do with the reads also mapping to the other region.

So I checked the reads again in more detail and it turns out that at the 5' SNP only 2 reads are primary and at the two 3' SNPs there is only one primary alignment, the others are supplementary. Therefore vsc_min_count_snps=2 filters the 3' SNPs out, while the 5' barely passes. I then reduced vsc_min_count_snps to 1 and now I find also the two 3' SNPs in the vcf output, with one passing and the other being Refcall. Of course plenty of other SNPs show up as well, but with low QUAL values and being RefCall ...

So this is an issue of the way that I make the reference FASTA, but I don't see any other way of doing this, because otherwise I would have to duplicate the surrounding genomic DNA postentially ending up with many supplementary alignments as well.

kishwarshafin commented 1 month ago

@hagen-wende ,

Thank you for looking into this. In your use-case, variant calling would be extremely difficult as mapping is difficult. Also, you are operating at 5x coverage. We only see the performance of ONT that are considered to be good from 15x. With that said, you can use DeepVariant create alleles for your use. Just drop the values under the threshold and all alleles will be reported so it at least gives you some idea of where the variants could be.

hagen-wende commented 1 month ago

well 5x coverage is in part because I just look at one allele here and it is also underrepresented, our overall WGS coverage is usually between 20-30x. I will also think of how to deal with this issue, but it is at least good to know, where the issue is. I can still align against the transgene in the genomic context and call the variants for this alignment and then do a separate alignment against the entire genome and call variants outside the transgene (if I need it). Thanks!