Maybe false positive calls in deepvariant analysis of HiFi reads

PengJia6 commented 1 year ago

Hi DeepVariant team,

I have been using DeepVariant （v1.1.0）to call small variants for a human genome with approximately 50X coverage of HiFi reads. During my analysis, I noticed that DeepVariant consistently reports some false positives in some regions. For example, at position chr10:89013075-89013077, there is a SNV (see the IGV snapshot of both Illumina and HiFi reads), but DeepVariant identifies three variants at that location.

The vcf records by deepvariant:

chr10   89013075        .       T       TC      31.7    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:31:63:35,28:0.444444:31,0,43
chr10   89013076        .       CA      C       28.2    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:27:63:35,28:0.444444:28,0,34
chr10   89013077        .       A       C       30.6    PASS    .       GT:GQ:DP:AD:VAF:PL      0/1:18:34:0,34:1:30,0,17

The IGV snapshots:

I haven't identified any patterns in these regions. Have you encountered similar situations before? Could you please explain the possible reasons behind these false positive calls? I have provided three examples, including small BAM files, resulting VCF files, and my code, for you to replicate and investigate.

deepvariant_fp.zip

Other information of my test:

Operating system: Red Hat 4.8.5-36
DeepVariant version: 1.1.0
Installation method (Docker, built from source, etc.): Docker
reference : GRCh38_full_analysis_set_plus_decoy_hla.fa
Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?) ~50x Human HiFi seuquencing data.

Thank you very much for your attention and support. I look forward to your response.

Best regards, Peng Jia

kishwarshafin commented 1 year ago

@PengJia6 ,

Hi, can you please share the command that you are running for the PacBio HiFi BAM file? Also, are you saying the variant is reported with HiFi data?

pgrosu commented 1 year ago

@kishwarshafin The command is in the deepvariant_small_test.smk file. Below are it's contents:

# ======================================================================================================================
# Project: Project_Human_iharbor
# Script : deepvariant_small_test.smk TODO check 
# Author : Peng Jia
# Date   :  2023/6/13
# Email  : pengjia@stu.xjtu.edu.cn
# Description: TODO
# ======================================================================================================================

regions = [("chr10", 59597476, 59597478), ("chr10", 89013076, 89013077), ("chr18", 75132618, 75132618)]
samtools = "/data/home/pengjia/miniconda3/envs/default/bin/samtools"

dir_work = "/data/home/pengjia/Project_Human_DATA/reNBT2/01_deepvariant_test/"
path_ref_grch38 = "/data/DATA/Reference/human/GRCh38.d1.vd1/genome/GRCh38.d1.vd1.fa"
path_ref = "/data/DATA/Reference/human/GRCh38_full_analysis_set_plus_decoy_hla/genome/GRCh38_full_analysis_set_plus_decoy_hla.fa"

rule all:
    input:
        expand(dir_work + "vcfs/ChineseQuartet.{region}.deepvariant.vcf.gz",region=[f"{i[0]}_{i[1]}_{i[2]}" for i in regions])

def TODO1(wildcard):
    return

rule extract_bam:
    input:
        bam="/data/DATA/ChineseQuartet/ref_based_analysis/aligned_reads/ChineseQuartet/LCL5/ChineseQuartet.LCL5.GRCh38.HiFi.minimap2.bam",
        bai="/data/DATA/ChineseQuartet/ref_based_analysis/aligned_reads/ChineseQuartet/LCL5/ChineseQuartet.LCL5.GRCh38.HiFi.minimap2.bam",
    output:
        bam=dir_work + "bams/ChineseQuartet.{region}.bam",
        bed=dir_work + "bams/ChineseQuartet.{region}.bed",
    run:
        chrom, start, end = f"{wildcards.region}".split("_")
        start = int(start) - 1000
        end = int(end) + 1000
        shell("{samtools} view -h -O BAM {input.bam} {chrom}:{start}-{end} > {output.bam}")
        shell("echo '{chrom}\t{start}\t{end}' > {output.bed}")

rule deepvariant:
    input:
        bam=dir_work + "bams/ChineseQuartet.{region}.bam",
        bai=dir_work + "bams/ChineseQuartet.{region}.bam.bai",
        bed=dir_work + "bams/ChineseQuartet.{region}.bed",
        ref=path_ref
    output:
        vcf_gz=dir_work + "vcfs/ChineseQuartet.{region}.deepvariant.vcf.gz",
        gvcf_gz=dir_work + "vcfs/ChineseQuartet.{region}.deepvariant.g.vcf.gz"
    # gvcf_gz=config["dir_variants"] + "dv/dv_details/{sample}/{sample}.{prefix}.dv.raw.g.vcf.gz"
    log:
        dir_work + "vcfs/ChineseQuartet.{region}.deepvariant.log"
    benchmark:
        dir_work + "vcfs/ChineseQuartet.{region}.deepvariant.log"
    threads: 48
    run:
        dir_tmp = str(output.vcf_gz).rstrip(".vcf.gz") + "_tmp"
        file_tmp = dir_tmp.split("/")[-1]
        shell("mkdir -p " + dir_tmp)
        bam_dir = "/".join(str(input.bam).split("/")[:-1])
        bam_file = str(input.bam).split("/")[-1]
        bed_file = str(input.bed).split("/")[-1]
        ref_dir = "/".join(str(input.ref).split("/")[:-1])
        ref_file = str(input.ref).split("/")[-1]
        output_dir = "/".join(str(output.vcf_gz).split("/")[:-1])
        output_file = str(output.vcf_gz).split("/")[-1].rstrip(".vcf.gz")

        shell('docker run '
              '-v "{bam_dir}":"/input" '
              '-v "{ref_dir}":"/ref" '
              '-v "{output_dir}":"/output" '
              'google/deepvariant:1.1.0 /opt/deepvariant/bin/run_deepvariant '
              '--model_type=PACBIO '
              '--ref=/ref/{ref_file} '
              '--reads=/input/{bam_file} '
              '--regions /input/{bed_file} '
              '--output_vcf=/output/{output_file}.vcf '
              '--output_gvcf=/output/{output_file}.g.vcf '
              '--num_shards={threads} '
              '--make_examples_extra_args min_mapping_quality=1,keep_supplementary_alignments=true '
              '--intermediate_results_dir /output/{file_tmp} 1>{log} 2>{log}')
        shell("{bcftools} view -Oz -o {output.vcf_gz} {output_dir}/{output_file}.vcf")
        shell("{bcftools} view -Oz -o {output.gvcf_gz} {output_dir}/{output_file}.g.vcf")

rule samtools_index:
    input:
        "{preifx}.bam"
    output:
        "{preifx}.bam.bai"
    run:
        shell("{samtools} index {input}")

kishwarshafin commented 1 year ago

Thanks @pgrosu ,

@PengJia6 ,

Here's how my IGV looks like:

And here are the variant calls with DeepVariant 1.5:

chr10   89013075    .   T   TC  43.6    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:42:63:35,28:0.444444:43,0,47
chr10   89013076    .   CA  C   47.2    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:45:63:35,28:0.444444:47,0,49
chr10   89013077    .   A   C   43.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:41:34:0,34:1:43,0,43

This means you need to fix your IGV setup.

On IGV version 2.16.1 go to "View > Preference > Third Gen" and Uncheck the "Hide indels < show indels threshold".

It should look like this:

Once done, load your bams one more time and the variants will appear.

AndrewCarroll commented 1 year ago

Hi @PengJia6 @pgrosu

As @kishwarshafin indicates, when you visualize the Insertion and Deletion items, you'll see these events.

However, this is only part of the story. It looks to me like these are two equivalent ways of representing the same variant. The underlying event is that one chain of Cs is one longer and another chain of As is one shorter. This can be represented as a C -> A SNP event, or as a deletion of A and an insertion of C.

The issue is that the mapper is not consistently choosing a representation for the reads. For Illumina data, DeepVariant would perform its own realignment and make a representation that is consistent within itself. But we don't do that realignment step for DeepVariant.

I suspect, but have not validated, that example here relates to the use of homopolymer compression by minimap. That is, the AA in the reference is represented as only a single A during mapping, which can cause the location of the deletion not to get reconciled.

It may be possible to fix this by mapping the sample with the --preset HiFi flag (which now does not use homopolymer compression). I am trying to take a bit of time to look at the BAM file itself to confirm that.

PengJia6 commented 1 year ago

Thank you for your prompt and professional response. I have reviewed my files and reconfigured my IGV. As @AndrewCarroll mentioned, this anomaly is that two different representations of a varaint. While both representations are equivalent, I believe that people might prefer a simpler representation here (e.g., homozygous SNV instead of three heterozygous varaints). Do you have any recommendations for post-processing methods to normalize these types of varaints into a single representation? Alternatively, can DeepVariant introduce more advanced models or encoding methods to handle this situation?

AndrewCarroll commented 1 year ago

Hi @PengJia6

Please see This IGV screenshot

which shows the reads after rolling back to FASTQ and mapping with different parameters (top row) versus the original alignments (bottom row) for one of the samples. You can see the top row has a consistent representation in the reads.

I mapped these with pbmm2 using --preset HiFi. However, you should be able to replicate this effect in minimap using the flag -ax map-hifi

If remapping is an option, this will be the cleanest way to improve these cases. If you are considering remapping, I want to mention that we recommend mapping to a version of GRCh38 without ALT contigs (for example GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz. We observe this to have higher accuracy across technologies, but it makes an especially large difference for long read technologies.

I also note that you are using DeepVariant v1.1, but your BAM files don't have HaplotypePhase labels from running WhatsHap. Before version v1.4, we recommend a 2 step process of call-phase-call with WhatsHap in the middle for phasing. At version v1.4 and later, we started to do the phasing within DeepVariant, eliminating the need for this. If I could convince you to update to DeepVariant v1.5, you should see substantially better Indel accuracy as a result.

Finally, I don't know of any ways of efficiently doing a realignment on PacBio reads after mapping, but if you find one, let me know. We are working on some interesting ideas that will be able to handle this case even without remapping, and potentially other cases which are even more complex, but that is still under development.

pgrosu commented 1 year ago

Thank you @AndrewCarroll. If the read is treated as a minimal first-class object (just a simple map/dictionary) would suffice, then you should be able to perform realignment after mapping. This way, it can become reactive without slowing down the analysis. Basically the information would not reside in a file, but rather encapsulated in the read itself.

@PengJia6 The thing is that DeepVariant is position-focused at a specific base. That is how make_examples generates the variants from the allele counter for a region of a sample, which gets updated every time a new read is added to it. For instance, if you explore the allele counts, you'll get something like this for the different positions (the output is 0-based, and used the 1-based to identify each one):

For Position: 89013075:

position {
  reference_name: "chr10"
  position: 89013074
}
ref_base: "T"
ref_supporting_read_count: 35
read_alleles {
  key: "m64154_210327_091530/103023686/ccs/0"
  value {
    bases: "TC"
    type: INSERTION
    count: 1
  }
}
read_alleles {
  key: "m64154_210327_091530/128910218/ccs/0"
  value {
    bases: "TC"
    type: INSERTION
    count: 1
  }
}
...

For Position: 89013076:

position {
  reference_name: "chr10"
  position: 89013075
}
ref_base: "C"
ref_supporting_read_count: 35
read_alleles {
  key: "m64154_210327_091530/103023686/ccs/0"
  value {
    bases: "CA"
    type: DELETION
    count: 1
  }
}
read_alleles {
  key: "m64154_210327_091530/128910218/ccs/0"
  value {
    bases: "CA"
    type: DELETION
    count: 1
  }
}
...

For Position: 89013077:

position {
  reference_name: "chr10"
  position: 89013076
}
ref_base: "A"
read_alleles {
  key: "m64154_210327_091530/142213575/ccs/0"
  value {
    bases: "C"
    type: SUBSTITUTION
    count: 1
  }
}
read_alleles {
  key: "m64154_210327_091530/4130912/ccs/0"
  value {
    bases: "C"
    type: SUBSTITUTION
    count: 1
  }
}
...

Given that the allele type (indel/substitution) changes around a specific position, you might be able to use that information from a VCF to consolidate it into one variant.

Hope it helps, Paul

PengJia6 commented 1 year ago

Hi @AndrewCarroll and @pgrosu

Thank you for your clear explanation. I understand the this case now and I am looking forward to seeing your new methods for handling these cases, as I believe it will be a significant improvement. I will explore using the method @pgrosu provided to temporarily process these varaints and ensure their uniformity.

I will upgrade the version of DeepVariant in the next release of our project (Chinese Quartet). Furthermore, I noticed some information about DeepTrio in deepvariant homepage. Does DeepTrio support joint calling for quartet families (parents and two children)?

Best! Peng

AndrewCarroll commented 1 year ago

Hi @PengJia6

To your question about whether DeepTrio could operate on a quartet. The framework for DeepTrio could be extended to quartets. This is something that we had talked about when building DeepTrio. However, we'd need to get extensively characterized training data for this, and it would take additional work to train this model.

In the end, we concluded that there aren't going to be enough people with quartets that it justified extending DeepTrio for this use case at that time (consider all the other things we are working on for DeepVariant and DeepConsensus). It is something that we might revisit someday, but not for the near future.

PengJia6 commented 1 year ago

Thank you all for your answers and help. I have received answers to all my questions! Best wishes to you all.

google / deepvariant