google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.17k stars 713 forks source link

Force Genotyping #708

Closed genieusbio closed 12 months ago

genieusbio commented 1 year ago

Sometimes in variant calling, it's necessary to force genotype all known sites regardless if a variant is found. Illumina has it:

GATK also has a similar option.

How to do it in DeepVariant? I followed a DeepVariant tutorial at:

Unfortunately it didn't look like working for me. For example, chr1:783006 was in the force_calling.candidates.vcf.gz file but the site is not reported in the two output files: HG003.output.g.vcf.gz and HG003.output.vcf.gz.

pgrosu commented 1 year ago

Hi @genieusbio,

Does the BED file include the 783006 region of chromosome 1, and did you also specify the MAKE_EXAMPLE_ARGS variable? Version 1.1 is a bit old, but I understand why it is necessary from the code for vcf_candidate_importer.py, as the get_candidate_positions() function body is not necessary.

Thanks, Paul

pichuan commented 12 months ago

Hi @genieusbio ,

If I recall correctly, there was one thing that I meant to fix in that tutorial but haven't gotten around to: Currently, our force calling only works on biallelic entries. But regular DeepVariant VCF can contain multiallelic entries. So, one more addition step I would like to add is to at least split the proposed VCF into only biallelic entries.

Although, in this case, it didn't seem to be a multallelic entry. I checked:

$ zcat input/force_calling.candidates.vcf.gz | grep -P 'chr1\t783006\t'
chr1    783006  .       A       G       50      PASS    platforms=4;platformnames=PacBio,10X,Illumina,CG;datasets=6;datasetnames=CCS15kb_20kb,10XChromiumLR,HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=10;callsetnames=CCS15kb_20kbDV,10XLRGATK,CCS15kb_20kbGATK4,HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome;callable=CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable;filt=CS_CGnormal_filt,CS_HiSeqPE300xfreebayes_filt;difficultregion=AJtrio-HG003.hg38.300x.bam.bilkentuniv.072319.dups,hg38.segdups_sorted_merged,lowmappabilityall GT:PS:DP:ADALL:AD:GQ    1/1:.:803:15,330:0,101:494

Just from looking at the entry, without digging into it, I would think that this entry should show up at the end.

I'll take a closer look (and run through my own gist again) see if I can figure out.

pgrosu commented 12 months ago

Hi Pi-Chuan and @genieusbio,

Try it with --regions chr1:783006-783007 - you might see magic happen :)

$ zcat output/HG003.output.vcf.gz | grep -P 'chr1\t783006\t'
chr1    783006  .       A       G       0       RefCall .       GT:GQ:DP:AD:VAF:PL      0/0:30:0:0,0:0:0,33,33

$ zcat output/HG003.output.g.vcf.gz | grep -P 'chr1\t783006\t'
chr1    783006  .       A       G,<*>   0       RefCall .       GT:GQ:DP:AD:VAF:PL      0/0:30:0:0,0,0:0,0:0,33,33,990,990,990

My preference would be to update the BED file with that region.

~p

pichuan commented 12 months ago

I see. @pgrosu are you saying that if we just grep the output HG003.output.vcf.gz file (like I did with the input), we'll see it? I ended up not having time to test it last night. So I haven't gone through a run to check.

@genieusbio if you can confirm that, that will be great! I can also check later. I still need to work on a few other things before I get to this.

genieusbio commented 12 months ago

Yes, I can give a go.

pgrosu commented 12 months ago

Hi @genieusbio,

This first run is just a validation. So try it like in the following script with --regions chr1:783006-783007 and --num_shards 1 as well, as shown below. You might need to change your input/output/reference folders reflecting what you have on your system:

mkdir -p output
mkdir -p output/intermediate_results_dir

BIN_VERSION="1.1.0"
MAKE_EXAMPLE_ARGS="variant_caller=vcf_candidate_importer,proposed_variants=/input/force_calling.candidates.vcf.gz,vsc_min_count_snps=0,vsc_min_count_indels=0,vsc_min_fraction_snps=0,vsc_min_fraction_indels=0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WES \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
  --regions chr1:783006-783007 \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards 1 \
  --make_examples_extra_args="${MAKE_EXAMPLE_ARGS}" \
  --intermediate_results_dir /output/intermediate_results_dir

Let me know if there is anything I should expand for clarification. You might have to run it a couple of times to make it all go, but it should run fairly fast as the region is very small.

Thanks, Paul

pgrosu commented 12 months ago

Hi Pi-Chuan,

Yes, that is correct - you should see it in the generated output files (VCF and GVCF). The control-flow logic for finding these candidates is as follows for r1.1 of make_examples.py:

$1)$ On line 613, options.calling_regions is extended with the values from the user-defined --regions flag:

options.calling_regions.extend(parse_regions_flag(flags_obj.regions))

$2)$ On line 2082, in function make_examples_runner() a request is made to provide these regions for processing:

regions = processing_regions_from_options(options)

$3)$ On lines 1960-1961, in function processing_regions_from_options(), the variable calling_regions is initialized based on the contents of options.calling_regions and options.exclude_calling_regions:

  calling_regions = build_calling_regions(ref_contigs, options.calling_regions,
                                          options.exclude_calling_regions)

Which is then used to construct the regions and region_list on lines 1968-1975:

  regions = regions_to_process(
      contigs=contigs,
      partition_size=options.allele_counter_options.partition_size,
      calling_regions=calling_regions,
      task_id=options.task_id,
      num_shards=options.num_shards)

  region_list = list(regions)

$4)$ Continuing with function make_examples_runner(), on lines 2105-2106 it finds candidates based on these regions, creating also the corresponding example files:

    for region in regions:
      candidates, examples, gvcfs, runtimes = region_processor.process(region)

Hope it helps, Paul

pichuan commented 12 months ago

Hi @genieusbio ,

One thing to note:

In https://gist.github.com/pichuan/baba6ee9bd9890be2a45076a4934dd38 , when you run run_deepvariant, this flag is specified: --regions /input/idt_capture_novogene.grch38.bed.

Which means only the variants within the regions specified in the BED is used.

If you look at the BED file:

$ head -5 ./input/idt_capture_novogene.grch38.bed
chr1    69090   70008
chr1    450739  451678
chr1    685715  686654
chr1    925941  926013
chr1    930154  930336

which does not include the region you're looking for.

If you want to force call everything in that BAM file, you can simply remove the flag --regions /input/idt_capture_novogene.grch38.bed from your run. The downside is that will take longer.

Or, if you just want to make sure that particular region is covered, then you can use @pgrosu 's suggestion and specify a small region that covers that range.

Hopefully this is clear.