Closed genieusbio closed 12 months ago
Hi @genieusbio,
Does the BED file include the 783006 region of chromosome 1, and did you also specify the MAKE_EXAMPLE_ARGS
variable? Version 1.1 is a bit old, but I understand why it is necessary from the code for vcf_candidate_importer.py
, as the get_candidate_positions()
function body is not necessary.
Thanks, Paul
Hi @genieusbio ,
If I recall correctly, there was one thing that I meant to fix in that tutorial but haven't gotten around to: Currently, our force calling only works on biallelic entries. But regular DeepVariant VCF can contain multiallelic entries. So, one more addition step I would like to add is to at least split the proposed VCF into only biallelic entries.
Although, in this case, it didn't seem to be a multallelic entry. I checked:
$ zcat input/force_calling.candidates.vcf.gz | grep -P 'chr1\t783006\t'
chr1 783006 . A G 50 PASS platforms=4;platformnames=PacBio,10X,Illumina,CG;datasets=6;datasetnames=CCS15kb_20kb,10XChromiumLR,HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=10;callsetnames=CCS15kb_20kbDV,10XLRGATK,CCS15kb_20kbGATK4,HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome;callable=CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable;filt=CS_CGnormal_filt,CS_HiSeqPE300xfreebayes_filt;difficultregion=AJtrio-HG003.hg38.300x.bam.bilkentuniv.072319.dups,hg38.segdups_sorted_merged,lowmappabilityall GT:PS:DP:ADALL:AD:GQ 1/1:.:803:15,330:0,101:494
Just from looking at the entry, without digging into it, I would think that this entry should show up at the end.
I'll take a closer look (and run through my own gist again) see if I can figure out.
Hi Pi-Chuan and @genieusbio,
Try it with --regions chr1:783006-783007
- you might see magic happen :)
$ zcat output/HG003.output.vcf.gz | grep -P 'chr1\t783006\t'
chr1 783006 . A G 0 RefCall . GT:GQ:DP:AD:VAF:PL 0/0:30:0:0,0:0:0,33,33
$ zcat output/HG003.output.g.vcf.gz | grep -P 'chr1\t783006\t'
chr1 783006 . A G,<*> 0 RefCall . GT:GQ:DP:AD:VAF:PL 0/0:30:0:0,0,0:0,0:0,33,33,990,990,990
My preference would be to update the BED file with that region.
~p
I see. @pgrosu are you saying that if we just grep the output HG003.output.vcf.gz
file (like I did with the input), we'll see it?
I ended up not having time to test it last night. So I haven't gone through a run to check.
@genieusbio if you can confirm that, that will be great! I can also check later. I still need to work on a few other things before I get to this.
Yes, I can give a go.
Hi @genieusbio,
This first run is just a validation. So try it like in the following script with --regions chr1:783006-783007
and --num_shards 1
as well, as shown below. You might need to change your input/output/reference folders reflecting what you have on your system:
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.1.0"
MAKE_EXAMPLE_ARGS="variant_caller=vcf_candidate_importer,proposed_variants=/input/force_calling.candidates.vcf.gz,vsc_min_count_snps=0,vsc_min_count_indels=0,vsc_min_fraction_snps=0,vsc_min_fraction_indels=0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions chr1:783006-783007 \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards 1 \
--make_examples_extra_args="${MAKE_EXAMPLE_ARGS}" \
--intermediate_results_dir /output/intermediate_results_dir
Let me know if there is anything I should expand for clarification. You might have to run it a couple of times to make it all go, but it should run fairly fast as the region is very small.
Thanks, Paul
Hi Pi-Chuan,
Yes, that is correct - you should see it in the generated output files (VCF and GVCF). The control-flow logic for finding these candidates is as follows for r1.1 of make_examples.py:
$1)
$ On line 613, options.calling_regions
is extended with the values from the user-defined --regions
flag:
options.calling_regions.extend(parse_regions_flag(flags_obj.regions))
$2)
$ On line 2082, in function make_examples_runner()
a request is made to provide these regions for processing:
regions = processing_regions_from_options(options)
$3)
$ On lines 1960-1961, in function processing_regions_from_options()
, the variable calling_regions
is initialized based on the contents of options.calling_regions
and options.exclude_calling_regions
:
calling_regions = build_calling_regions(ref_contigs, options.calling_regions,
options.exclude_calling_regions)
Which is then used to construct the regions
and region_list
on lines 1968-1975:
regions = regions_to_process(
contigs=contigs,
partition_size=options.allele_counter_options.partition_size,
calling_regions=calling_regions,
task_id=options.task_id,
num_shards=options.num_shards)
region_list = list(regions)
$4)
$ Continuing with function make_examples_runner()
, on lines 2105-2106 it finds candidates based on these regions, creating also the corresponding example files:
for region in regions:
candidates, examples, gvcfs, runtimes = region_processor.process(region)
Hope it helps, Paul
Hi @genieusbio ,
One thing to note:
In https://gist.github.com/pichuan/baba6ee9bd9890be2a45076a4934dd38 , when you run run_deepvariant
, this flag is specified: --regions /input/idt_capture_novogene.grch38.bed
.
Which means only the variants within the regions specified in the BED is used.
If you look at the BED file:
$ head -5 ./input/idt_capture_novogene.grch38.bed
chr1 69090 70008
chr1 450739 451678
chr1 685715 686654
chr1 925941 926013
chr1 930154 930336
which does not include the region you're looking for.
If you want to force call everything in that BAM file, you can simply remove the flag --regions /input/idt_capture_novogene.grch38.bed
from your run. The downside is that will take longer.
Or, if you just want to make sure that particular region is covered, then you can use @pgrosu 's suggestion and specify a small region that covers that range.
Hopefully this is clear.
Sometimes in variant calling, it's necessary to force genotype all known sites regardless if a variant is found. Illumina has it:
GATK also has a similar option.
How to do it in DeepVariant? I followed a DeepVariant tutorial at:
Unfortunately it didn't look like working for me. For example,
chr1:783006
was in theforce_calling.candidates.vcf.gz
file but the site is not reported in the two output files:HG003.output.g.vcf.gz
andHG003.output.vcf.gz
.