Closed lykstudio closed 1 year ago
Hi @lykstudio,
Ultima has a separate fork of deepvariant. It's not completely public yet. Please contact me at doron.shemtov@ultimagen.com for further details for early access.
See usage instructions below.
Best, Doron
The workflow below assumes that you have a sorted, duplicate marked UG BAM/CRAM file. We also assume that you built the UG DeepVariant docker (>=1.4.12) per separate instructions file. Official gatk and picard installations are required.
The following files are publicly available:
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list
DeepVariant model files
gs://concordanz/deepvariant/model/germline/v1.2_rc2/model.ckpt-710000.data-00000-of-00001
gs://concordanz/deepvariant/model/germline/v1.2_rc2/model.ckpt-710000.index
gs://concordanz/deepvariant/model/germline/v1.2_rc2/model.ckpt-710000.meta
DeepVariant-ug docker container:
gcr.io/ganymede-331016/deepvariant:ug-1.4.13
Instructions below require ability to run whole genome DeepVariant. DeepVariant is split into 3 stages: make_examples, call_variants, and postprocess_variants. We normally run make_examples on a genome split into 200 equal intervals in parallel through cromwell engine, but any parallelization engine will do. Each make_examples process creates examples.tfrecords.gz file which contains a single image and variant metadata for each candidate. The set of tfrecords file is read by call_variants as a single tensorflow dataset. call_variants runs optimally on a machine which contains a single gpu, such as nvidia-A10, or nvidia-v100. The output of call_variants is a single tfrecords.gz file which contains the call probabilities for each candidates. postprocess_variants parses the output of call_varaiants and converts it to a vcf file.
mkdir out && \
picard \
IntervalListTools \
SCATTER_COUNT=200 \
SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
UNIQUE=true \
SORT=true \
BREAK_BANDS_AT_MULTIPLES_OF=100000 \
INPUT=wgs_calling_regions.hg38.interval_list \
OUTPUT=out
Required hardware:
Select reads from only a single area of the genome. In our system it is necessary to avoid unnecessary multiple copies of large BAM/CRAM files
gatk --java-options "-Xms1G" PrintReads \
-I {input_bam} \
-O input.bam \
-L {interval} \
-R Homo_sapiens_assembly38.fasta
# DeepVariant receives intervals in bed format
gatk IntervalListToBed -I {interval} -O interval.bed
/opt/deepvariant/bin/make_examples \
--mode calling \
--ref ~{ref_fasta} \
--reads input.bam \
--regions out/temp_0001_of_200/scattered.interval_list \
--examples interval001.tfrecord.gz \
--min_base_quality 5 \
--dbg_min_base_quality 0 \
--vsc_min_fraction_indels 0.06 \
--vsc_min_fraction_hmer_indels 0.12 \
--vsc_min_fraction_snps 0.12 \
--vsc_min_count_snps 2 \
--ws_min_windows_distance 20 \
--min_mapping_quality 5 \
--candidate_min_mapping_quality 5 \
--max_reads_per_partition 1500 \
--aux_fields_to_keep tp,t0 \
--skip_bq_channel \
--channels hmer_deletion_quality,hmer_insertion_quality,non_hmer_insertion_quality \
--add_ins_size_channel \
--max_ins_size 10 \
--optimal_coverages 50 \
--p_error 0.005 \
--gvcf out_temp_0001_of_200_gvcf.tfrecords.gz \
CallVariants step runs on all examples from all samples together
Before running CallVariants, all the output example files should be copied to
the same instance and renamed in the format examples.tfrecord-%05d-of-%05d.gz
, for example: examples.tfrecord-00041-of-00055.gz
.
The order of the file names should correspond to the genomic order and the first index is 00000.
Required hardware:
command:
/opt/deepvariant/bin/call_variants \
--outfile call_variants_output.tfrecord.gz \
--examples examples.tfrecord@<fill total number of shards>.gz \
-checkpoint <model name, without the .index extension: deepvariant-ultima-germline-wgs-model-v1.2.ckpt-710000>
Required hardware:
command:
/opt/deepvariant/bin/postprocess_variants \
--ref Homo_sapiens_assembly38.fasta \
--infile call_variants_output.tfrecord.gz \
--outfile output.vcf.gz \
--vcf_stats_report=False
Hi @lykstudio
Just to quickly comment. The Google team has worked closely with the Ultima team and @doron-st and we can endorse their solutions for Ultima data.
Hello,
I'm trying to run DeepVariant using ultima data (cram file). I get information that deepvariant tool provides --enable_joint_realignment and --p_error in DeepVariant 1.5.0 release page. But, I got error message when I am trying to use --enable_joint_realignment options..
Can I get some advice which custom channels or options I should use to run deepvariant using ultima data?