google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

run deepvariant using ultima data #711

Closed lykstudio closed 1 year ago

lykstudio commented 1 year ago

Hello,

I'm trying to run DeepVariant on Ultima data (a CRAM file). The DeepVariant 1.5.0 release page says the tool provides --enable_joint_realignment and --p_error, but I get an error message when I try to use the --enable_joint_realignment option.

Could you advise which custom channels or options I should use to run DeepVariant on Ultima data?

doron-st commented 1 year ago

Hi @lykstudio,

Ultima maintains a separate fork of DeepVariant. It's not completely public yet; please contact me at doron.shemtov@ultimagen.com for early-access details.

See usage instructions below.

Best, Doron

Generating unfiltered germline callset

Requirements

The workflow below assumes that you have a sorted, duplicate-marked UG BAM/CRAM file. We also assume that you have built the UG DeepVariant Docker image (>=1.4.12) per the separate instructions file. Official GATK and Picard installations are also required.

Files required for the analysis (download locally)

The following files are publicly available:

gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list

UG-specific files

DeepVariant model files

DeepVariant-ug docker container:

gcr.io/ganymede-331016/deepvariant:ug-1.4.13

Generating whole genome callset

The instructions below require the ability to run whole-genome DeepVariant. DeepVariant is split into three stages: make_examples, call_variants, and postprocess_variants. We normally run make_examples in parallel on a genome split into 200 equal intervals through the Cromwell engine, but any parallelization engine will do. Each make_examples process creates an examples.tfrecords.gz file, which contains a single image and variant metadata for each candidate. The set of tfrecords files is read by call_variants as a single TensorFlow dataset. call_variants runs optimally on a machine with a single GPU, such as an NVIDIA A10 or V100. The output of call_variants is a single tfrecords.gz file containing the call probabilities for each candidate. postprocess_variants parses the output of call_variants and converts it to a VCF file.
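The three stages described above chain together roughly as follows. This is a dry-run sketch only: the paths, shard count, and checkpoint name are placeholders, and `run` just echoes each command instead of executing it.

```shell
# Dry-run sketch of the three-stage pipeline; remove the 'echo' in run()
# to actually execute. All paths and the checkpoint name are placeholders.
run() { echo "$@"; }

run /opt/deepvariant/bin/make_examples --mode calling \
    --ref Homo_sapiens_assembly38.fasta --reads input.bam \
    --examples interval001.tfrecord.gz
run /opt/deepvariant/bin/call_variants \
    --outfile call_variants_output.tfrecord.gz \
    --examples examples.tfrecord@200.gz \
    --checkpoint model.ckpt
run /opt/deepvariant/bin/postprocess_variants \
    --ref Homo_sapiens_assembly38.fasta \
    --infile call_variants_output.tfrecord.gz \
    --outfile output.vcf.gz
```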

Running make_examples

Generate scattered intervals

mkdir out && \
picard \
  IntervalListTools \
  SCATTER_COUNT=200 \
  SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
  UNIQUE=true \
  SORT=true \
  BREAK_BANDS_AT_MULTIPLES_OF=100000 \
  INPUT=wgs_calling_regions.hg38.interval_list \
  OUTPUT=out
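If the scatter succeeds, out/ should contain one subdirectory per shard. The loop below reconstructs the expected paths, assuming Picard's temp_NNNN_of_N directory naming convention, which matches the --regions path used with make_examples later in these instructions:

```shell
# List the scattered interval files produced by IntervalListTools,
# assuming Picard's temp_NNNN_of_N directory naming convention.
scatter_count=200
for i in $(seq 1 "$scatter_count"); do
  printf 'out/temp_%04d_of_%d/scattered.interval_list\n' "$i" "$scatter_count"
done
```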

The steps below describe running on a single interval.

Required hardware:

Select reads from a single region of the genome. In our system this is necessary to avoid keeping multiple unnecessary copies of large BAM/CRAM files:

gatk --java-options "-Xms1G" PrintReads \
            -I {input_bam} \
            -O input.bam \
            -L {interval} \
            -R Homo_sapiens_assembly38.fasta

# DeepVariant receives intervals in bed format
gatk IntervalListToBed -I {interval} -O interval.bed
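For reference, this conversion also changes coordinate conventions: interval_list records are 1-based with inclusive ends, while BED is 0-based and half-open. A hypothetical one-record illustration (to_bed is not a GATK tool, just a sketch of the coordinate shift):

```shell
# Illustrative helper (not part of GATK): convert one 1-based inclusive
# interval_list coordinate pair to a 0-based half-open BED line.
to_bed() {
  printf '%s\t%d\t%d\n' "$1" "$(( $2 - 1 ))" "$3"
}

to_bed chr1 100 200   # 1-based interval chr1:100-200 becomes BED start 99, end 200
```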

Run make_examples on a single resulting interval to generate the examples tfrecords file and an optional gVCF tfrecords file:

/opt/deepvariant/bin/make_examples \
      --mode calling \
      --ref ~{ref_fasta} \
      --reads input.bam \
      --regions out/temp_0001_of_200/scattered.interval_list \
      --examples interval001.tfrecord.gz \
      --min_base_quality 5 \
      --dbg_min_base_quality 0 \
      --vsc_min_fraction_indels 0.06 \
      --vsc_min_fraction_hmer_indels 0.12 \
      --vsc_min_fraction_snps 0.12 \
      --vsc_min_count_snps 2 \
      --ws_min_windows_distance 20 \
      --min_mapping_quality 5 \
      --candidate_min_mapping_quality 5 \
      --max_reads_per_partition 1500 \
      --aux_fields_to_keep tp,t0 \
      --skip_bq_channel \
      --channels hmer_deletion_quality,hmer_insertion_quality,non_hmer_insertion_quality \
      --add_ins_size_channel \
      --max_ins_size 10 \
      --optimal_coverages 50 \
      --p_error 0.005 \
      --gvcf out_temp_0001_of_200_gvcf.tfrecords.gz

Run CallVariants

The CallVariants step runs on all examples from all samples together. Before running CallVariants, all of the output example files should be copied to the same instance and renamed in the format examples.tfrecord-%05d-of-%05d.gz, for example examples.tfrecord-00041-of-00055.gz. The order of the file names should correspond to genomic order, and the first index is 00000.
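A minimal sketch of that renaming step, assuming the per-shard output files already sort into genomic order by name and contain no spaces (the rename_shards helper and the interval*.tfrecord.gz input names are hypothetical):

```shell
# Rename per-shard make_examples outputs into the
# examples.tfrecord-%05d-of-%05d.gz layout expected by call_variants.
# Assumes lexicographic file-name order matches genomic order.
rename_shards() {
  dir=$1
  total=$(ls "$dir"/*.tfrecord.gz | wc -l | tr -d ' ')
  idx=0
  for f in $(ls "$dir"/*.tfrecord.gz | sort); do
    mv "$f" "$dir/$(printf 'examples.tfrecord-%05d-of-%05d.gz' "$idx" "$total")"
    idx=$((idx + 1))
  done
}
```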

Required hardware:

command:

/opt/deepvariant/bin/call_variants \
  --outfile call_variants_output.tfrecord.gz \
  --examples examples.tfrecord@<fill total number of shards>.gz \
  --checkpoint <model name, without the .index extension, e.g. deepvariant-ultima-germline-wgs-model-v1.2.ckpt-710000>
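The @ notation is DeepVariant's sharded-file spec: examples.tfrecord@55.gz stands for examples.tfrecord-00000-of-00055.gz through examples.tfrecord-00054-of-00055.gz. A hypothetical helper that expands such a spec (expand_spec is not shipped with DeepVariant):

```shell
# Illustrative helper (not part of DeepVariant): expand a sharded-file
# spec like examples.tfrecord@200.gz into its concrete shard names.
expand_spec() {
  prefix=${1%@*}       # part before '@', e.g. examples.tfrecord
  rest=${1#*@}         # shard count plus extension, e.g. 200.gz
  n=${rest%%.*}        # shard count, e.g. 200
  suffix=${rest#"$n"}  # extension, e.g. .gz
  i=0
  while [ "$i" -lt "$n" ]; do
    printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n" "$suffix"
    i=$((i + 1))
  done
}

expand_spec examples.tfrecord@2.gz   # prints the two shard file names
```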

Run PostProcessing:

Required hardware:

command:

/opt/deepvariant/bin/postprocess_variants \
  --ref Homo_sapiens_assembly38.fasta \
  --infile call_variants_output.tfrecord.gz \
  --outfile output.vcf.gz \
  --vcf_stats_report=False

Final notes

AndrewCarroll commented 1 year ago

Hi @lykstudio

Just a quick comment: the Google team has worked closely with the Ultima team and @doron-st, and we can endorse their solution for Ultima data.