google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

run deepvariant using ultima data #711

Closed lykstudio closed 1 year ago

lykstudio commented 1 year ago

Hello,

I'm trying to run DeepVariant on Ultima data (a CRAM file). The DeepVariant 1.5.0 release page says the tool provides --enable_joint_realignment and --p_error, but I get an error message when I try to use the --enable_joint_realignment option.

Could you advise which custom channels or options I should use to run DeepVariant on Ultima data?

doron-st commented 1 year ago

Hi @lykstudio,

Ultima maintains a separate fork of DeepVariant. It's not completely public yet; please contact me at doron.shemtov@ultimagen.com for early-access details.

See usage instructions below.

Best, Doron

Generating unfiltered germline callset

Requirements

The workflow below assumes that you have a sorted, duplicate-marked UG BAM/CRAM file. We also assume that you have built the UG DeepVariant Docker image (>=1.4.12) per the separate instructions file. Official GATK and Picard installations are also required.

Files required for the analysis (download locally)

The following files are publicly available:

gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list

UG-specific files

DeepVariant model files

DeepVariant-ug docker container:

gcr.io/ganymede-331016/deepvariant:ug-1.4.13

Generating whole genome callset

The instructions below require the ability to run whole-genome DeepVariant. DeepVariant is split into three stages: make_examples, call_variants, and postprocess_variants. We normally run make_examples in parallel on a genome split into 200 equal intervals through the Cromwell engine, but any parallelization engine will do. Each make_examples process creates an examples.tfrecords.gz file, which contains a single image and variant metadata for each candidate. The set of tfrecords files is read by call_variants as a single TensorFlow dataset. call_variants runs optimally on a machine with a single GPU, such as an NVIDIA A10 or V100. The output of call_variants is a single tfrecords.gz file containing the call probabilities for each candidate. postprocess_variants parses the output of call_variants and converts it to a VCF file.
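The three stages described above chain together roughly as follows. This is a dry-run sketch only: the paths, shard count, and checkpoint name are placeholders, and `run` just echoes each command instead of executing it.

```shell
# Dry-run sketch of the three-stage pipeline; remove the 'echo' in run()
# to actually execute. All paths and the checkpoint name are placeholders.
run() { echo "$@"; }

run /opt/deepvariant/bin/make_examples --mode calling \
    --ref Homo_sapiens_assembly38.fasta --reads input.bam \
    --examples interval001.tfrecord.gz
run /opt/deepvariant/bin/call_variants \
    --outfile call_variants_output.tfrecord.gz \
    --examples examples.tfrecord@200.gz \
    --checkpoint model.ckpt
run /opt/deepvariant/bin/postprocess_variants \
    --ref Homo_sapiens_assembly38.fasta \
    --infile call_variants_output.tfrecord.gz \
    --outfile output.vcf.gz
```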

Running make_examples

Generate scattered intervals

mkdir out && \
picard \
  IntervalListTools \
  SCATTER_COUNT=200 \
  SUBDIVISION_MODE=BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
  UNIQUE=true \
  SORT=true \
  BREAK_BANDS_AT_MULTIPLES_OF=100000 \
  INPUT=wgs_calling_regions.hg38.interval_list \
  OUTPUT=out
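If the scatter succeeds, out/ should contain one subdirectory per shard. The loop below reconstructs the expected paths, assuming Picard's temp_NNNN_of_N directory naming convention, which matches the --regions path used with make_examples later in these instructions:

```shell
# List the scattered interval files produced by IntervalListTools,
# assuming Picard's temp_NNNN_of_N directory naming convention.
scatter_count=200
for i in $(seq 1 "$scatter_count"); do
  printf 'out/temp_%04d_of_%d/scattered.interval_list\n' "$i" "$scatter_count"
done
```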

The steps below describe running on a single interval.

Required hardware:

Select reads from a single region of the genome. In our system this is necessary to avoid keeping multiple unnecessary copies of large BAM/CRAM files:

gatk --java-options "-Xms1G" PrintReads \
            -I {input_bam} \
            -O input.bam \
            -L {interval} \
            -R Homo_sapiens_assembly38.fasta

# DeepVariant receives intervals in bed format
gatk IntervalListToBed -I {interval} -O interval.bed
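For reference, this conversion also changes coordinate conventions: interval_list records are 1-based with inclusive ends, while BED is 0-based and half-open. A hypothetical one-record illustration (to_bed is not a GATK tool, just a sketch of the coordinate shift):

```shell
# Illustrative helper (not part of GATK): convert one 1-based inclusive
# interval_list coordinate pair to a 0-based half-open BED line.
to_bed() {
  printf '%s\t%d\t%d\n' "$1" "$(( $2 - 1 ))" "$3"
}

to_bed chr1 100 200   # 1-based interval chr1:100-200 becomes BED start 99, end 200
```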

Run make_examples on a single resulting interval to generate the examples tfrecords file and an optional gVCF tfrecords file:

/opt/deepvariant/bin/make_examples \
      --mode calling \
      --ref ~{ref_fasta} \
      --reads input.bam \
      --regions out/temp_0001_of_200/scattered.interval_list \
      --examples interval001.tfrecord.gz \
      --min_base_quality 5 \
      --dbg_min_base_quality 0 \
      --vsc_min_fraction_indels 0.06 \
      --vsc_min_fraction_hmer_indels 0.12 \
      --vsc_min_fraction_snps 0.12 \
      --vsc_min_count_snps 2 \
      --ws_min_windows_distance 20 \
      --min_mapping_quality 5 \
      --candidate_min_mapping_quality 5 \
      --max_reads_per_partition 1500 \
      --aux_fields_to_keep tp,t0 \
      --skip_bq_channel \
      --channels hmer_deletion_quality,hmer_insertion_quality,non_hmer_insertion_quality \
      --add_ins_size_channel \
      --max_ins_size 10 \
      --optimal_coverages 50 \
      --p_error 0.005 \
      --gvcf out_temp_0001_of_200_gvcf.tfrecords.gz

Run CallVariants

The CallVariants step runs on all examples from all samples together. Before running CallVariants, all of the output example files should be copied to the same instance and renamed in the format examples.tfrecord-%05d-of-%05d.gz, for example examples.tfrecord-00041-of-00055.gz. The order of the file names should correspond to genomic order, and the first index is 00000.
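A minimal sketch of that renaming step, assuming the per-shard output files already sort into genomic order by name and contain no spaces (the rename_shards helper and the interval*.tfrecord.gz input names are hypothetical):

```shell
# Rename per-shard make_examples outputs into the
# examples.tfrecord-%05d-of-%05d.gz layout expected by call_variants.
# Assumes lexicographic file-name order matches genomic order.
rename_shards() {
  dir=$1
  total=$(ls "$dir"/*.tfrecord.gz | wc -l | tr -d ' ')
  idx=0
  for f in $(ls "$dir"/*.tfrecord.gz | sort); do
    mv "$f" "$dir/$(printf 'examples.tfrecord-%05d-of-%05d.gz' "$idx" "$total")"
    idx=$((idx + 1))
  done
}
```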

Required hardware:

command:

/opt/deepvariant/bin/call_variants \
  --outfile call_variants_output.tfrecord.gz \
  --examples examples.tfrecord@<fill total number of shards>.gz \
  --checkpoint <model name, without the .index extension, e.g. deepvariant-ultima-germline-wgs-model-v1.2.ckpt-710000>
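The @ notation is DeepVariant's sharded-file spec: examples.tfrecord@55.gz stands for examples.tfrecord-00000-of-00055.gz through examples.tfrecord-00054-of-00055.gz. A hypothetical helper that expands such a spec (expand_spec is not shipped with DeepVariant):

```shell
# Illustrative helper (not part of DeepVariant): expand a sharded-file
# spec like examples.tfrecord@200.gz into its concrete shard names.
expand_spec() {
  prefix=${1%@*}       # part before '@', e.g. examples.tfrecord
  rest=${1#*@}         # shard count plus extension, e.g. 200.gz
  n=${rest%%.*}        # shard count, e.g. 200
  suffix=${rest#"$n"}  # extension, e.g. .gz
  i=0
  while [ "$i" -lt "$n" ]; do
    printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n" "$suffix"
    i=$((i + 1))
  done
}

expand_spec examples.tfrecord@2.gz   # prints the two shard file names
```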

Run PostProcessing:

Required hardware:

command:

/opt/deepvariant/bin/postprocess_variants \
  --ref Homo_sapiens_assembly38.fasta \
  --infile call_variants_output.tfrecord.gz \
  --outfile output.vcf.gz \
  --vcf_stats_report=False

Final notes

AndrewCarroll commented 1 year ago

Hi @lykstudio

Just a quick comment: the Google team has worked closely with the Ultima team and @doron-st, and we can endorse their solution for Ultima data.