google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.16k stars 710 forks source link

Error running DeepVariant on non-human model organism #292

Closed moldach closed 4 years ago

moldach commented 4 years ago

I've had success following the Getting started guide with both CPU and GPU on the example datasets and now I'm trying to run the CPU version on my own data, C. elegans, but am getting an error:

Submission script for example

#!/bin/bash
#SBATCH --job-name=example_DV
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000
#SBATCH --time=0:20:0
#SBATCH --account=def-mtarailo
#SBATCH --output=/scratch/moldach/bin/DEEPVARIANT/logs/deepVar_example_%j.out
#SBATCH --error=/scratch/moldach/bin/DEEPVARIANT/logs/deepVar_example_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=moldach@ucalgary.ca

module load singularity

BIN_VERSION="0.10.0"
INPUT_DIR="/scratch/moldach/bin/DEEPVARIANT/quickstart-testdata"
OUTPUT_DIR="/scratch/moldach/bin/DEEPVARIANT/cpu-1cpu"
mkdir -p "${OUTPUT_DIR}"

# Pull the image.
singularity pull docker://google/deepvariant:"${BIN_VERSION}"

# Run DeepVariant.
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
        docker://google/deepvariant:"${BIN_VERSION}" \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
        --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
        --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
        --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
        --num_shards=1

Submission script for C. elegans

#!/bin/bash
#SBATCH --job-name=Celegans_DeepVar
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000
#SBATCH --time=0:20:0
#SBATCH --account=def-mtarailo
#SBATCH --output=/scratch/moldach/bin/DEEPVARIANT/logs/deepVar_Celegans_%j.out
#SBATCH --error=/scratch/moldach/bin/DEEPVARIANT/logs/deepVar_Celegans_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=moldach@ucalgary.ca

module load singularity

BIN_VERSION="0.10.0"
INPUT_DIR="/scratch/moldach/bin/DEEPVARIANT/MADDOG"
OUTPUT_DIR="/scratch/moldach/bin/DEEPVARIANT/celegans"
mkdir -p "${OUTPUT_DIR}"

# Pull the image.
singularity pull docker://google/deepvariant:"${BIN_VERSION}"

# Run DeepVariant.
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
        docker://google/deepvariant:"${BIN_VERSION}" \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref="${INPUT_DIR}"/c_elegans.PRJEB28388.WS274.genomic.fa \
        --reads="${INPUT_DIR}"/maddog_bam_trim_bwaMEM_sort_dedupped.bam \
        --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
        --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
        --num_shards=1

The error looks like:

FATAL:   Image file already exists: "deepvariant_0.10.0.sif" - will not overwrite
time="2020-03-31T17:40:13-07:00" level=warning msg="\"/run/user/3019658\" directory set by $XDG_RUNTIME_DIR does not exist. Either create the directory or unset $XDG_RUNTIME_DIR.: stat /run/user/3019658: no such file or directory: Trying to pull image in the event that it is a public image."
I0331 17:40:16.049175 47823917316800 run_deepvariant.py:241] Re-using the directory for intermediate results in /tmp/tmpl3fvinw4
I0331 17:40:24.867229 47384002755264 make_examples.py:386] ReadRequirements are: min_mapping_quality: 10
min_base_quality: 10
min_base_quality_mode: ENFORCED_BY_CLIENT

I0331 17:40:25.244051 47384002755264 genomics_reader.py:223] Reading /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam with NativeSamReader
I0331 17:40:25.256583 47384002755264 make_examples.py:535] Preparing inputs
I0331 17:40:25.453527 47384002755264 genomics_reader.py:223] Reading /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam with NativeSamReader
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1510, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/absl_py/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/absl_py/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1500, in main
    make_examples_runner(options)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1358, in make_examples_runner
    regions = processing_regions_from_options(options)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1279, in processing_regions_from_options
    options.min_shared_contigs_basepairs)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 647, in _ensure_consistent_contigs
    min_coverage_fraction)
  File "/tmp/Bazel.runfiles_p8uucuhy/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 725, in validate_reference_contig_coverage
    ref_bp, common_bp, coverage, format_contig_matches()))
ValueError: Reference contigs span 102092263 bases but only 0 bases (0.00%) were found in common among our input files. Check that the sources were created on a common genome reference build. Contig matches were: 
"chrV_pilon" is 21243235 bp and IS MISSING, 
"chrX_pilon" is 18110855 bp and IS MISSING, 
"chrIV_pilon" is 17759200 bp and IS MISSING, 
"chrII_pilon" is 15525148 bp and IS MISSING, 
"chrI_pilon" is 15331301 bp and IS MISSING, 
"chrIII_pilon" is 14108536 bp and IS MISSING, 
"chrM_pilon" is 13988 bp and IS MISSING. Here is a useful article about different human genome reference builds:
https://gatkforums.broadinstitute.org/gatk/discussion/11010/human-genome-reference-builds-grch38-hg38-b37-hg19
Please make sure the --ref input matches the build used for the input in --reads.
parallel: This job failed:
/opt/deepvariant/bin/make_examples --mode calling --ref /scratch/moldach/bin/DEEPVARIANT/MADDOG/c_elegans.PRJEB28388.WS274.genomic.fa --reads /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam --examples /tmp/tmpl3fvinw4/make_examples.tfrecord@1.gz --gvcf /tmp/tmpl3fvinw4/gvcf.tfrecord@1.gz --task 0

real    0m9.819s
user    0m2.982s
sys 0m2.851s
I0331 17:40:25.872794 47823917316800 run_deepvariant.py:321] None
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 332, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 319, in main
    subprocess.check_call(command, shell=True, executable='/bin/bash')
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time seq 0 0 | parallel --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/scratch/moldach/bin/DEEPVARIANT/MADDOG/c_elegans.PRJEB28388.WS274.genomic.fa" --reads "/scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam" --examples "/tmp/tmpl3fvinw4/make_examples.tfrecord@1.gz" --gvcf "/tmp/tmpl3fvinw4/gvcf.tfrecord@1.gz" --task {}' returned non-zero exit status 1.

This is what my input directory looks like:

c_elegans.PRJEB28388.WS274.genomic.fa
c_elegans.PRJEB28388.WS274.genomic.fa.fai
maddog_bam_trim_bwaMEM_sort_dedupped.bam
maddog_bam_trim_bwaMEM_sort_dedupped.bam.bai

I noticed there are a few more input files in the sample example quickstart-input; is it possible the error is caused by that?

NA12878_S1.chr20.10_10p1mb.bam
NA12878_S1.chr20.10_10p1mb.bam.bai
test_nist.b37_chr20_100kbp_at_10mb.bed
test_nist.b37_chr20_100kbp_at_10mb.vcf.gz
test_nist.b37_chr20_100kbp_at_10mb.vcf.gz.tbi
ucsc.hg19.chr20.unittest.fasta
ucsc.hg19.chr20.unittest.fasta.fai
ucsc.hg19.chr20.unittest.fasta.gz
ucsc.hg19.chr20.unittest.fasta.gz.fai
ucsc.hg19.chr20.unittest.fasta.gz.gzi

Trying to fill in the missing input files

I used bgzip to convert to gzip and faidx to get the .fai/.gzi files:

module load nixpkgs/16.09
module load gcc/7.3.0
module load samtools/1.9
bgzip c_elegans.PRJEB28388.WS274.genomic.fa
samtools faidx c_elegans.PRJEB28388.WS274.genomic.fa.gz

Next I download the .gff3 annotation from and converted it to .bed format:

module load nixpkgs/16.09
module load gcc/6.4.0
module load bedops/2.4.35

wget ftp://ftp.wormbase.org/pub/wormbase/releases/WS274/species/c_elegans/PRJEB28388/c_elegans.PRJEB28388.WS274.annotations.gff3.gz
bgzip -d c_elegans.PRJEB28388.WS274.annotations.gff3.gz
gff2bed < c_elegans.PRJEB28388.WS274.annotations.gff3 > c_elegans.PRJEB28388.WS274.annotations.bed
rm c_elegans.PRJEB28388.WS274.annotations.gff3

The .vcf.gz file I download from CeNDR (comparable to the DGV database in humans) then generate its index file vcf.gz.tbi:

wget https://storage.googleapis.com/elegansvariation.org/releases/20180527/variation/WI.20180527.impute.vcf.gz
module load nixpkgs/16.09
module load gcc/7.3.0
module load htslib/1.9
tabix -p vcf WI.20180527.impute.vcf.gz

Now my input directory looks like:

maddog_bam_trim_bwaMEM_sort_dedupped.bam
maddog_bam_trim_bwaMEM_sort_dedupped.bam.bai
c_elegans.PRJEB28388.WS274.annotations.bed
WI.20180527.impute.vcf.gz
WI.20180527.impute.vcf.gz.tbi
c_elegans.PRJEB28388.WS274.genomic.fa
c_elegans.PRJEB28388.WS274.genomic.fa.fai
c_elegans.PRJEB28388.WS274.genomic.fa.gz
c_elegans.PRJEB28388.WS274.genomic.fa.gz.fai
c_elegans.PRJEB28388.WS274.genomic.fa.gz.gzi

Now that I think I have all the appropriate input files in my INPUT_DIR I will try to run the code again:

FATAL:   Image file already exists: "deepvariant_0.10.0.sif" - will not overwrite
time="2020-03-31T18:35:24-07:00" level=warning msg="\"/run/user/3019658\" directory set by $XDG_RUNTIME_DIR does not exist. Either create the directory or unset $XDG_RUNTIME_DIR.: stat /run/user/3019658: no such file or directory: Trying to pull image in the event that it is a public image."
I0331 18:35:27.010462 47712747653824 run_deepvariant.py:241] Re-using the directory for intermediate results in /tmp/tmpmglc7_7t
I0331 18:35:39.745799 47929040407232 make_examples.py:386] ReadRequirements are: min_mapping_quality: 10
min_base_quality: 10
min_base_quality_mode: ENFORCED_BY_CLIENT

I0331 18:35:39.846203 47929040407232 genomics_reader.py:223] Reading /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam with NativeSamReader
I0331 18:35:39.857604 47929040407232 make_examples.py:535] Preparing inputs
I0331 18:35:39.897673 47929040407232 genomics_reader.py:223] Reading /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam with NativeSamReader
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1510, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/absl_py/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/absl_py/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1500, in main
    make_examples_runner(options)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1358, in make_examples_runner
    regions = processing_regions_from_options(options)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1279, in processing_regions_from_options
    options.min_shared_contigs_basepairs)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 647, in _ensure_consistent_contigs
    min_coverage_fraction)
  File "/tmp/Bazel.runfiles_0muv5ovo/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 725, in validate_reference_contig_coverage
    ref_bp, common_bp, coverage, format_contig_matches()))
ValueError: Reference contigs span 102092263 bases but only 0 bases (0.00%) were found in common among our input files. Check that the sources were created on a common genome reference build. Contig matches were: 
"chrV_pilon" is 21243235 bp and IS MISSING, 
"chrX_pilon" is 18110855 bp and IS MISSING, 
"chrIV_pilon" is 17759200 bp and IS MISSING, 
"chrII_pilon" is 15525148 bp and IS MISSING, 
"chrI_pilon" is 15331301 bp and IS MISSING, 
"chrIII_pilon" is 14108536 bp and IS MISSING, 
"chrM_pilon" is 13988 bp and IS MISSING. Here is a useful article about different human genome reference builds:
https://gatkforums.broadinstitute.org/gatk/discussion/11010/human-genome-reference-builds-grch38-hg38-b37-hg19
Please make sure the --ref input matches the build used for the input in --reads.
parallel: This job failed:
/opt/deepvariant/bin/make_examples --mode calling --ref /scratch/moldach/bin/DEEPVARIANT/MADDOG/c_elegans.PRJEB28388.WS274.genomic.fa --reads /scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam --examples /tmp/tmpmglc7_7t/make_examples.tfrecord@1.gz --gvcf /tmp/tmpmglc7_7t/gvcf.tfrecord@1.gz --task 0

real    0m13.428s
user    0m3.470s
sys 0m3.116s
I0331 18:35:40.443594 47712747653824 run_deepvariant.py:321] None
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 332, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 319, in main
    subprocess.check_call(command, shell=True, executable='/bin/bash')
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time seq 0 0 | parallel --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/scratch/moldach/bin/DEEPVARIANT/MADDOG/c_elegans.PRJEB28388.WS274.genomic.fa" --reads "/scratch/moldach/bin/DEEPVARIANT/MADDOG/maddog_bam_trim_bwaMEM_sort_dedupped.bam" --examples "/tmp/tmpmglc7_7t/make_examples.tfrecord@1.gz" --gvcf "/tmp/tmpmglc7_7t/gvcf.tfrecord@1.gz" --task {}' returned non-zero exit status 1.
AndrewCarroll commented 4 years ago

Hi @moldach,

The error message indicates that the sequence contig names present in the reference genome don't match those in the BAM file.

It would be useful to know the names and lengths of the contigs in the BAM header. Are you able to provide the output of this command:

samtools view -H maddog_bam_trim_bwaMEM_sort_dedupped.bam | grep \@SQ

Thank you

moldach commented 4 years ago

Hi @AndrewCarroll the output for samtools view -H maddog_bam_trim_bwaMEM_sort_dedupped.bam | grep @SQ is

@SQ     SN:I    LN:15072434
@SQ     SN:II   LN:15279421
@SQ     SN:III  LN:13783801
@SQ     SN:IV   LN:17493829
@SQ     SN:V    LN:20924180
@SQ     SN:X    LN:17718942
@SQ     SN:MtDNA        LN:13794

The alignment was done by a summer student working under the guidance of a post-doc (she recently had a baby so I didn't want to bother her) and it's possible a different reference version was used.

In the error above I had used c_elegans.PRJEB28388.WS274.genomic.fa.

We have another reference genome on our server so I tried it in-case this was the issue. Using c_elegans.PRJNA13758.WS265.genomic.fa deepvariant ran for more than 3 hours and then failed on Chromosome V with this error:

...
I0402 20:41:44.332051 47425001081536 make_examples.py:535] 25500 candidates (27904 examples) [26.82s elapsed]
I0402 20:42:04.004627 47425001081536 make_examples.py:535] 25600 candidates (28010 examples) [19.67s elapsed]
I0402 20:42:27.991226 47425001081536 make_examples.py:535] 25700 candidates (28130 examples) [23.99s elapsed]
I0402 20:42:35.967661 47425001081536 make_examples.py:535] 25813 candidates (28251 examples) [7.98s elapsed]
I0402 20:42:48.188316 47425001081536 make_examples.py:535] 25911 candidates (28355 examples) [12.22s elapsed]
I0402 20:42:49.405055 47425001081536 make_examples.py:535] 26014 candidates (28458 examples) [1.22s elapsed]
[E::fai_retrieve] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?)
2020-04-02 20:46:28.318323: F ./third_party/nucleus/vendor/statusor.h:231] Non-OK-status: status_ status: Invalid argument: Couldn't fetch bases for reference_name: "V" start: 5524980 end: 5526020
Fatal Python error: Aborted

Current thread 0x00002b21fe57cac0 (most recent call first):
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/realigner/window_selector.py", line 73 in _candidates_from_reads
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/realigner/window_selector.py", line 237 in select_windows
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/realigner/realigner.py", line 574 in realign_reads
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1129 in region_reads
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1055 in process
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1377 in make_examples_runner
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1500 in main
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/absl_py/absl/app.py", line 251 in _run_main
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/absl_py/absl/app.py", line 300 in run
  File "/tmp/Bazel.runfiles_e68bsnf0/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1510 in <module>
parallel: This job failed:
/opt/deepvariant/bin/make_examples --mode calling --ref /scratch/moldach/bin/DEEPVARIANT/Wormbase/c_elegans.PRJNA13758.WS265.genomic.fa --reads /scratch/moldach/bin/DEEPVARIANT/Wormbase/maddog_bam_trim_bwaMEM_sort_dedupped.bam --examples /tmp/tmp5lcsp7ms/make_examples.tfrecord@1.gz --gvcf /tmp/tmp5lcsp7ms/gvcf.tfrecord@1.gz --task 0

real    81m0.387s
user    66m46.794s
sys     0m15.672s
I0402 20:46:31.181916 47643471256256 run_deepvariant.py:321] None
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 332, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 319, in main
    subprocess.check_call(command, shell=True, executable='/bin/bash')
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time seq 0 0 | parallel --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/scratch/moldach/bin/DEEPVARIANT/Wormbase/c_elegans.PRJNA13758.WS265.genomic.fa" --reads "/scratch/moldach/bin/DEEPVARIANT/Wormbase/maddog_bam_trim_bwaMEM_sort_dedupped.bam" --examples "/tmp/tmp5lcsp7ms/make_examples.tfrecord@1.gz" --gvcf "/tmp/tmp5lcsp7ms/gvcf.tfrecord@1.gz" --task {}' returned non-zero exit status 250.

Any idea why this could be happening (still wrong reference possibly?). Should we keep debugging this or would the safest thing be to re-run the alignment on a known ref and then try again? Thanks for your help

AndrewCarroll commented 4 years ago

Hi @moldach

Thank you for the header data. The chromosome names in the BAM (I, I, III, IV, V, X, MtDNA) differ from those in your original reference file (chrI_pilon, chrII_pilon), etc... This is what is causing the first error, DeepVariant requires chromosome names to match in order to run.

The second error is a bit harder to diagnose, but I would guess this is because the sequence of your reference genome does not extend as far as some of the mapping positions in the BAM file. My suspicion is that the reference length is mismatched, but it might be possible that the reference sequence ends in real genomic DNA that reads are mapping to (and then extending beyond the reference length) and this causes DeepVariant to fail (I haven't seen this before, but it could be a plausible behavior).

I think the safest thing is to remap the reads to what you know for sure is the same reference you will call on, and let me know if this issue remains.

Thanks, Andrew

bilgehannevruz commented 4 years ago

Hello, I had a similar issue

since the exit code is 5, I tried to fix with chmod +644 or other codes however it didn't work.

My reference is different vertebrate viruses that I merged via cat And reads are, not mapped reads to human.Then mapped to viral genome.

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1461, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1451, in main
    make_examples_runner(options)
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1341, in make_examples_runner
    candidates, examples, gvcfs = region_processor.process(region)
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1041, in process
    for example in self.create_pileup_examples(candidate)
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 1141, in create_pileup_examples
    pileup_images = self.pic.create_pileup_images(dv_call)
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/pileup_image.py", line 345, in create_pileup_images
    return [_make_one(alts) for alts in self._alt_allele_combinations(variant)]
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/pileup_image.py", line 342, in _make_one
    image = self.build_pileup(dv_call, ref, reads, alt_alleles)
  File "/tmp/Bazel.runfiles_6B0qe2/runfiles/com_google_deepvariant/deepvariant/pileup_image.py", line 277, in build_pileup
    dv_call.variant)
ValueError: ('center of refbases doesnt match variant.refbases', 110, 'C', reference_bases: "T"
alternate_bases: "C"
calls {
  info {
    key: "AD"
    value {
      values {
        int_value: 0
      }
      values {
        int_value: 3
      }
    }
  }
  info {
    key: "DP"
    value {
      values {
        int_value: 3
      }
    }
  }
  info {
    key: "VAF"
    value {
      values {
        number_value: 1.0
      }
    }
  }
  genotype: -1
  genotype: -1
  call_set_name: "default"
}
end: 3002
reference_name: "NC_022518.1"
start: 3001
)

real    0m28.014s
user    1m56.205s
sys     0m16.675s
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 317, in <module>
    app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 307, in main
    subprocess.check_call(command, shell=True, executable='/bin/bash')
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time seq 0 4 | parallel -k --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/input/03.reference_files/viral_db_v2.fa" --reads "/input/CL100135515_L02_unmap.bam" --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@5.gz" --task {}' returned non-zero exit status 5
MariaNattestad commented 4 years ago

Hi @bilgehannevruz

Did you also have a different reference passed to DeepVariant than the one you aligned the reads against? If you have different references for mapping and passed to DeepVariant, please make sure to use the same reference. Otherwise, can you tell me more about what your reference looks like?

Thanks! Maria

bilgehannevruz commented 4 years ago

Hi @MariaNattestad ,

I found the problem that it was related with ".bam" file. Alignment process wasn't complete for related file. For other files, process works as expected.

Thanks. Bilgehan

MariaNattestad commented 4 years ago

@bilgehannevruz okay great, thanks for the update!

moldach commented 4 years ago

Hey @AndrewCarroll I'm happy to announce that the issue with the BAM/REF mix-up is what was causing the issue as well