google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

DeepTrio Quickstart Test Command error #632

Closed by ivanwilliammd 1 year ago

ivanwilliammd commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes

Describe the issue: I am having problems running the DeepTrio example at https://github.com/google/deepvariant/blob/r1.5/docs/deeptrio-wgs-case-study.md

Setup

Steps to reproduce:

The command below always fails with Error: The directory "/reference/GRCh38_no_alt_analysis_set.sdf" already exists. Please remove it first or choose a different directory. This happens even after I make sure that no GRCh38_no_alt_analysis_set.sdf exists in that directory:

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/reference":"/reference" \
  realtimegenomics/rtg-tools format \
  -o /reference/GRCh38_no_alt_analysis_set.sdf "/reference/GRCh38_no_alt_analysis_set.fasta"

In addition, the following command fails with a different error, Error: An IO problem occurred: "Not in GZIP format":

sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/output":"/output" \
realtimegenomics/rtg-tools mendelian \
-i "/output/HG002_trio_merged.vcf.gz" \
-o "/output/HG002_trio_annotated.output.vcf.gz" \
--pedigree=/reference/trio.ped \
-t /reference/GRCh38_no_alt_analysis_set.sdf \
| tee output/deepvariant.input_rtg_output.txt

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md. Is there any way to reproduce the issue by using the quick start? The quick start for single-sample variant analysis runs fine.

Any additional context:

akolesnikov commented 1 year ago

Hi @ivanwilliammd,

realtimegenomics/rtg-tools is a third-party tool: https://github.com/RealTimeGenomics/rtg-tools

What is the content of your ${PWD}/reference directory?

ivanwilliammd commented 1 year ago

Hi @akolesnikov

Thanks for your reply

Here is the output of ls -l ${PWD}/reference:

total 3070560
-rw-rw-r-- 1 ivanwilliamharsono ivanwilliamharsono 3144230986 Apr 16 22:28 GRCh38_no_alt_analysis_set.fasta
-rw-rw-r-- 1 ivanwilliamharsono ivanwilliamharsono       7804 Apr 16 22:28 GRCh38_no_alt_analysis_set.fasta.fai
drwxr-xr-x 2 root               root                     4096 Apr 17 07:41 GRCh38_no_alt_analysis_set.sdf
-rw-rw-r-- 1 ivanwilliamharsono ivanwilliamharsono        253 Apr 17 08:08 trio.ped

Every time I start, I make sure to remove the GRCh38_no_alt_analysis_set.sdf directory first.
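
The listing above shows the .sdf directory owned by root, which is what a container started with sudo docker run typically leaves behind, so removing it as a normal user may not fully succeed. A minimal sketch of a pre-flight cleanup follows; the helper name remove_stale_sdf is hypothetical, and the sudo fallback is an assumption about the host setup rather than something confirmed in this thread.

```shell
# Hypothetical helper (not part of rtg-tools or DeepVariant):
# remove a leftover SDF directory and confirm that it is really gone.
remove_stale_sdf() {
  sdf_dir="$1"
  if [ -d "$sdf_dir" ]; then
    # Try as the current user first; fall back to sudo for root-owned
    # directories created by a container (assumption: sudo is available).
    rm -rf "$sdf_dir" 2>/dev/null || sudo rm -rf "$sdf_dir"
  fi
  # Return success only if nothing is left at that path.
  [ ! -e "$sdf_dir" ]
}
```

It could be called as remove_stale_sdf "${PWD}/reference/GRCh38_no_alt_analysis_set.sdf" right before the rtg-tools format command.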

ivanwilliammd commented 1 year ago

Update: the rtg-tools format step has just started working, using the following command:

docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/reference":"/reference" \
  realtimegenomics/rtg-tools format \
  -o /reference/GRCh38_no_alt_analysis_set.sdf "/reference/GRCh38_no_alt_analysis_set.fasta"

And this is the result

Formatting FASTA data
Processing "/reference/GRCh38_no_alt_analysis_set.fasta"

Detected: 'Human GRCh38 with UCSC naming', installing reference.txt

Input Data
Files              : GRCh38_no_alt_analysis_set.fasta
Format             : FASTA
Type               : DNA
Number of sequences: 195
Total residues     : 3099922541
Minimum length     : 970
Mean length        : 15897038
Maximum length     : 248956422

Output Data
SDF-ID             : 809c9a82-d8d5-477a-865b-772d28741815
Number of sequences: 195
Total residues     : 3099922541
Minimum length     : 970
Mean length        : 15897038
Maximum length     : 248956422

However, rtg-tools mendelian still fails with Error: An IO problem occurred: "Not in GZIP format" when I run the following command:

docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/output":"/output" \
realtimegenomics/rtg-tools mendelian \
-i "/output/HG002_trio_merged.vcf.gz" \
-o "/output/HG002_trio_annotated.output.vcf.gz" \
--pedigree=/reference/trio.ped \
-t /reference/GRCh38_no_alt_analysis_set.sdf \
| tee output/deepvariant.input_rtg_output.txt

I looked for the faulty data, and it seems that the following GLnexus VCF merge command produces a corrupted HG002_trio_merged.vcf.gz:

docker run \
  -v "${PWD}/output":"/output" \
  quay.io/mlin/glnexus:v1.2.7 \
  /usr/local/bin/glnexus_cli \
  --config DeepVariant_unfiltered \
  /output/HG002.g.vcf.gz \
  /output/HG003.g.vcf.gz \
  /output/HG004.g.vcf.gz \
  | docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}-gpu" \
    bcftools view - \
  | docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}-gpu" \
    bgzip -c > output/HG002_trio_merged.vcf.gz
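
A quick way to confirm whether a file such as HG002_trio_merged.vcf.gz is actually gzip-compressed is to look at its first two bytes, which are 0x1f 0x8b for any gzip (and therefore bgzip) stream. The sketch below assumes a POSIX-like shell with head, od and tr available; the function name is_gzip is hypothetical.

```shell
# Hypothetical helper: succeed if the file starts with the gzip magic bytes.
is_gzip() {
  # Read the first two bytes and print them as hex without spaces.
  magic="$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "1f8b" ]
}
```

For example, is_gzip output/HG002_trio_merged.vcf.gz || echo "not gzip" flags a file that would trigger the "Not in GZIP format" error. This only checks the header; gzip -t can additionally verify the integrity of the whole stream.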

ivanwilliammd commented 1 year ago

Solved the problem: piping the output through bcftools in the DeepTrio Docker image causes the error.

Installing bcftools locally and using the following script works. I hope this helps improve the pipeline.

docker run \
  -v "${PWD}/output":"/output" \
  quay.io/mlin/glnexus:v1.2.7 \
  /usr/local/bin/glnexus_cli \
  --config DeepVariant_unfiltered \
  /output/HG002.g.vcf.gz \
  /output/HG003.g.vcf.gz \
  /output/HG004.g.vcf.gz \
  | bcftools view -Oz -o "${PWD}/output/HG002_trio_merged.vcf.gz"

pichuan commented 1 year ago

Thanks for the update, @ivanwilliammd!