google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

postprocess_variants: Found multiple file patterns in input filename space #818

Closed MiWitt closed 5 months ago

MiWitt commented 6 months ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6.1/docs/FAQ.md:

Describe the issue: The postprocess_variants step fails with the following error message: ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

Setup

Steps to reproduce:

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-quick-start.md. Is there any way to reproduce the issue by using the quick start? ???

Any additional context: Yes. If I change the "--infile" parameter of the postprocess_variants.py call from "./call_variants_output.tfrecord.gz" to "./call_variants_output@1.tfrecord.gz", it works. However, the postprocess_variants.py call is auto-generated by "/opt/deepvariant/bin/run_deepvariant". The error does not occur for every sample ...

Directory content of intermediate_results_dir after the error occurred:

call_variants.log
call_variants_output-00000-of-00001.tfrecord.gz
gvcf.tfrecord-00000-of-00008.gz
gvcf.tfrecord-00001-of-00008.gz
gvcf.tfrecord-00002-of-00008.gz
gvcf.tfrecord-00003-of-00008.gz
gvcf.tfrecord-00004-of-00008.gz
gvcf.tfrecord-00005-of-00008.gz
gvcf.tfrecord-00006-of-00008.gz
gvcf.tfrecord-00007-of-00008.gz
make_examples.log
make_examples.tfrecord-00000-of-00008.gz
make_examples.tfrecord-00000-of-00008.gz.example_info.json
make_examples.tfrecord-00001-of-00008.gz
make_examples.tfrecord-00001-of-00008.gz.example_info.json
make_examples.tfrecord-00002-of-00008.gz
make_examples.tfrecord-00002-of-00008.gz.example_info.json
make_examples.tfrecord-00003-of-00008.gz
make_examples.tfrecord-00003-of-00008.gz.example_info.json
make_examples.tfrecord-00004-of-00008.gz
make_examples.tfrecord-00004-of-00008.gz.example_info.json
make_examples.tfrecord-00005-of-00008.gz
make_examples.tfrecord-00005-of-00008.gz.example_info.json
make_examples.tfrecord-00006-of-00008.gz
make_examples.tfrecord-00006-of-00008.gz.example_info.json
make_examples.tfrecord-00007-of-00008.gz
make_examples.tfrecord-00007-of-00008.gz.example_info.json
postprocess_variants.log

kishwarshafin commented 6 months ago

@MiWitt , can you please send the full command for each step here? It seems like you have 8 files and you are setting @1?

MiWitt commented 6 months ago

I do not run it step by step. I run "run_deepvariant". This is my command:

 singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=${THEREF} \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.

I have now added the following command, which is a workaround for the problem ...

    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref="${THEREF}" \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi

In the end, this workaround sets --infile to "./call_variants_output@1.tfrecord.gz" and --nonvariant_site_tfrecord_path to "./gvcf.tfrecord@8.gz" (see the directory listing above).
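
A minimal sketch of a variant of this workaround, assuming the shard files follow the NAME-00000-of-NNNNN pattern visible in the directory listing. Instead of counting files with ls | wc -l, it reads the shard total from the name of shard 0, so unrelated files that happen to match the glob do not change the result. The helper name shard_spec is hypothetical and not part of DeepVariant.

```shell
# Hypothetical helper (not part of DeepVariant).
# Usage: shard_spec DIR PREFIX SUFFIX
# Prints a spec such as "./call_variants_output@1.tfrecord.gz".
# The body runs in a subshell so its variables do not leak into the caller.
shard_spec() (
  dir=$1 prefix=$2 suffix=$3
  for first in "$dir/$prefix"-00000-of-*"$suffix"; do
    [ -e "$first" ] || exit 1                        # glob matched nothing
    total=${first##*-of-}                            # e.g. "00008.gz"
    total=${total%"$suffix"}                         # e.g. "00008"
    total=$(printf '%s' "$total" | sed 's/^0*//')    # strip zero padding
    printf '%s/%s@%s%s\n' "$dir" "$prefix" "${total:-0}" "$suffix"
    exit 0
  done
)
```

In the workaround above, the two arguments could then be written as --infile "$(shard_spec . call_variants_output .tfrecord.gz)" and --nonvariant_site_tfrecord_path "$(shard_spec . gvcf.tfrecord .gz)".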

MiWitt commented 6 months ago

I was able to extract the three commands (make_examples, call_variants and postprocess_variants) from the output. Here they are:

seq 0 7 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "stdchroms.hg38.fa" --reads "SAMPLENAME.bam" --examples "./make_examples.tfrecord@8.gz" --add_hp_channel --alt_aligned_pileup "diff_channels" --gvcf "./gvcf.tfrecord@8.gz" --max_reads_per_partition "600" --min_mapping_quality "1" --parse_sam_aux_fields --partition_size "25000" --phase_reads --pileup_image_width "199" --norealign_reads --sample_name "SAMPLENAME" --sort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels "0.12" --task {}

/opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

And here are the last two commands with their stdout ...

***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
I0510 12:13:42.483308 47501039724352 call_variants.py:563] Total 1 writing processes started.
I0510 12:13:42.487790 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:42.487916 47501039724352 call_variants.py:588] Shape of input examples: [100, 199, 9]
I0510 12:13:42.488451 47501039724352 call_variants.py:592] Use saved model: True
I0510 12:13:52.162126 47501039724352 dv_utils.py:370] From /opt/models/pacbio/example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:52.163805 47501039724352 dv_utils.py:370] From ./make_examples.tfrecord-00000-of-00008.gz.example_info.json: Shape of input examples: [100, 199, 9], Channels of input examples: [1, 2, 3, 4, 5, 6, 7, 9, 10].
I0510 12:13:56.551032 47501039724352 call_variants.py:716] Predicted 982 examples in 1 batches [0.419 sec per 100].
I0510 12:13:57.403082 47501039724352 call_variants.py:779] Complete: call_variants.

real    0m21.581s
user    1m40.583s
sys 0m15.744s

***** Running the command:*****
time /opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz" --cpus "8" --gvcf_outfile "./SAMPLENAME.deepVariant.g.vcf.gz" --nonvariant_site_tfrecord_path "./gvcf.tfrecord@8.gz" --sample_name "SAMPLENAME"

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1419, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1300, in main
    sample_name = get_sample_name()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1203, in get_sample_name
    _, record = get_cvo_paths_and_first_record()
  File "/tmp/Bazel.runfiles_0t8uq2zt/runfiles/com_google_deepvariant/deepvariant/postprocess_variants.py", line 1179, in get_cvo_paths_and_first_record
    raise ValueError(
ValueError: ('Found multiple file patterns in input filename space: ', './call_variants_output.tfrecord.gz')

real    0m4.925s
user    0m8.815s
sys 0m7.379s

kishwarshafin commented 6 months ago

@MiWitt ,

Given that you are using --intermediate_results_dir . which writes all intermediate files to your current directory, running the same command multiple times will create multiple file patterns there. Can you please create a clean intermediate directory and pass it as --intermediate_results_dir /path/to/intermediate_dir? That should resolve the issue.
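
A rough sketch of one way to get a clean directory per run, assuming mktemp is available and that the chosen base directory is visible inside the container. The helper name fresh_intermediate_dir and the dv_intermediate_ prefix are hypothetical.

```shell
# Hypothetical helper (not part of DeepVariant).
# Usage: fresh_intermediate_dir [BASE] [SAMPLE]
# Creates a new, empty, uniquely named directory under BASE and prints its path.
# BASE defaults to $TMPDIR, then /tmp; SAMPLE only makes the name readable.
fresh_intermediate_dir() (
  base=${1:-${TMPDIR:-/tmp}}
  sample=${2:-sample}
  mktemp -d "$base/dv_intermediate_${sample}.XXXXXX"
)

# Usage sketch:
#   INTERMEDIATE_DIR=$(fresh_intermediate_dir "$TMPDIR" "$ALIGNMENTNAME")
#   ... run_deepvariant ... --intermediate_results_dir "$INTERMEDIATE_DIR"
```

Because mktemp creates a new directory on every call, repeated runs of the same job script cannot see each other's intermediate files.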

MiWitt commented 6 months ago

That cannot be the cause. I am working in a cluster environment using Slurm, and the directory "." is the job-specific scratch directory, which is located at "/scratch/SlurmTMP/JobSpecificFolder" (${TMPDIR}).


cd ${TMPDIR}
BIN_VERSION="1.6.1"
module load singularity/3.5.2

#####################################################################
# singularity pull docker://google/deepvariant:"${BIN_VERSION}"

ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150

#  --model_type=PACBIO \ ##Replace this string with exactly one of the following [WGS,WES,PACBIO,HYBRID_PACBIO_ILLUMINA]**
#  docker://google/deepvariant:"${BIN_VERSION}" \

if ! [ -f "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
then
  cp "${THEREF}"* ./
  cp "${WORKINDIR}/${ALIGNMENTNAME}.bam"* .
  chmod 666 `basename "${THEREF}"`*
  chmod 666 "${ALIGNMENTNAME}.bam"*
  singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=PACBIO \
    --ref=`basename "${THEREF}"` \
    --reads="${ALIGNMENTNAME}.bam" \
    --sample_name=${SAMPLENAME} \
    --output_vcf="./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
    --output_gvcf="./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
    --intermediate_results_dir . \
    --num_shards=8 \
    --logging_dir=.

    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref=`basename "${THEREF}"` \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi
    cp *.log ${WORKINDIR}/
    cp "./${ALIGNMENTNAME}.deepVariant.vcf.gz"* ${WORKINDIR}/
else
 cp "${WORKINDIR}/${ALIGNMENTNAME}.deepVariant.vcf.gz"* .
fi

kishwarshafin commented 6 months ago

@MiWitt ,

Can you use --intermediate_results_dir ./intermediate_results_${ALIGNMENTNAME}? I am unsure why you are running postprocessing separately, but something must be overwriting the files or generating multiple file patterns in the same directory where you are saving everything. One way to debug this is to set --dry_run=true for each command, look at the outputs, and check whether they match each other. Unfortunately, I don't have access to an HPC to replicate this issue. I tried running your script, but it has many missing variables.
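
One way to do that comparison, sketched under the assumption that the printed commands quote flag values with double quotes, as in the commands posted earlier in this thread. The helper name flag_value is hypothetical.

```shell
# Hypothetical helper: read one command line on stdin and print the
# double-quoted value that follows the given flag.
flag_value() {
  sed -n "s/.*$1 \"\([^\"]*\)\".*/\1/p"
}

# The two commands as printed earlier in this thread (postprocess shortened).
cv_cmd='/opt/deepvariant/bin/call_variants --outfile "./call_variants_output.tfrecord.gz" --examples "./make_examples.tfrecord@8.gz" --checkpoint "/opt/models/pacbio"'
pp_cmd='/opt/deepvariant/bin/postprocess_variants --ref "stdchroms.hg38.fa" --infile "./call_variants_output.tfrecord.gz" --outfile "./SAMPLENAME.deepVariant.vcf.gz"'

cv_out=$(printf '%s\n' "$cv_cmd" | flag_value --outfile)
pp_in=$(printf '%s\n' "$pp_cmd" | flag_value --infile)

if [ "$cv_out" = "$pp_in" ]; then
  echo "call_variants --outfile and postprocess_variants --infile agree: $cv_out"
else
  echo "mismatch: $cv_out vs $pp_in"
fi
```

For the commands in this thread both values are "./call_variants_output.tfrecord.gz", while the file actually present in the directory listing is call_variants_output-00000-of-00001.tfrecord.gz.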

kishwarshafin commented 6 months ago

@MiWitt

Hi, do you have any updates on this issue?

kishwarshafin commented 5 months ago

@MiWitt , I am closing the issue due to inactivity. Please feel free to reopen if you have any updates.

EgorGuga commented 1 month ago

@kishwarshafin I still have the same problem.

EgorGuga commented 1 month ago

Something is wrong in get_cvo_paths_and_first_record(): it cannot properly parse call_variants_output-00000-of-00001.tfrecord.gz. Maybe run_deepvariant.py also needs to be changed (at least in Docker) so that the multiprocessing in postprocess_variants is used properly.

MiWitt commented 1 month ago

@EgorGuga You can use my workaround above, which solved the problem for me: if the --output_vcf from /opt/deepvariant/bin/run_deepvariant does not exist, run /opt/deepvariant/bin/postprocess_variants in a separate step.

    if ! [ -f "./${ALIGNMENTNAME}.deepVariant.vcf.gz" ]
    then
       singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
         /my/path/software/deepVariant/deepvariant_${BIN_VERSION}.sif \
         /opt/deepvariant/bin/postprocess_variants \
         --ref=`basename "${THEREF}"` \
         --infile "./call_variants_output@$(ls ./call_variants_output*.tfrecord.gz | wc -l).tfrecord.gz" \
         --outfile "./${ALIGNMENTNAME}.deepVariant.vcf.gz" \
         --cpus "8" \
         --gvcf_outfile "./${ALIGNMENTNAME}.deepVariant.g.vcf.gz" \
         --nonvariant_site_tfrecord_path "./gvcf.tfrecord@$(ls ./gvcf.tfrecord*.gz | wc -l).gz" \
         --sample_name=${SAMPLENAME}
    fi

EgorGuga commented 1 month ago

@MiWitt, yes, thanks for that solution. I did a similar thing in the run_deepvariant script.