google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

DeepVariant 1.3 writes to TMPDIR even if intermediate file dir is set #524

Closed edg1983 closed 2 years ago

edg1983 commented 2 years ago

Hello,

I'm using DeepVariant docker container v1.3 to call variants using the run_deepvariant command.

What I've done in the past to manage temp files was to create a temp_dir in the working directory and then use --intermediate_results_dir temp_dir to make DeepVariant write temp files in this custom location.

However, the same approach no longer works for me on the new HPC system: on compute nodes, the default temp folder stored in $TMPDIR is set to a special space, /localscratch, which apparently is not among the paths that Docker or Singularity mount automatically (such as /tmp). I realized that, in addition to the intermediate files written to --intermediate_results_dir, DeepVariant writes some additional temp files to the default temp dir location ($TMPDIR), and this created some issues when running it in pipelines (like Nextflow).

I've created a workaround by manually setting $TMPDIR in the sh script so that it points to another folder in the work directory, and I can see a bunch of small files created in there (~30 MB total), like the following:

Bazel.runfiles_6nvtcv_j  __pycache__  tmp8rz89h3g.py  tmpglc9d5x3.py  tmph9ntzkbx

I wonder what kind of files are written to $TMPDIR, and whether it's possible to redirect them with a command-line option without having to set $TMPDIR.
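For context, many tools pick their scratch location from $TMPDIR, which is why overriding the variable redirects such files. The thread does not establish which component of DeepVariant writes the files listed above, so the following is only a minimal shell sketch of the general mechanism, assuming a `mktemp` that honours $TMPDIR (GNU coreutils does). The directory name `demo_tmp` is hypothetical.

```shell
# Hypothetical scratch directory inside the current working directory.
demo_tmp="$PWD/demo_tmp"
mkdir -p "$demo_tmp"

# With TMPDIR set for this one command, mktemp creates its file
# under that directory instead of the default /tmp.
scratch_file=$(TMPDIR="$demo_tmp" mktemp)
echo "temp file created at: $scratch_file"
```

Listing the contents of `demo_tmp` afterwards should show the newly created file there.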

pichuan commented 2 years ago

Hi @edg1983 , sorry that it took a while for me to get to this. Can you tell me which temp files are written to TMPDIR? (If you know which part of the code, even better) Thank you!

edg1983 commented 2 years ago

Hi,

As reported above, I can see a bunch of small files (30-40 MB total) written to TMPDIR. If I look in the folder while DeepVariant is running, I can see files like these: Bazel.runfiles_6nvtcv_j __pycache__ tmp8rz89h3g.py tmpglc9d5x3.py tmph9ntzkbx

I will try another run to monitor exactly when they are created, but the job fails at a very early stage when I submit it to the cluster, so I assume they are written during make_examples, which I think is the first step in run_deepvariant.

Thanks for the support!

pichuan commented 2 years ago

Ah I see. Sorry I missed that part in your original message. And, I think I understand your question better now.

--intermediate_results_dir isn't designed to capture all temp files from DeepVariant. It's for capturing the intermediate outputs (from make_examples and call_variants) in case users need to reuse them later on.

In your case, using your workaround of setting TMPDIR actually makes sense to me. From your description, it also seems like it's related to your system setting. If you think this is going to be a common issue, please share your command and I'm happy to add it to our documentation as a workaround for other users.

edg1983 commented 2 years ago

I agree this issue is probably system-specific. It creates a problem when using the container in Nextflow, since Nextflow automatically configures a few folder bindings when it prepares the run: the working directory, the directories of files staged into the process as inputs, and the temp dir indicated by $TMPDIR. Since it prepares all the scripts in advance, $TMPDIR points to the standard /tmp location if I start Nextflow from a login node, while on my system it is set to a node-specific scratch space (/localscratch) when the job is submitted to a compute node by SLURM. Thus, I end up with the tmp dir not correctly mounted in the container.

I'm not sure how common such a configuration is, so maybe it's a problem affecting just me and a few others.
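A quick way to confirm this kind of mismatch is to check, from inside the job or container, whether the directory that $TMPDIR points to actually exists and is writable. This is a minimal diagnostic sketch, not something taken from the thread; the variable names `tmp_location` and `tmp_status` are hypothetical.

```shell
# Fall back to /tmp when TMPDIR is unset, which is the usual default.
tmp_location="${TMPDIR:-/tmp}"

# A directory that was not bound into the container will be missing,
# or present but not writable.
if [ -d "$tmp_location" ] && [ -w "$tmp_location" ]; then
  tmp_status="usable"
else
  tmp_status="missing or not writable"
fi

echo "TMPDIR resolves to $tmp_location ($tmp_status)"
```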

What I've done is to add a line like this before the actual run_deepvariant command in my script section in the Nextflow process: export TMPDIR="$PWD/tmp_dir"

This overwrites the original variable and sets TMPDIR to a subfolder of the working directory. It works fine in this context since DeepVariant is the only operation running in the process, so changing TMPDIR does not interfere with anything else.
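Put together, the workaround might look like the sketch below. The thread does not show whether the directory was created beforehand, so the `mkdir -p` line is an assumption added here, because exporting TMPDIR does not by itself create the directory.

```shell
# Assumption: create the directory first, since export alone does not create it.
mkdir -p "$PWD/tmp_dir"

# Point TMPDIR at a subfolder of the working directory, as described above.
export TMPDIR="$PWD/tmp_dir"

# The run_deepvariant command would follow here, unchanged.
echo "TMPDIR is now $TMPDIR"
```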

pichuan commented 2 years ago

Thank you @edg1983

I plan to add this section to our FAQ.md:


Singularity related questions:

TMPDIR

If your run with Singularity is having issues with TMPDIR, try adding this to your command:

export TMPDIR="$PWD/tmp_dir"

See https://github.com/google/deepvariant/issues/524#issuecomment-1067597987.


This should show up in our next release. Thanks for providing this information! If you have more suggestions, let me know. I'll close this issue for now.

splaisan commented 2 years ago

About this tmp_dir location: when running the Docker demo command in the current folder,

do I set TMPDIR to a physical folder that I create on the host?

export TMPDIR="$PWD/tmp_dir"

Or to a folder inside the container, obtained by creating a folder on the host side and mounting it as /tmp_dir with -v, as shown below?

export TMPDIR="/tmp_dir"

My current command includes --intermediate_results_dir /temp_dir, which apparently is not working yet:

export TMPDIR=<choose from above>

for bam in input/*_rawmappings_recal.bam; do

pfx=$(basename ${bam%_rawmappings_recal.bam})

BIN_VERSION="1.3.0"

sudo docker run \
  -v "$PWD/input":"/input" \
  -v "$PWD/output":"/output" \
  -v "$PWD/tmp_dir":"/tmp_dir" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/Gallus_gallus.GRCg6a.dna.toplevel.fa \
  --reads=/input/"${pfx}"_rawmappings_recal.bam \
  --output_vcf=/output/"${pfx}".vcf \
  --output_gvcf=/output/"${pfx}".g.vcf \
  --num_shards="${nthr}" \
  --intermediate_results_dir /temp_dir \
  --logging_dir=/output/"${pfx}"_logs \
  --dry_run=false

done

pichuan commented 2 years ago

Hi @splaisan, your question is about Docker, which I believe is a bit different from the earlier discussion.

Using --intermediate_results_dir suggests that you probably want to access the content there later. So I'd recommend writing it to a directory that you mounted with -v.

For example, given that you have -v "$PWD/tmp_dir":"/tmp_dir", maybe try setting --intermediate_results_dir /tmp_dir, which should write output to $PWD/tmp_dir once you're done? (I noticed you wrote --intermediate_results_dir /temp_dir, which was not actually mounted. Not sure if that's a typo or not.)
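A mismatch like /temp_dir versus /tmp_dir can be caught before launching the container by checking that the path falls under one of the container-side mount points. The following is only a sketch: `is_under_mount` is a hypothetical helper, not part of DeepVariant or Docker, and the mount list simply mirrors the -v options in the command above.

```shell
# Hypothetical helper: succeeds only if the first argument equals,
# or lies below, one of the mount points given as remaining arguments.
is_under_mount() {
  path=$1
  shift
  for mount in "$@"; do
    case "$path" in
      "$mount"|"$mount"/*) return 0 ;;
    esac
  done
  return 1
}

# Container-side mount points from the -v options in the command above.
if is_under_mount /tmp_dir /input /output /tmp_dir; then
  echo "/tmp_dir is under a mounted directory"
fi
if ! is_under_mount /temp_dir /input /output /tmp_dir; then
  echo "/temp_dir is NOT under a mounted directory"
fi
```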

Hope this helps.

I'm going to close this issue now.

splaisan commented 2 years ago

Thanks @pichuan, it is indeed a typo; I meant /tmp_dir. Nice catch, and thanks for your kind help!