harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Gather VCFs tmp dir space #51

Closed tsackton closed 2 years ago

tsackton commented 2 years ago

I have occasionally seen large datasets run out of temporary directory space during the sort and gather VCF step. After some digging, I believe this is because the Java tmp dir (java.io.tmpdir) defaults to /tmp on Linux, so if you happen to be running on a node without much space in /tmp, you'll have issues.

The Java tmp directory can be specified either with the _JAVA_OPTIONS environment variable or as part of the command line, so we have at least two options to address this.

  1. Add a note to the readme to have users add export _JAVA_OPTIONS=-Djava.io.tmpdir=/new/tmp/dir to the run_pipeline.sh script if they are running on a large dataset that might have issues, or add this ourselves to the default run script.
  2. Add -Djava.io.tmpdir=/path/to/tmpdir to the gatk commands where temp space may be an issue, pulling the tmp dir location from the config file.
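Option 2 could be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the `gatk_command` helper, the `tmp_dir` config key, the tool name, and the file names are all hypothetical placeholders. It relies on the `--java-options` argument of the GATK4 wrapper script to pass `-Djava.io.tmpdir` through to the JVM.

```python
import shlex


def gatk_command(tool, tool_args, tmp_dir=None, java_mem=None):
    """Build a gatk command line, optionally redirecting the JVM temp dir.

    tool      -- GATK tool name (placeholder, e.g. "SortVcf")
    tool_args -- list of arguments passed to the tool unchanged
    tmp_dir   -- if set, passed to the JVM as -Djava.io.tmpdir
    java_mem  -- if set, passed to the JVM as -Xmx (e.g. "4g")
    """
    java_opts = []
    if java_mem:
        java_opts.append(f"-Xmx{java_mem}")
    if tmp_dir:
        java_opts.append(f"-Djava.io.tmpdir={tmp_dir}")

    cmd = ["gatk"]
    if java_opts:
        # All JVM options travel as one argument to --java-options.
        cmd += ["--java-options", " ".join(java_opts)]
    cmd.append(tool)
    cmd += list(tool_args)
    # shlex.join quotes any argument that contains spaces.
    return shlex.join(cmd)


# Hypothetical config entry; the real key name would be up to the workflow.
config = {"tmp_dir": "/scratch/tmp"}

cmd = gatk_command(
    "SortVcf",
    ["-I", "in.vcf.gz", "-O", "sorted.vcf.gz"],
    tmp_dir=config.get("tmp_dir"),
)
print(cmd)
```

If the config key is left unset, the helper emits a plain `gatk` command and Java falls back to its default temp directory, so existing setups would keep their current behavior.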

There are probably other options as well. Thoughts?

erikenbody commented 2 years ago

Is there any downside to option #2? Especially in cluster environments, it seems like this will be a common issue.

tsackton commented 2 years ago

Only in that there could be a performance cost to using a networked scratch drive as opposed to locally attached storage. But if someone has a big enough direct-attached scratch, they can just specify that in the config, I guess.
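One way to act on that trade-off would be to prefer local storage when it has enough room and fall back to networked scratch otherwise. The sketch below is only an illustration under assumptions: `pick_tmp_dir`, the candidate paths, and the 50 GB threshold are hypothetical and would need tuning per dataset and cluster.

```python
import shutil


def pick_tmp_dir(candidates, min_free_bytes):
    """Return the first candidate directory with enough free space, else None.

    candidates     -- paths in order of preference (fastest storage first)
    min_free_bytes -- minimum free space required, in bytes
    """
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            # Path is missing or unreadable on this node; try the next one.
            continue
        if free >= min_free_bytes:
            return path
    return None


# Hypothetical preference order: local /tmp first, then a networked scratch.
chosen = pick_tmp_dir(["/tmp", "/scratch/tmp"], min_free_bytes=50 * 1024**3)
print(chosen)
```

A `None` result means no candidate had enough space, which is probably worth failing on early rather than partway through the sort.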