Closed tsackton closed 2 years ago
Is there any downside to option #2? Especially in cluster environments, it seems like this will be a common issue.
Only in that there could be performance impacts of using a networked scratch drive as opposed to local attached storage. But, if someone has a big enough direct attached scratch they can just specify that in the config, I guess.
I have occasionally seen large datasets run out of temporary directory space during the sort and gather VCF step. After some digging, I believe this is because the Java tmp dir defaults to /tmp, and if you happen to be running on a node without much space in /tmp, you'll have issues.
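To confirm that a node is short on temp space before a run, a quick check along these lines may help. This is only a sketch: `free_kib` is a hypothetical helper name, not something in the pipeline, and it assumes a POSIX `df` and `awk` are available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the free space, in KiB, on the filesystem
# that holds the given directory (defaults to /tmp).
free_kib() {
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the "Available" figure.
    df -Pk "${1:-/tmp}" | awk 'NR==2 {print $4}'
}

free_kib /tmp
```

Running the same function against a candidate scratch directory gives a rough comparison of the two locations.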
The Java tmp directory can be specified either with the _JAVA_OPTIONS environment variable or as part of the command line, so we have at least two options to address this.
1. Tell users to add
export _JAVA_OPTIONS=-Djava.io.tmpdir=/new/tmp/dir
to the run_pipeline.sh script if they are running on a large dataset that might have issues, or add this ourselves to the default run script.
2. Add
-Djava.io.tmpdir=/path/to/tmpdir
to the gatk commands where temp space may be an issue, pulling the tmp dir location from the config file.
There are probably other options as well. Thoughts?
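For illustration, both approaches could be sketched roughly as follows. The helper name `java_tmp_opt`, the `PIPELINE_TMPDIR` variable, and the scratch paths are assumptions made for this example and are not part of the existing pipeline. The commented `gatk` line uses the wrapper's `--java-options` argument, with SortVcf as a stand-in for whichever step actually runs out of space.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JVM temp-dir flag from a configured path.
# Falls back to $PIPELINE_TMPDIR (an assumed variable name), then to /tmp.
java_tmp_opt() {
    local tmp_dir="${1:-${PIPELINE_TMPDIR:-/tmp}}"
    # Make sure the directory exists before handing it to the JVM.
    mkdir -p "$tmp_dir" || return 1
    printf -- '-Djava.io.tmpdir=%s' "$tmp_dir"
}

# Option 1: set it globally for every JVM started from this shell.
# export _JAVA_OPTIONS="$(java_tmp_opt "/scratch/$USER/tmp")"

# Option 2: pass it only to the GATK step that needs it.
# gatk --java-options "$(java_tmp_opt "/scratch/$USER/tmp")" SortVcf -I in.vcf -O sorted.vcf
```

The second form keeps the setting scoped to individual commands, so the location can come from the config file without affecting other Java tools launched in the same job.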