tyamaguchi-ucla opened this issue 2 years ago
@nkwang24 @yashpatel6 For multi-library samples, this approach would help, although it will take some time to implement. It would also be helpful to understand the `/scratch` space usage of `MarkDuplicatesSpark` compared to `samtools markdup`.
@tyamaguchi-ucla agreed. I wrote a script to periodically log the scratch use over the course of a metapipeline run, but as I commented in #229, I can't access the files generated by Spark. The best I've been able to do is correlate sample size with where in metapipeline the failures occur. Based on what I've gathered so far using the latest metapipeline PR, it looks like align-DNA lets samples of up to ~450Gb through. Of these, call-gSNP lets ~400Gb through.
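For reference, a minimal sketch of the kind of periodic scratch logger described above (the mount point, interval, and output path are assumptions). `shutil.disk_usage` reports filesystem-level usage, so it still works even when the Spark-generated files themselves aren't readable:

```python
#!/usr/bin/env python3
"""Sketch: periodically log /scratch usage during a metapipeline run."""
import shutil
import time
from datetime import datetime

SCRATCH = '/scratch'        # assumed mount point
INTERVAL_SECONDS = 60       # assumed polling interval
LOG_PATH = 'scratch_usage.tsv'

with open(LOG_PATH, 'w') as log:
    log.write('timestamp\ttotal_bytes\tused_bytes\tfree_bytes\n')
    while True:
        # Filesystem-level numbers, so unreadable Spark temp files are still counted.
        usage = shutil.disk_usage(SCRATCH)
        log.write(f'{datetime.now().isoformat()}\t{usage.total}\t{usage.used}\t{usage.free}\n')
        log.flush()
        time.sleep(INTERVAL_SECONDS)
```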
It's really hard to get a good idea of what's going on, as my tests have been somewhat inconsistent and confounded by stochastic node-level errors. Possible sources of inconsistencies:
2. @yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in the time it takes for processes to write their outputs to `/hot`? If the processes are asynchronous, this might affect how long the intermediate files stay in `/scratch`, and the next process might start filling up `/scratch` before they can be deleted.
We discussed this briefly in the metapipeline-DNA meeting but just to log the discussion here:
The intermediate file deletion within individual pipelines doesn't depend on the write to `/hot`: when intermediate file deletion is enabled, the deleted files are never written to `/hot`. Conversely, any output files that are written to `/hot` in that case aren't subject to the deletion process.
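To make that invariant concrete, here is an illustrative sketch (not the pipeline's actual code; the function and flag names are made up) of the two mutually exclusive cases described above:

```python
def route_output(is_intermediate: bool, delete_intermediates_enabled: bool) -> str:
    """Illustrative only: a file is either deleted from /scratch or published to /hot,
    never both, so publishing latency cannot race with intra-pipeline deletion."""
    if is_intermediate and delete_intermediates_enabled:
        return 'delete from /scratch'   # never copied to /hot
    return 'publish to /hot'            # never touched by the deletion process
```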
From an inter-pipeline deletion perspective, there's only one case where this happens: align-DNA output is deleted from `/scratch` once it's been copied to `/hot` and used by the first step of call-gSNP. This process could potentially be affected by latency: if the copy to `/hot` takes a long time, call-gSNP may continue while the deletion process is still waiting for the files to finish being copied. This specific case can actually be traced from the `.command.log` of the failing sample/patient by checking whether the deletion process had completed by the time the pipeline failed.
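A rough sketch of how that check could be scripted; the marker strings below are placeholders, since I'm not assuming any particular wording in the actual `.command.log`:

```python
#!/usr/bin/env python3
"""Sketch: check whether deletion finished before the failure in a .command.log."""
import sys

DELETION_DONE_MARKER = 'Deletion complete'   # hypothetical marker string
FAILURE_MARKER = 'ERROR'                     # hypothetical marker string

def deletion_finished_before_failure(log_path: str) -> bool:
    deletion_line = failure_line = None
    with open(log_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if deletion_line is None and DELETION_DONE_MARKER in line:
                deletion_line = line_number
            if failure_line is None and FAILURE_MARKER in line:
                failure_line = line_number
    if failure_line is None:
        return deletion_line is not None
    return deletion_line is not None and deletion_line < failure_line

if __name__ == '__main__':
    print(deletion_finished_before_failure(sys.argv[1]))
```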
https://github.com/broadinstitute/gatk/issues/8134 is another good reason to consider using `samtools markdup` instead, if benchmarking is promising.
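For reference, the standard `samtools markdup` sequence (collate, fixmate, sort, markdup) that would stand in for `MarkDuplicatesSpark`, sketched here with assumed file names and thread count:

```python
#!/usr/bin/env python3
"""Sketch: samtools-based duplicate marking workflow."""
import subprocess

THREADS = '4'            # assumed thread count
IN_BAM = 'aligned.bam'   # assumed aligner output
OUT_BAM = 'markdup.bam'

def run(args):
    print(' '.join(args))
    subprocess.run(args, check=True)

# markdup needs mate score tags, which fixmate adds on name-grouped input.
run(['samtools', 'collate', '-@', THREADS, '-o', 'collated.bam', IN_BAM])
run(['samtools', 'fixmate', '-@', THREADS, '-m', 'collated.bam', 'fixmate.bam'])
run(['samtools', 'sort', '-@', THREADS, '-o', 'positionsort.bam', 'fixmate.bam'])
run(['samtools', 'markdup', '-@', THREADS, 'positionsort.bam', OUT_BAM])
```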
This is not too urgent, but we probably want to implement the following processes so that we can process large samples with multiple libraries (e.g. CPCG0196-F1) with 2 TB of scratch.
We could parallelize #1 for intermediate-size samples with multiple libraries (e.g. CPCG0196-B1), but I'm not sure this would always be faster because the library-level BAMs need to be merged.

> It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.
Originally posted by @tyamaguchi-ucla in https://github.com/uclahs-cds/pipeline-align-DNA/issues/229#issuecomment-1199972121
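A rough sketch of the library-level approach suggested above (per-library `MarkDuplicatesSpark`, intermediate cleanup, then `samtools merge`); the file names and the choice to delete the per-library inputs are assumptions:

```python
#!/usr/bin/env python3
"""Sketch: deduplicate per library to cap scratch use, then merge to a sample BAM."""
import os
import subprocess

LIBRARY_BAMS = ['CPCG0196-F1_lib1.bam', 'CPCG0196-F1_lib2.bam']  # assumed per-library BAMs
MERGED_BAM = 'CPCG0196-F1.dedup.bam'

dedup_bams = []
for library_bam in LIBRARY_BAMS:
    dedup_bam = library_bam.replace('.bam', '.dedup.bam')
    # Run MarkDuplicatesSpark on one library at a time to limit its scratch footprint.
    subprocess.run(
        ['gatk', 'MarkDuplicatesSpark', '-I', library_bam, '-O', dedup_bam],
        check=True,
    )
    # Remove the library-level input before the next library starts filling scratch.
    os.remove(library_bam)
    dedup_bams.append(dedup_bam)

# Merge the per-library deduplicated BAMs into a single sample-level BAM.
subprocess.run(['samtools', 'merge', '-f', MERGED_BAM] + dedup_bams, check=True)
```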