uclahs-cds / pipeline-align-DNA

Nextflow pipeline to align paired DNA FASTQs and sort, mark duplicates, and index the resulting alignment
https://uclahs-cds.github.io/pipeline-align-DNA/
GNU General Public License v2.0

Run MarkDuplicatesSpark by library #234

Open · tyamaguchi-ucla opened this issue 2 years ago

tyamaguchi-ucla commented 2 years ago

This is not too urgent, but we probably want to implement the following processes:

  1. Run MarkDuplicatesSpark by library (or SAMtools markdup)
  2. Remove intermediate files
  3. Merge BAMs with SAMtools merge

so that we can process large samples with multiple libraries (e.g. CPCG0196-F1) within 2 TB of scratch.

We could parallelize step 1 for intermediate-size samples with multiple libraries (e.g. CPCG0196-B1), but I'm not sure this would always be faster because the library-level BAMs still need to be merged (a rough sketch follows at the end of this comment).

""" It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.

Originally posted by @tyamaguchi-ucla in https://github.com/uclahs-cds/pipeline-align-DNA/issues/229#issuecomment-1199972121
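
A rough sketch of steps 1–3 above, assuming one BAM per library as input (the library and sample names are hypothetical) and that in the pipeline each step would become its own Nextflow process:

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRATCH_DIR=/scratch/markdup_by_library   # hypothetical scratch working directory
mkdir -p "${SCRATCH_DIR}"

# 1. Mark duplicates one library at a time so that only one library's
#    Spark intermediates occupy /scratch at any point.
for LIB_BAM in sample.lib1.bam sample.lib2.bam; do
    LIB=$(basename "${LIB_BAM}" .bam)
    gatk MarkDuplicatesSpark \
        --input "${LIB_BAM}" \
        --output "${SCRATCH_DIR}/${LIB}.markdup.bam" \
        --tmp-dir "${SCRATCH_DIR}/tmp_${LIB}"

    # 2. Remove that library's intermediate files before starting the next library.
    rm -rf "${SCRATCH_DIR}/tmp_${LIB}"
done

# 3. Merge the library-level BAMs into a single sample-level BAM.
samtools merge -@ 4 sample.markdup.bam "${SCRATCH_DIR}"/*.markdup.bam
samtools index sample.markdup.bam
```

Parallelizing step 1 across libraries (as discussed above) would replace the sequential loop with concurrent processes, at the cost of holding several libraries' intermediates in /scratch at the same time.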

tyamaguchi-ucla commented 1 year ago

@nkwang24 @yashpatel6 For multi-library samples, this approach would help, although it will take some time to implement. Also, it would be helpful to understand how /scratch usage compares between MarkDuplicatesSpark and SAMtools markdup.
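
One hedged way to compare the two is to confine each tool's temporary files to a dedicated /scratch directory and sample its size while the tool runs; in the sketch below all paths and file names are illustrative, and the equivalent knob for samtools markdup would be its `-T` temp-file prefix instead of `--tmp-dir`:

```bash
#!/usr/bin/env bash
set -euo pipefail

TMP_PROBE=/scratch/markdup_tmp_probe   # hypothetical dedicated temp directory
mkdir -p "${TMP_PROBE}"

# Run the tool under test in the background with its temporary files
# confined to the probe directory (input/output names are illustrative).
gatk MarkDuplicatesSpark \
    --input sample.bam \
    --output sample.markdup.bam \
    --tmp-dir "${TMP_PROBE}" &
TOOL_PID=$!

# Sample the probe directory size once a minute until the tool exits.
while kill -0 "${TOOL_PID}" 2>/dev/null; do
    printf '%s\t%s\n' "$(date -Is)" "$(du -sh "${TMP_PROBE}" | cut -f1)" >> tmp_usage.tsv
    sleep 60
done
wait "${TOOL_PID}"
```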

nkwang24 commented 1 year ago

@tyamaguchi-ucla agreed. I wrote a script to periodically log scratch use over the course of a metapipeline run (a minimal version is sketched below), but as I commented in #229, I can't access the files generated by Spark. The best I've been able to do is correlate sample size with where in the metapipeline the failures occur. Based on what I've gathered so far using the latest metapipeline PR, it looks like align-DNA lets samples of up to ~450 GB through; of these, call-gSNP lets samples of up to ~400 GB through.
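
A minimal sketch of such a logger, assuming /scratch is the filesystem of interest (the interval and log path are illustrative); note that `df` reports usage at the filesystem level, so it doesn't need read access to the Spark temporary files:

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRATCH_MOUNT=/scratch        # filesystem to watch
LOG=scratch_usage.tsv         # illustrative log path
INTERVAL=300                  # seconds between samples

printf 'timestamp\tused\tavail\tuse_pct\n' > "${LOG}"
while true; do
    # df reports usage at the filesystem level, so files written by another
    # user (e.g. Spark temporaries) are counted even if they are unreadable.
    read -r USED AVAIL PCT < <(df -h --output=used,avail,pcent "${SCRATCH_MOUNT}" | tail -n 1)
    printf '%s\t%s\t%s\t%s\n' "$(date -Is)" "${USED}" "${AVAIL}" "${PCT}" >> "${LOG}"
    sleep "${INTERVAL}"
done
```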

It's really hard to get a good idea of what's going on, as my tests have been somewhat inconsistent and confounded by stochastic node-level errors. Possible sources of inconsistency:

  1. Failures may depend on tumor vs. normal FASTQ size rather than total FASTQ size
  2. @yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in how long processes take to write their outputs to /hot? If the processes are asynchronous, this could affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted
yashpatel6 commented 1 year ago
> @yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in how long processes take to write their outputs to /hot? If the processes are asynchronous, this could affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted

We discussed this briefly in the metapipeline-DNA meeting but just to log the discussion here:

The intermediate file deletion within individual pipelines doesn't depend on the write to /hot: when intermediate file deletion is enabled, the deleted files are never written to /hot in the first place, and conversely, any output files that are written to /hot in that case aren't subject to the deletion process.

From an inter-pipeline deletion perspective, there's only one case where this happens: align-DNA output is deleted from /scratch once it has been copied to /hot and used by the first step of call-gSNP. This process could be affected by latency: if the copy to /hot is slow, call-gSNP may continue running while the deletion process is still waiting for the files to finish copying. This specific case can actually be traced from the .command.log of the failing sample/patient by checking whether the deletion process had completed by the time the pipeline failed.
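
As a hedged illustration of that kind of check, the .exitcode file that Nextflow writes in each task work directory records when a task finished and with what status; the work-directory paths below are placeholders for the actual deletion task and the failing task:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder paths for the two Nextflow task work directories of interest:
# the inter-pipeline deletion task and the failing downstream task.
DELETION_TASK_DIR=work/ab/cdef1234
FAILING_TASK_DIR=work/cd/ef567890

for TASK_DIR in "${DELETION_TASK_DIR}" "${FAILING_TASK_DIR}"; do
    EXITCODE_FILE="${TASK_DIR}/.exitcode"
    if [[ -f "${EXITCODE_FILE}" ]]; then
        # .exitcode is written when the task ends; its mtime and contents show
        # when the task finished and whether it succeeded.
        printf '%s finished at %s with exit code %s\n' \
            "${TASK_DIR}" \
            "$(stat -c '%y' "${EXITCODE_FILE}")" \
            "$(cat "${EXITCODE_FILE}")"
    else
        printf '%s has no .exitcode (task still running or never finished)\n' "${TASK_DIR}"
    fi
done
```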

tyamaguchi-ucla commented 1 year ago

https://github.com/broadinstitute/gatk/issues/8134 is another good reason to consider using samtools markdup instead, if benchmarking looks promising.
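
For reference, the samtools-based alternative would look roughly like the standard collate/fixmate/sort/markdup chain below (thread count, temp prefix, and file names are illustrative); samtools markdup relies on the mate-score tags added by `samtools fixmate -m`, hence the preceding steps:

```bash
#!/usr/bin/env bash
set -euo pipefail

THREADS=4
TMP_PREFIX=/scratch/samtools_markdup_tmp   # hypothetical scratch prefix

# samtools markdup expects position-sorted input carrying the ms/MC tags
# added by `samtools fixmate -m` on name-collated reads.
samtools collate -@ "${THREADS}" -O -u sample.bam "${TMP_PREFIX}.collate" \
    | samtools fixmate -@ "${THREADS}" -m -u - - \
    | samtools sort -@ "${THREADS}" -u -T "${TMP_PREFIX}.sort" - \
    | samtools markdup -@ "${THREADS}" -T "${TMP_PREFIX}.markdup" - sample.markdup.bam

samtools index sample.markdup.bam
```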