Open jarbet opened 2 years ago
It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.
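A minimal sketch of that per-library fallback, assuming hypothetical library names and paths (the `gatk MarkDuplicatesSpark` and `samtools merge` invocations follow their documented CLIs; everything else here is illustrative, not the pipeline's actual wiring):

```shell
# Mark duplicates one library at a time, reclaiming scratch between runs,
# then merge the per-library outputs. Library names/paths are hypothetical.
for LIB in lib1 lib2 lib3; do
    gatk MarkDuplicatesSpark \
        --input "${LIB}.bam" \
        --output "${LIB}.dedup.bam" \
        --tmp-dir /scratch/markdup-tmp
    rm -rf /scratch/markdup-tmp   # free intermediates before the next library
done
samtools merge merged.dedup.bam lib1.dedup.bam lib2.dedup.bam lib3.dedup.bam
```

This trades wall-clock time (libraries run serially) for peak scratch usage, since only one library's Spark intermediates exist at a time.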
Also, it looks like there have been no major updates to MarkDuplicatesSpark since 4.2.4.1 (our current version; the latest is 4.2.6.1).
Yeah barring special nodes with expanded disk space or moving MarkDuplicatesSpark to run per library one by one, it'd be hard to fix this problem.
I think we want to implement #234 in the long run, but we could also try the -Dsamjdk.compression_level option, although I couldn't find the default compression level documented for MarkDuplicatesSpark:

--java-options -Dsamjdk.compression_level=X
Currently testing with --java-options -Dsamjdk.compression_level=6
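For reference, a hedged sketch of how that property reaches the JVM via GATK's --java-options flag (the flag names are GATK's documented CLI; the input/output/tmp paths are placeholders, not the pipeline's actual parameters):

```shell
# Sketch only: samjdk.compression_level is a JVM system property, so it is
# passed through --java-options rather than as a tool argument.
gatk MarkDuplicatesSpark \
    --java-options "-Dsamjdk.compression_level=6" \
    --input CPCG0196-F1.bam \
    --output CPCG0196-F1.dedup.bam \
    --tmp-dir /scratch/markdup-tmp
```

Higher compression levels shrink the intermediate files at the cost of more CPU time per block written.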
@jarbet Did changing the compression level help? I'm running into the same issue with a subset of CPCG. It looks like samples with a total fastq size above ~400 GB will fail with the current Spark configuration. The fastq size distribution of CPCG overlaps this ~400 GB limit, with ~1/3 of the cohort being too large.
I was trying to monitor scratch usage, but the intermediate files generated by Spark are assigned to nfsnobody with no read access, so I can't query directory size.
nfsnobody is the account NFS (Network File System) falls back to when it cannot map a remote user to a local one, e.g. when the remote user does not exist on the local system or cannot be authenticated. When this happens, NFS assigns ownership to nfsnobody instead of the remote user's account.
Not sure if there's a way to properly map the users so this doesn't happen, but this is probably low priority.
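One possible workaround for the monitoring side, sketched as shell commands (the /scratch mount point and the idmapd config path are assumptions about this cluster's setup):

```shell
# df reports usage at the filesystem level, so it works even when
# directory contents are owned by nfsnobody with no read access (unlike du).
df -h /scratch

# nfsnobody mappings are often caused by an NFSv4 idmapd Domain mismatch
# between client and server; the setting lives in /etc/idmapd.conf.
grep -i '^Domain' /etc/idmapd.conf
```

This wouldn't give per-directory breakdowns, but it's enough to watch total scratch consumption during a run.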
@tyamaguchi-ucla mentioned that it could potentially be possible to have Spark parallelize less, theoretically reducing data copying and scratch usage. The parameters for this are located in the F72.config and not template.config or default.config. @yashpatel6 would reducing the number of cpus allowed for the run_MarkDuplicatesSpark_GATK process reduce scratch usage?
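If that's worth testing, a hedged sketch of what the override might look like (the process name is from this thread; the cpus value and the exact config structure are assumptions to verify against the actual F72.config):

```groovy
// Hypothetical Nextflow process override; check against F72.config before use.
process {
    withName: 'run_MarkDuplicatesSpark_GATK' {
        cpus = 8   // fewer Spark worker threads may mean fewer concurrent
                   // shuffle/spill files on scratch, at the cost of runtime
    }
}
```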
If not, it looks like ~1/3 of CPCG will need to be run without Spark, or will depend on upgrades to our F72 scratch size.
If we can conclude that the only way around this is by increasing scratch size, I can write up a cost-benefit analysis of upgrading scratch vs. having to run larger samples with Picard and send it to Paul.
It looks like the current metapipeline bottlenecks are scratch space during the align-DNA MarkDuplicatesSpark step and the call-gSNP recalibrate/reheader steps, so it might be necessary to expand scratch regardless, unless we can make optimizations at both of these steps.
Describe the bug
Pipeline failed when testing on CPCG0196-F1, giving error exit status (3) for run_MarkDuplicatesSpark_GATK. First noticed here.

Testing info/results:
BWA-MEM2 (failed after 19 hours)

- Test script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/testing_CPCG0196-F1.sh
- Sample: CPCG0196-F1
- Input CSV: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
- Report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
HISAT2 (failed after 22 hours)

- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
- Report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log
Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

But only HISAT2 says the following (several times) in regards to run_MarkDuplicatesSpark_GATK: