Open jarbet opened 2 years ago
It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.
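A minimal sketch of that per-library fallback, assuming hypothetical library names and paths (the `gatk MarkDuplicatesSpark` and `samtools merge` invocations follow their documented CLIs; everything else here is illustrative, not the pipeline's actual wiring):

```shell
# Mark duplicates one library at a time, reclaiming scratch between runs,
# then merge the per-library outputs. Library names/paths are hypothetical.
for LIB in lib1 lib2 lib3; do
    gatk MarkDuplicatesSpark \
        --input "${LIB}.bam" \
        --output "${LIB}.dedup.bam" \
        --tmp-dir /scratch/markdup-tmp
    rm -rf /scratch/markdup-tmp   # free intermediates before the next library
done
samtools merge merged.dedup.bam lib1.dedup.bam lib2.dedup.bam lib3.dedup.bam
```

This trades wall-clock time (libraries run serially) for peak scratch usage, since only one library's Spark intermediates exist at a time.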
Also, it looks like there have been no major updates to MarkDuplicatesSpark since 4.2.4.1 (our current version; the latest is 4.2.6.1).
Yeah barring special nodes with expanded disk space or moving MarkDuplicatesSpark to run per library one by one, it'd be hard to fix this problem.
I think we want to implement #234 in the long run, but we could also try the -Dsamjdk.compression_level option, although I couldn't find the default compression level documented for MarkDuplicatesSpark:

--java-options -Dsamjdk.compression_level=X
Currently testing with --java-options -Dsamjdk.compression_level=6
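For reference, a hedged sketch of how that property reaches the JVM via GATK's --java-options flag (the flag names are GATK's documented CLI; the input/output/tmp paths are placeholders, not the pipeline's actual parameters):

```shell
# Sketch only: samjdk.compression_level is a JVM system property, so it is
# passed through --java-options rather than as a tool argument.
gatk MarkDuplicatesSpark \
    --java-options "-Dsamjdk.compression_level=6" \
    --input CPCG0196-F1.bam \
    --output CPCG0196-F1.dedup.bam \
    --tmp-dir /scratch/markdup-tmp
```

Higher compression levels shrink the intermediate files at the cost of more CPU time per block written.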
@jarbet Did changing the compression level help? I'm running into the same issue with a subset of CPCG. It looks like samples with a total fastq size above ~400 GB will fail with the current Spark configuration. The fastq size distribution of CPCG overlaps this ~400 GB limit, with ~1/3 of the cohort being too large.
I was trying to monitor scratch usage, but the intermediate files generated by Spark are assigned to nfsnobody with no read access, so I can't query directory size.
nfsnobody is the account NFS (Network File System) falls back to when it cannot map a remote user to a local one, e.g. when the remote user does not exist on the local system or cannot be authenticated. When this happens, NFS assigns ownership to nfsnobody instead of the remote user's account.
Not sure if there's a way to properly map the users so this doesn't happen, but this is probably low priority.
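One possible workaround for the monitoring side, sketched as shell commands (the /scratch mount point and the idmapd config path are assumptions about this cluster's setup):

```shell
# df reports usage at the filesystem level, so it works even when
# directory contents are owned by nfsnobody with no read access (unlike du).
df -h /scratch

# nfsnobody mappings are often caused by an NFSv4 idmapd Domain mismatch
# between client and server; the setting lives in /etc/idmapd.conf.
grep -i '^Domain' /etc/idmapd.conf
```

This wouldn't give per-directory breakdowns, but it's enough to watch total scratch consumption during a run.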
@tyamaguchi-ucla mentioned that it could potentially be possible to have Spark parallelize less, theoretically reducing data copying and scratch usage. The parameters for this are located in the F72.config and not template.config or default.config. @yashpatel6 would reducing the number of cpus allowed for the run_MarkDuplicatesSpark_GATK process reduce scratch usage?
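If that's worth testing, a hedged sketch of what the override might look like (the process name is from this thread; the cpus value and the exact config structure are assumptions to verify against the actual F72.config):

```groovy
// Hypothetical Nextflow process override; check against F72.config before use.
process {
    withName: 'run_MarkDuplicatesSpark_GATK' {
        cpus = 8   // fewer Spark worker threads may mean fewer concurrent
                   // shuffle/spill files on scratch, at the cost of runtime
    }
}
```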
If not, it looks like ~1/3 of CPCG will need to be run without Spark, or will depend on upgrades to our F72 scratch size.
If we can conclude that the only way around this is by increasing scratch size, I can write up a cost-benefit analysis of upgrading scratch vs. having to run larger samples with Picard and send it to Paul.
It looks like the current metapipeline bottlenecks are scratch space during the align-DNA MarkDuplicatesSpark step and the call-gSNP recalibrate/reheader steps, so it might be necessary to expand scratch regardless, unless we can make optimizations at both of these steps.
Describe the bug
Pipeline failed when testing on CPCG0196-F1, giving error exit status (3) for run_MarkDuplicatesSpark_GATK. First noticed here.

Testing info/results:
BWA-MEM2 (failed after 19 hours)

- Test script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/testing_CPCG0196-F1.sh
- Sample: CPCG0196-F1
- Input CSV: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
- Report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
HISAT2 (failed after 22 hours)

- Config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
- Report: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
- Log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log
Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

But only HISAT2 says the following (several times) in regards to run_MarkDuplicatesSpark_GATK: