Ran into a problem with DamageProfiler failing on two particularly large datasets, giving the following error:
```
DamageProfiler v0.4.9
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at org.apache.commons.math3.util.ResizableDoubleArray.expand(ResizableDoubleArray.java:697)
	at org.apache.commons.math3.util.ResizableDoubleArray.addElement(ResizableDoubleArray.java:442)
	at org.apache.commons.math3.stat.descriptive.DescriptiveStatistics.addValue(DescriptiveStatistics.java:171)
	at IO.OutputGenerator.computeSummaryMetrics(OutputGenerator.java:542)
	at calculations.StartCalculations.generateOutput(StartCalculations.java:357)
	at calculations.StartCalculations.start(StartCalculations.java:302)
	at RunDamageProfiler.main(RunDamageProfiler.java:37)
```
The input BAM files are 44 and 53 GB, with 1,329,979,427 and 1,516,157,394 records respectively. I tried increasing the memory allocated to Java and using an updated version of Java (outside the eager 2.3.3 Singularity image), but neither worked.
After opening an issue on the DamageProfiler GitHub repository, I tried subsampling the BAM and then running DamageProfiler. This worked, and allowed me to run these two subsampled datasets through this process only and then `-resume` the pipeline.
The following is what I would put into the .command.sh if I were to encounter this problem again.
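A sketch of what that `.command.sh` could look like, assuming `samtools` is available in the container. The input/output file names, the JAR path, and the DamageProfiler arguments are illustrative assumptions, not eager's exact invocation:

```shell
# Subsample the BAM to ~10% of reads before profiling. Damage patterns are
# a per-read property, so a random subsample should still be representative.
# samtools view -s takes INT.FRAC: INT is the random seed, 0.FRAC the
# fraction of reads kept (here seed 42, keep 10%).
samtools view -b -s 42.1 input.bam > input.subsampled.bam

# Run DamageProfiler on the subsampled BAM (paths/flags are assumptions;
# -i input, -r reference, -o output directory).
java -Xmx16g -jar /path/to/DamageProfiler.jar \
    -i input.subsampled.bam \
    -r reference.fasta \
    -o damageprofiler_output
```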
Maybe something a bit cleverer and more generic than this could be considered, so that profiling processes don't use excessive time/memory/CPU on extremely deeply sequenced datasets and are less likely to crash or be killed.
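One generic approach could be to count records first and only subsample when a threshold is exceeded. This is a sketch, not anything eager currently does; the `MAX_READS` cap is an arbitrary assumption, and in a real script the count would come from the BAM itself:

```shell
# Cap the number of records fed to DamageProfiler at MAX_READS
# (threshold value is an assumption; tune to what the JVM can handle).
MAX_READS=100000000

# In the real script this would come from the BAM, e.g.:
#   count=$(samtools view -c input.bam)
# Here we reuse one of the record counts from this issue as an example.
count=1329979427

# Fraction of reads to keep; 1.0 means no subsampling is needed.
frac=$(awk -v c="$count" -v m="$MAX_READS" \
    'BEGIN { if (c > m) printf "%.4f", m / c; else printf "1.0" }')

if [ "$frac" != "1.0" ]; then
    # Build the samtools command: seed 42 plus the fraction, following
    # the samtools view -s INT.FRAC convention. Echoed here as a dry run.
    echo "samtools view -b -s 42${frac#0} input.bam > input.subsampled.bam"
fi
```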
https://github.com/nf-core/eager/blob/bb32ae3b0110b9a26b791f73a5324828d849271a/main.nf#L2105