nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License

subset large datasets for dataset-wide statistics #942

Open · ChristopherBarrington opened this issue 1 year ago

ChristopherBarrington commented 1 year ago

Ran into a problem with DamageProfiler failing on two particularly large datasets, giving the following error:

DamageProfiler v0.4.9
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at org.apache.commons.math3.util.ResizableDoubleArray.expand(ResizableDoubleArray.java:697)
        at org.apache.commons.math3.util.ResizableDoubleArray.addElement(ResizableDoubleArray.java:442)
        at org.apache.commons.math3.stat.descriptive.DescriptiveStatistics.addValue(DescriptiveStatistics.java:171)
        at IO.OutputGenerator.computeSummaryMetrics(OutputGenerator.java:542)
        at calculations.StartCalculations.generateOutput(StartCalculations.java:357)
        at calculations.StartCalculations.start(StartCalculations.java:302)
        at RunDamageProfiler.main(RunDamageProfiler.java:37)

The input BAM files are 44 GB and 53 GB, with 1,329,979,427 and 1,516,157,394 records respectively. I tried increasing the memory given to Java and using a newer Java version (outside the eager 2.3.3 Singularity image), but nothing worked; presumably because "Requested array size exceeds VM limit" means a single Java array has hit the JVM's ~2^31-element cap, which more heap cannot fix.
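For reference, a heap-raising attempt looks something like the sketch below. JAVA_TOOL_OPTIONS is a standard JVM environment variable, but the exact invocation here is an illustration, not the command that was actually run.

# Sketch of the attempted workaround: give the DamageProfiler JVM more heap.
# This cannot help here: "Requested array size exceeds VM limit" is a cap on
# a single array (~2^31 elements), independent of the total heap size.
export JAVA_TOOL_OPTIONS='-Xmx250g'
damageprofiler -i dataset_rmdup.bam -r canFam3.fa -l 100 -t 15 -o .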

After opening an issue on the DamageProfiler GitHub repository, I tried subsampling the BAM and then running DamageProfiler on the subset. This worked: I ran the two subsampled datasets through this process only and then used -resume to continue the pipeline.

The following is what I would put into the .command.sh if I were to encounter this problem again:

# Subsample ~5% of reads with a fixed seed so the subset is reproducible
# (the --subsample/--subsample-seed long options need samtools >= 1.13).
samtools view --subsample-seed 0 --subsample 0.05 --threads 32 --bam --output dataset_rmdup.subset.bam dataset_rmdup.bam
samtools index dataset_rmdup.subset.bam

# Run DamageProfiler on the subset, with a large JVM heap.
damageprofiler -Xmx250g -i dataset_rmdup.subset.bam -r canFam3.fa -l 100 -t 15 -o . -yaxis_damageplot 0.30

# Rename DamageProfiler's per-input output directory back to the original
# basename so that -resume finds the outputs it expects.
mv dataset_rmdup.subset dataset_rmdup
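Before running DamageProfiler on the subset, a quick sanity check can be worthwhile. This snippet is illustrative; the expected count simply follows from 0.05 × 1,329,979,427 ≈ 66.5M records for the first dataset.

# Illustrative sanity check: the subset BAM should be intact and contain
# roughly 0.05 * 1,329,979,427 ≈ 66.5M records for the first dataset.
samtools quickcheck dataset_rmdup.subset.bam && echo 'subset BAM OK'
samtools view -c --threads 8 dataset_rmdup.subset.bam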

Maybe something a bit cleverer and more generic than this could be considered, so that profiling processes don't spend excessive time/memory/CPU on extremely deeply sequenced datasets, or crash or get killed outright. A minimal sketch of one option follows.
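For example, the subsampling could be made conditional on a read-count threshold, with the fraction derived from the actual count. A minimal sketch, assuming a 100M-read cap; the threshold, file names, and thread counts are made up for illustration and are not anything eager currently implements.

#!/usr/bin/env bash
# Sketch: subsample only when the BAM exceeds MAX_READS, deriving the fraction
# so DamageProfiler sees at most ~MAX_READS records. All values illustrative.
set -euo pipefail

bam=dataset_rmdup.bam
MAX_READS=100000000   # illustrative cap, not an eager default

total=$(samtools view -c --threads 8 "$bam")

if [ "$total" -gt "$MAX_READS" ]; then
    frac=$(awk -v t="$total" -v m="$MAX_READS" 'BEGIN { printf "%.4f", m / t }')
    samtools view --subsample-seed 0 --subsample "$frac" --threads 8 \
        --bam --output "${bam%.bam}.subset.bam" "$bam"
else
    # Small enough to profile directly; keep a stable name for the next step.
    ln -sf "$bam" "${bam%.bam}.subset.bam"
fi
samtools index "${bam%.bam}.subset.bam"

Using a fixed --subsample-seed keeps the subset deterministic, so repeated runs and -resume would see identical statistics.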

https://github.com/nf-core/eager/blob/bb32ae3b0110b9a26b791f73a5324828d849271a/main.nf#L2105