smith-chem-wisc / Spritz

Software for RNA-Seq analysis that creates sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

gatk MarkDuplicates Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded #214

Closed: Stanley80 closed this issue 3 years ago

Stanley80 commented 3 years ago

Dear Spritz developers, thank you very much for your amazing work. We tried to run Spritz on 12 samples on a local machine with 40 GB of RAM and many cores, but the computation stopped at the GATK MarkDuplicates command.

```
INFO  2021-07-20 06:16:14  MarkDuplicates  Tracking 65403162 as yet unmatched pairs. 65358276 records in RAM.
[Tue Jul 20 08:46:37 CEST 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 989.15 minutes.
Runtime.totalMemory()=30150754304
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.HashMap.newNode(HashMap.java:1750)
    at java.util.HashMap.putVal(HashMap.java:631)
    at java.util.HashMap.put(HashMap.java:612)
    at htsjdk.samtools.SAMRecord.setTransientAttribute(SAMRecord.java:2318)
    at htsjdk.samtools.DuplicateScoringStrategy.computeDuplicateScore(DuplicateScoringStrategy.java:112)
    at htsjdk.samtools.DuplicateScoringStrategy.computeDuplicateScore(DuplicateScoringStrategy.java:62)
    at picard.sam.markduplicates.MarkDuplicates.buildReadEnds(MarkDuplicates.java:650)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:552)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /mnt/bin/miniconda3/envs/spritzenv/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar
```

We tried to figure out what the problem could be, and we noticed that we ran out of memory at the mitochondrial genome, due to the very large number of reads mapped there.

The output of `samtools idxstats` (columns: reference name, sequence length, mapped reads, unmapped reads):

```
1    248956422  151486569  0
2    242193529  129975569  0
3    198295559  121739113  0
4    190214555   80953331  0
5    181538259   96426281  0
6    170805979   94787913  0
7    159345973   79069305  0
8    145138636   64688331  0
9    138394717   85252265  0
10   133797422   64796009  0
11   135086622   88895141  0
12   133275309   73509958  0
13   114364328   52786721  0
14   107043718  134493736  0
15   101991189   52094551  0
16    90338345   44675919  0
17    83257441   54111496  0
18    80373285   32061298  0
19    58617616   32868629  0
20    64444167   30170370  0
21    46709983   43678658  0
22    50818468   18853590  0
X    156040895   50604944  0
Y     57227415    1872438  0
MT       16569  844752497  0
```

Our run is bounded by 32 GB of RAM and the default value of the `--MAX_RECORDS_IN_RAM` argument.
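For reference, these two limits are usually passed to MarkDuplicates along these lines (file names and values are illustrative; this is a sketch, not our exact command):

```shell
# -Xmx sizes the Java heap; Picard's --MAX_RECORDS_IN_RAM caps how many
# records are buffered in memory before spilling to temporary files on disk.
# Guarded so the sketch is a no-op where GATK is not installed.
if command -v gatk >/dev/null; then
  gatk --java-options "-Xmx30g" MarkDuplicates \
      -I combined.sorted.grouped.bam \
      -O combined.sorted.grouped.marked.bam \
      -M combined.sorted.grouped.marked.metrics \
      --MAX_RECORDS_IN_RAM 250000
else
  echo "gatk not on PATH; skipping sketch"
fi
```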

Is there any solution for this issue?

Thank you very much for your support.

acesnik commented 3 years ago

Hi there! Please edit the following line to double the RAM allocated to MarkDuplicates: https://github.com/smith-chem-wisc/Spritz/blob/master/Spritz/rules/variants.smk#L1. Then give it another try.

acesnik commented 3 years ago

i.e., change `GATK_MEM=16000 # MB` to `GATK_MEM=32000 # MB`
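From a shell, that amounts to a one-line substitution in the rules file (repo path from the link above; adjust to your checkout):

```shell
# Double the memory allocated to GATK in the Spritz rules file.
# Guarded so this is a no-op outside a Spritz checkout.
smk=Spritz/rules/variants.smk
if [ -f "$smk" ]; then
  sed -i 's/GATK_MEM=16000/GATK_MEM=32000/' "$smk"
fi
```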

acesnik commented 3 years ago

Hope that works!

acesnik commented 3 years ago

Are you using the GUI or the command-line version?

puva commented 3 years ago

Hi, I work together with @Stanley80. I reran with `GATK_MEM=32000 # MB`, but the workflow stopped with the same error message.

acesnik commented 3 years ago

Hi @puva, thanks for the update.

If possible, could you please zip and attach the `*.log` and `*.benchmark` files for the run where you encountered the bug? At minimum, could you provide the log for the MarkDuplicates step where the issue arose?

I'm wondering if it is crashing midway through marking duplicates or during the sorting step.

A next step may be to back up `combined.sorted.grouped.bam` (move it to `combined.sorted.grouped.bam_bkup`), filter out the mitochondrial sequences, and name the filtered file `combined.sorted.grouped.bam`, so that the workflow resumes from there.
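A sketch of that backup-and-filter step with samtools (file names from this thread; `MT` is the mitochondrial reference name shown in the idxstats output above):

```shell
# Guarded so the sketch is a no-op where samtools or the BAM is absent.
if command -v samtools >/dev/null && [ -f combined.sorted.grouped.bam ]; then
  mv combined.sorted.grouped.bam combined.sorted.grouped.bam_bkup
  samtools index combined.sorted.grouped.bam_bkup
  # Keep every reference sequence except MT ('*' is the unmapped bin):
  keep=$(samtools idxstats combined.sorted.grouped.bam_bkup \
         | awk '$1 != "MT" && $1 != "*" {print $1}')
  samtools view -b combined.sorted.grouped.bam_bkup $keep > combined.sorted.grouped.bam
  samtools index combined.sorted.grouped.bam
fi
```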

If it is crashing during sorting: I updated Spritz yesterday to use an assumed sort order in this step, so you could try replacing the shell command with this one, which should skip the sorting: https://github.com/smith-chem-wisc/Spritz/blob/master/Spritz/workflow/rules/variants.smk#L71

puva commented 3 years ago

Hi @acesnik, thanks for your feedback. I'm attaching the log from the MarkDuplicates step; it does look like MarkDuplicates itself is crashing. combined.sorted.grouped.marked.log

As you suggested, I'm replacing `combined.sorted.grouped.bam` with a smaller BAM without the reads mapping to chrMT, which I suspect are causing the issue. Then I'll rerun the workflow and let you know.

acesnik commented 3 years ago

Yeah, there is a wild number of reads on that tiny chromosome!


I think that should help, too. Best of luck!

acesnik commented 3 years ago

How did this run go?

puva commented 3 years ago

After removing the MT reads, we were able to complete the workflow! Fortunately, the disorder we are studying should not affect the mitochondria. For other projects, I'll check whether there is an option in GATK for downsampling reads. Thanks for your help!
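For what it's worth, GATK bundles Picard's DownsampleSam, which keeps each template with a fixed probability; a hedged sketch with assumed file names (at `-P 0.01`, the ~844M MT reads from the idxstats above would shrink to roughly 8.4M):

```shell
# Guarded so the sketch is a no-op where GATK or the BAM is absent.
if command -v gatk >/dev/null && [ -f combined.sorted.grouped.bam ]; then
  # Keep each read template with probability 0.01:
  gatk DownsampleSam \
      -I combined.sorted.grouped.bam \
      -O combined.sorted.grouped.downsampled.bam \
      -P 0.01
fi
```

Note that this downsamples the whole BAM uniformly; restricting it to MT would need the region filtering shown earlier in the thread.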

acesnik commented 3 years ago

That's great to hear! No problem!