gmarocena / gasv

Automatically exported from code.google.com/p/gasv

OutOfMemoryError - GC overhead limit exceeded #12

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
Run BAMToGASV.jar on my BAM file.

What is the expected output? What do you see instead?
In cases 1-4 listed below, Java aborts with "OutOfMemoryError: GC overhead limit exceeded" and BAMToGASV stops.
In case 5, BAMToGASV does not produce any error but seems to stop parsing the BAM file after about 500M reads (~12 h).

What version of the product are you using? On what operating system?
Program: BAMToGASV
Version: 2.0.1
from GASVRelease_June27_2013.tgz

Please provide any additional information below.
The BAM file is large (1,286,194,550 reads).
The library has a 2 kb insert size.

I tested different cases:
1 - default GASVPro-HQ.sh parameters (on computers with 12 GB and 32 GB of RAM)
2 - JAVAPREFIX="java -jar -Xms1g -Xmx8g"
3 - JAVAPREFIX="java -jar -Xms1g -Xmx26g"
4 - JAVAPREFIX="java -jar -Xms1g -Xmx8g -XX:-UseGCOverheadLimit"
5 - JAVAPREFIX="java -jar -Xms1g -Xmx26g -XX:-UseGCOverheadLimit"
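
For reference, case 5 corresponds roughly to the following direct invocation (a sketch only; <input.bam> is a placeholder, and the exact arguments passed by GASVPro-HQ.sh may differ):

java -Xms1g -Xmx26g -XX:-UseGCOverheadLimit -jar BAMToGASV.jar <input.bam>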

I guess the problem is the size of my BAM file (~110 GB)? Is there a way to run GASV on this file without using a cluster with more memory?

Thanks for your help,
Alex

Original issue reported on code.google.com by ale.gil...@gmail.com on 3 Sep 2013 at 1:31

GoogleCodeExporter commented 8 years ago
Hi Alex,

Thank you for your comment and for your interest in GASV and GASVPro.

The GC error is definitely related to the size of your BAM file and, in my 
experience, is far more likely to occur with a cancer genome than a normal 
genome. 

We are presently working on improvements to our BAM file processor, but in the meantime I have some suggested modifications to your BAM file that should make processing with BAMToGASV more efficient:

(1) Separate BAM File by Chromosome:

Separate your BAM file by chromosome and run BAMToGASV on each chromosome-separated file.

(Note: To correctly identify translocations, you'd need to know which chromosome each read's mate maps to, and splitting by chromosome separates the two mates of an interchromosomal pair. So you would need to sort the BAM file by read name first and output translocations separately. Or, if you are not interested in translocations, don't worry about the pairing and simply separate the BAM file by chromosome, as in the sketch below.)
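
For example, if samtools is available and your BAM file is coordinate-sorted, a per-chromosome split could look roughly like this (a sketch; input.bam is a placeholder name, and this version ignores translocation pairing as noted above):

# Index the coordinate-sorted BAM so individual chromosomes can be extracted.
samtools index input.bam
# Write one BAM per reference sequence listed in the index, skipping unmapped reads (*).
for CHR in $(samtools idxstats input.bam | cut -f1 | grep -v '^\*$'); do
    samtools view -b input.bam "$CHR" > "input.$CHR.bam"
done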

I would recommend running BAMToGASV on one chromosome first to obtain the Lmin/Lmax values, and then specifying those Lmin/Lmax values in the BAMToGASV command for the subsequent chromosomes so that the results stay consistent (see the sketch below).
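
Concretely, that workflow would look something like this (a sketch only; chr1.bam and chr2.bam are placeholder names, 100,400 stands in for the Lmin/Lmax values reported for the first chromosome, and please double-check the -CUTOFF_LMINLMAX syntax against the GASV documentation):

# First chromosome: let BAMToGASV estimate Lmin/Lmax and note the values it reports.
java -Xmx8g -jar BAMToGASV.jar chr1.bam
# Later chromosomes: reuse the same Lmin/Lmax so the cutoffs are consistent across runs.
java -Xmx8g -jar BAMToGASV.jar chr2.bam -CUTOFF_LMINLMAX EXACT=100,400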

(2) Sorted BAM File:

I'm assuming that your BAM file contains only a single mapping for each read and that it is possibly sorted by location.

If you sort your BAM file by read name then BAMToGASV will not need to use as 
much memory to store reads (since read pairs will be adjacent in the BAM file).
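
For example, with a recent samtools, a name sort would be (a sketch; input.bam is a placeholder):

# Sort by read name (-n) so both mates of a pair sit next to each other in the output.
samtools sort -n -o input.namesorted.bam input.bam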

----

Please let me know if these options make sense and are helpful to you. I am 
glad to help with any additional questions.

Cheers,

Suzanne

Original comment by sora...@gmail.com on 3 Sep 2013 at 6:31

GoogleCodeExporter commented 8 years ago

Original comment by sora...@gmail.com on 27 Feb 2014 at 2:14