lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/

sam2tsv insufficient memory error #92

Closed mt1022 closed 6 years ago

mt1022 commented 6 years ago

Subject of the issue

I have hundreds of BAM files to process. I submitted all the jobs to a node with 1 TB of memory, and 20 sam2tsv processes ran simultaneously. However, some output files were empty because of an insufficient-memory error.

Your environment

Steps to reproduce

Expected behaviour

Since I assume sam2tsv processes the BAM file line by line and the reference is 3.1 GB (human genome), I think 1 TB of memory should be enough for 20 such processes (expected to consume about 62 GB in total, i.e. 20 × 3.1 GB).

Actual behaviour

However some commands produced the expected results and some failed with the following error:

# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (gcTaskThread.cpp:48), pid=46378, tid=47651459467008
#
# JRE version:  (8.0_40-b26) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.40-b25 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
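
The log reports "Cannot create GC thread" and lists "Decrease number of Java threads" among its suggestions. One possible reading, which is an assumption and not something confirmed in this thread, is that a thread or process limit was hit rather than the heap limit: on Linux, threads count against the per-user process limit (`ulimit -u`), and HotSpot sizes its parallel GC thread pool from the number of CPU cores. The sketch below prints those two numbers and builds JVM flags that cap both heap and GC threads. The helper name `jvm_flags` and its defaults are hypothetical.

```shell
#!/bin/bash
# Sketch only: jvm_flags is a hypothetical helper, not part of jvarkit.
# -Xmx caps the Java heap; -XX:ParallelGCThreads caps GC worker threads per JVM.
jvm_flags() {
    local heap="${1:-2G}"        # per-process heap cap (assumed default)
    local gc_threads="${2:-2}"   # GC worker threads per JVM (assumed default)
    echo "-Xmx${heap} -XX:ParallelGCThreads=${gc_threads}"
}

# Limits that matter when the JVM cannot create a new thread.
echo "max user processes (threads count too): $(ulimit -u)"
echo "online CPU cores: $(getconf _NPROCESSORS_ONLN)"
echo "example flags: $(jvm_flags 2G 2)"
```

The flags could then be passed on the command line, for example `java $(jvm_flags 2G 2) -jar /path/to/sam2tsv.jar -r hg38.fa "$i"`.
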

mt1022 commented 6 years ago

As I have no experience in Java, I rewrote sam2tsv in Python with pysam. The Python version can be found here: https://gist.github.com/mt1022/737ef20f43d5acd4bc75dba0be8f334b. It produces output similar to that of sam2tsv in jvarkit, with some differences in how soft-clipping and deletions in reads are reported.

lindenb commented 6 years ago

you can increase the memory using the -Xmx option of java https://stackoverflow.com/questions/14763079/

but anyway, I'm surprised sam2tsv produced such an error because it uses a simple streaming process (the information of each sam record is printed as soon as it is read). What was your exact command line, please?

mt1022 commented 6 years ago

I use almost the default settings. My command line looks like: sam2tsv -r hg38.fa $i | awk '$5 != "."' >${i}.pos

Here, sam2tsv is the path to a bash script:

#!/bin/bash
java -Djvarkit.log.name=sam2tsv -Dfile.encoding=UTF8 -Xmx2G     -cp "..." com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv $*

The "..." in quotes stands for the paths to htsjdk-2.9.1.jar, commons-logging-1.1.1.jar, ... (many other .jar files) and sam2tsv.jar.

lindenb commented 6 years ago

can you please test this without the bash script (the manual doesn't mention it...)

this should be:

java -jar /path/to/sam2tsv.jar  -r hg38.fa $i  | awk '$5 != "."' >${i}.pos
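
For hundreds of BAM files, the same `java -jar` command can be wrapped in a small loop that keeps only a fixed number of JVMs running at once. This is a minimal sketch, not a jvarkit feature: the names `JAR`, `REF`, `process_bam` and `run_all` are illustrative, the jar path is the placeholder from the command above, and `wait -n` requires bash 4.3 or later.

```shell
#!/bin/bash
# Sketch only: JAR, REF, process_bam and run_all are illustrative names.
JAR="${JAR:-/path/to/sam2tsv.jar}"   # placeholder path, as in the thread
REF="${REF:-hg38.fa}"

# Convert one BAM file, keeping rows whose 5th column is not ".".
process_bam() {
    local bam="$1"
    java -jar "$JAR" -r "$REF" "$bam" | awk '$5 != "."' > "${bam}.pos"
}

# Run process_bam over all remaining arguments, at most $1 jobs at a time.
run_all() {
    local max_jobs="$1"; shift
    local running=0 bam
    for bam in "$@"; do
        if [ "$running" -ge "$max_jobs" ]; then
            wait -n                      # wait for any one job to finish
            running=$((running - 1))
        fi
        process_bam "$bam" &
        running=$((running + 1))
    done
    wait                                 # let the remaining jobs finish
}
```

A call such as `run_all 20 *.bam` would then process every BAM file in the current directory with no more than 20 concurrent JVMs. Arguments are passed as `"$@"` and `"$1"` so that file names containing spaces stay intact.
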

mt1022 commented 6 years ago

OK. I'll test it when there is an idle fat node.

mt1022 commented 6 years ago

Hi, Dr. Lindenbaum,

I found time to test the java -jar /path/to/sam2tsv.jar command today. It worked as expected and no "insufficient memory" error occurred. It seems that the error was due to the bash script.