groupschoof / AHRD

High-throughput protein function annotation with Human Readable Descriptions (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/

Java memory problem: GC and Heap Space #16

Closed jrtejero closed 5 years ago

jrtejero commented 5 years ago

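Hello everyone,

I have been struggling to run AHRD on my computer to annotate an average-sized genome (9k proteins, protein sequence file under 20 MB), even against a very small database (21 MB). I always run into the same two Java errors:

* Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
* java.lang.OutOfMemoryError: GC overhead limit exceeded

As far as I can tell, these errors are related to the memory available to the Java garbage collector. I have tried several AHRD versions combined with several Java versions, always with the same result: after 16-20 hours of running there is no output and no error log. The last command I launched was:

java -Xms60g -Xmx120g -jar /home/user/Software/AHRD-2.1-stable/dist/ahrd.jar ahrd_example_input.yml

I don't know whether I am doing something wrong or whether there is a problem in how the AHRD code interacts with garbage collection.

Could anybody help me? My machine and software versions are listed below:

* Ubuntu 16.04.5 LTS (Xenial Xerus) on a Fujitsu Celsius workstation with 48 CPUs at 2.3 GHz and 128 GB of RAM (4 x 16 GB modules plus 2 x 32 GB modules)
* Java versions tried:
  0 /usr/lib/jvm/java-9-openjdk-amd64/bin/java 1091 auto mode
  1 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 1081 manual mode
  2 /usr/lib/jvm/java-8-oracle/jre/bin/java 1081 manual mode
  3 /usr/lib/jvm/java-9-openjdk-amd64/bin/java 1091 manual mode
  4 /usr/lib/jvm/java-ibm-x86_64-80/jre/bin/java 80 manual mode
* AHRD versions tried: 2.1-stable and 3.3

Thanks in advance.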

groupschoof commented 5 years ago

Dear Jorge,

thank you for your message and for using AHRD.

I left Prof. Schoof's group some time ago, but I suggest using a more recent, feature-complete version of AHRD.

Please follow these installation instructions:

git clone https://github.com/groupschoof/AHRD.git
cd AHRD
git checkout tags/v3.4

Note that the tag is no longer v3.3, but v3.4.

AHRD is now more memory efficient and sets up a database, which greatly speeds up subsequent runs, provided your BLAST databases have not changed in the meantime.

For the full manual, please see https://github.com/asishallab/AHRD, especially the section on AHRD's database: https://github.com/asishallab/AHRD#211-ahrds-database

Let us know if the error persists.

Thank you and all the best!

From: "Jorge A. Ramírez-Tejero" notifications@github.com
To: "groupschoof/AHRD" AHRD@noreply.github.com
Cc: "Subscribed" subscribed@noreply.github.com
Sent: Saturday, 16 February, 2019 17:21:39
Subject: [groupschoof/AHRD] Java memory problem: GC and Heap Space (#16)

Hello everyone,

I have been struggling to run AHRD on my computer to annotate an average-sized genome (9k proteins, protein sequence file under 20 MB), even against a very small database (21 MB). I always run into the same two Java errors:

* Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
* java.lang.OutOfMemoryError: GC overhead limit exceeded

As far as I can tell, these errors are related to the memory available to the Java garbage collector. I have tried several AHRD versions combined with several Java versions, always with the same result: after 16-20 hours of running there is no output and no error log. The last command I launched was:

java -Xms60g -Xmx120g -jar /home/user/Software/AHRD-2.1-stable/dist/ahrd.jar ahrd_example_input.yml

I don't know whether I am doing something wrong or whether there is a problem in how the AHRD code interacts with garbage collection.

Could anybody help me? My machine and software versions are listed below:

* Ubuntu 16.04.5 LTS (Xenial Xerus) on a Fujitsu Celsius workstation with 48 CPUs at 2.3 GHz and 128 GB of RAM (4 x 16 GB modules plus 2 x 32 GB modules)
* Java versions tried:
  0 /usr/lib/jvm/java-9-openjdk-amd64/bin/java 1091 auto mode
  1 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 1081 manual mode
  2 /usr/lib/jvm/java-8-oracle/jre/bin/java 1081 manual mode
  3 /usr/lib/jvm/java-9-openjdk-amd64/bin/java 1091 manual mode
  4 /usr/lib/jvm/java-ibm-x86_64-80/jre/bin/java 80 manual mode
* AHRD versions tried: 2.1-stable and 3.3

Thanks in advance.


-- Dr. Asis Hallab Comparative Development and Genetics (Miltos Tsiantis) Max Planck Institute for Plant Breeding Research Carl-von-Linné-Weg 10 50829 Köln (Cologne) Germany Phone: +49-221-5062-157

jrtejero commented 5 years ago

Thanks for your quick response!

I have updated Java and then followed your advice. The command is running at the moment but, as in my previous attempts, it has not produced any output files yet. I suppose that is to be expected, since it has only just started.

I will let you know if it works!

Thanks again :smiley:

jrtejero commented 5 years ago

Hello again,

I have just remembered that I had an issue with my file headers: AHRD did not recognize them. To work around this, I had to set the option:

fasta_header_regex= ^>(\w+\S+)\s+.+:protein_coding_description:(.+)?$

Do you think this could be hampering AHRD's data processing and therefore overwhelming the Java GC?
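One way to test this suspicion before another long run is to check every header line against the pattern offline. The sketch below is only an illustration: it uses Python's `re` module as a stand-in for the Java regex engine that AHRD actually uses, the function name `unmatched_headers` is hypothetical, and the sample headers are made up because the real ones are not shown in this thread.

```python
import re

# Pattern copied verbatim from the option above. AHRD evaluates it with
# Java's regex engine; Python's `re` is only a stand-in for this check.
HEADER_REGEX = re.compile(r"^>(\w+\S+)\s+.+:protein_coding_description:(.+)?$")


def unmatched_headers(fasta_lines, pattern=HEADER_REGEX):
    """Return every FASTA header line (starting with '>') the pattern rejects."""
    rejected = []
    for line in fasta_lines:
        header = line.rstrip("\n")
        if header.startswith(">") and pattern.match(header) is None:
            rejected.append(header)
    return rejected


# Made-up sample data (assumption): one header in the expected layout,
# one header without the ":protein_coding_description:" tag.
sample = [
    ">prot1 some text:protein_coding_description:kinase\n",
    "MKTAYIAKQR\n",
    ">prot2 hypothetical protein\n",
    "MSLLTEVETP\n",
]
bad = unmatched_headers(sample)
for header in bad:
    print("not recognised:", header)
```

To check a real file, the same function can be given an open file handle, for example `unmatched_headers(open("proteins.fasta"))`, where the file name is a placeholder. Any header it reports would deserve a closer look before starting a long run.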

groupschoof commented 5 years ago

Hi there!

Most definitely. If the parser does not reliably recognize where a new sequence starts in the FASTA file, all lines encountered so far are kept in memory, so memory requirements can explode.

Does it work now?

Cheers!
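The mechanism described in this reply can be illustrated with a toy record splitter. This is not AHRD's parser: `split_records`, both patterns, and the six input lines are hypothetical and only show how a header pattern that never matches makes a single buffer absorb the whole file.

```python
import re


def split_records(lines, header_pattern):
    """Group lines into records; a new record starts only at a line the pattern accepts."""
    records = []
    current = []
    for line in lines:
        if header_pattern.match(line):
            if current:
                records.append(current)
            current = [line]
        else:
            # Anything not recognised as a header is buffered in the open record.
            current.append(line)
    if current:
        records.append(current)
    return records


# Made-up input (assumption): three short sequences.
lines = [">p1 desc A", "MKT", ">p2 desc B", "MSL", ">p3 desc C", "MAA"]

# A pattern that fits the headers splits the input into small records.
good = split_records(lines, re.compile(r"^>(\S+)\s+(.+)$"))

# A pattern that expects a tag the headers lack never starts a new record.
bad = split_records(lines, re.compile(r"^>(\S+)\s+.+:no_such_tag:(.+)$"))

largest_good = max(len(record) for record in good)
largest_bad = max(len(record) for record in bad)
print(len(good), largest_good, len(bad), largest_bad)
```

With the fitting pattern the largest buffer holds one header and its sequence lines. With the mismatched pattern the only buffer grows with the size of the input, which is the kind of growth that can exhaust a heap on large files.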

From: "Jorge A. Ramírez-Tejero" notifications@github.com To: "groupschoof/AHRD" AHRD@noreply.github.com Cc: "Group Prof. Dr. Heiko Schoof" hallab@mpiz-koeln.mpg.de, "Comment" comment@noreply.github.com Sent: Sunday, 17 February, 2019 21:24:52 Subject: Re: [groupschoof/AHRD] Java memory problem: GC and Heap Speace (#16)


jrtejero commented 5 years ago

Hi Dr. Asis!

I had to stop the process because it was again consuming an excessive amount of memory without producing any output. Since I have already upgraded AHRD as you recommended, I think it is time to repeat my BLAST searches with BLAST+ version 2.7.1. That way I hope to get properly formatted FASTA headers and avoid the option that seems to be causing Java trouble (fasta_header_regex).

Just in case it helps, here are the headers I had in my last attempt:

Do you think this new approach is sound?

Thanks for your help!

Regards,

jrtejero commented 5 years ago

I am closing this issue because the GC memory problem is solved. It was caused by an incorrect Java regular expression.