mmisw / mmiorr

Unmaintained old MMI ORR system (v2) -- New development at https://github.com/mmisw/orr

Allegrograph memory problem #281

Closed mmisw closed 9 years ago

mmisw commented 9 years ago

From caru...@gmail.com on September 12, 2010 15:03:22

From time to time an error like the following is reported by the ORR portal or the Ont service:

ERROR: #<storage-condition @ #x54a3731a> An allocation request for 16777232 bytes caused tenuring and a need for 337117184 more bytes of heap. The operating system will not make the space available because of a lack of swap space or some other operating system imposed limit or memory mapping collision. in ag-access-triple-store [See report file sys:agmem0.log] More details of the error are in the ORR logs

ont.log shows the following:

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at com.franz.agbase.AllegroGraphException.mapREx(AllegroGraphException.java:90)
    at com.franz.agbase.transport.AGDirectLink.portInOp(AGDirectLink.java:1006)
    at com.franz.agbase.transport.AGDirectLink.portInOp(AGDirectLink.java:977)
    at com.franz.agbase.transport.AGDirectLink.opResIn(AGDirectLink.java:561)
    at com.franz.agbase.transport.AGDirectLink.sendOpTail(AGDirectLink.java:538)
    at com.franz.agbase.transport.AGDirectLink.sendOp1n(AGDirectLink.java:428)
    at com.franz.agbase.transport.AGDirectConnector.access(AGDirectConnector.java:132)
    at com.franz.agbase.util.AGInternals.connect(AGInternals.java:470)
    at com.franz.agbase.AllegroGraph.<init>(AllegroGraph.java:194)
    at com.franz.agbase.AllegroGraphConnection.access(AllegroGraphConnection.java:834)
    at org.mmisw.ont.graph.allegro.OntGraphAG$Ag.<init>(OntGraphAG.java:94)
    at org.mmisw.ont.graph.allegro.OntGraphAG.executeQuery(OntGraphAG.java:621)
    at org.mmisw.ont.graph.OntGraph.executeQuery(OntGraph.java:76)
    at org.mmisw.ont.sparql.SparqlDispatcher._execute(SparqlDispatcher.java:171)
    at org.mmisw.ont.sparql.SparqlDispatcher.execute(SparqlDispatcher.java:67)
    at org.mmisw.ont.UriDispatcher._dispatchUriOntologyFormat(UriDispatcher.java:106)
    at org.mmisw.ont.UriDispatcher.dispatchEntityUri(UriDispatcher.java:85)
    at org.mmisw.ont.OntServlet._dispatchUri(OntServlet.java:556)
    at org.mmisw.ont.OntServlet.dispatch(OntServlet.java:436)
    at org.mmisw.ont.OntServlet.doGet(OntServlet.java:472)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

Original issue: http://code.google.com/p/mmisw/issues/detail?id=281

mmisw commented 9 years ago

From caru...@gmail.com on September 12, 2010 22:04:02

Sometimes AG recovers by itself (i.e., after a while, a similar request succeeds), but sometimes the server needs to be restarted.

mmisw commented 9 years ago

From caru...@gmail.com on September 12, 2010 22:14:19

Labels: -Milestone-Beta1 query

mmisw commented 9 years ago

From caru...@gmail.com on September 21, 2010 10:07:21

Emailed the AG team for guidance about appropriate configuration of the AG server.

mmisw commented 9 years ago

From caru...@gmail.com on September 21, 2010 17:26:11

A complete self-contained program demonstrating the problem has been written and sent to the AG team along with a copy of the triple store. They are investigating.

Status: Started

mmisw commented 9 years ago

From caru...@gmail.com on September 22, 2010 15:02:44

I just re-created the triple store following the suggestion by AG (see below) and will continue monitoring the stability of the system.

(I have asked them whether version 4.0 of AllegroGraph has solved the related issues in version 3.3, which is the one MMI is using given the available resources).

I have put the code and a copy of the actual triple store itself in allegrograph_test.zip (151M), which I just uploaded...

The file arrived and unpacked happily. I have not yet run your program (and I need some sample queries to run AgTest.main()), but by inspecting the triple store, I can offer some initial observations.

The store appears to have 139k triples, but in fact there are over 2 million triples on disk. The huge number of deleted triples appears to be the cause of the memory overflow. The store requires 60 MB of memory just to open because many internal structures are proportional to the raw triple count, and the deletion structures are proportional to the number of deleted triples. One of the components of the deletion structure is a 16 MB array (which is the immediate cause of the storage-condition error in the stack dump you sent).

The deletion structures and algorithms used in AllegroGraph 3.3 do not scale well as the number of deleted triples grows. And in your case, where over 90% of the store is deleted triples, they are overwhelming the memory. In addition there may be some storage leaks in the system that seem to consume additional memory as the same store is repeatedly opened and closed. Note also that there are severe performance penalties; the deleted triples must all participate in indexing operations, and queries must sift the visible triples from the mass of raw data.

This situation is unlikely to be corrected by any simple patch to 3.3. I can only offer some suggestions on how to delay the onset of the problem. The best strategy would be to re-create the store when the number of deleted triples reaches some threshold (say 50% of the store). Re-creation must be done by fetching the good triples from the store and adding them to a new store. When the triples to be deleted are all added in an identifiable batch, federation could be used to delete the entire batch in one step.

When I get some sample queries, I will run the AgTest program to try to isolate the storage leak. That may help delay the problem to some extent.
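The threshold rule suggested above (re-create the store once deleted triples reach about 50% of the raw count) could be checked along these lines. This is only a sketch: `StoreHealth`, `deletedFraction`, and `needsRecreation` are hypothetical names, not part of AllegroGraph or the ORR code base, and how the visible and raw triple counts are obtained from an AG 3.3 store is left open here. The sample numbers are the approximate figures quoted in this thread.

```java
// Hypothetical helper for illustration only; not an AllegroGraph or ORR API.
final class StoreHealth {
    private StoreHealth() {}

    /** Fraction of raw (on-disk) triples that are deleted, in the range [0, 1]. */
    static double deletedFraction(long visibleTriples, long rawTriples) {
        if (visibleTriples < 0 || rawTriples < visibleTriples) {
            throw new IllegalArgumentException("expected 0 <= visible <= raw");
        }
        if (rawTriples == 0) {
            return 0.0; // an empty store has nothing deleted
        }
        return (double) (rawTriples - visibleTriples) / rawTriples;
    }

    /** True when the deleted fraction has reached the given threshold (e.g. 0.5). */
    static boolean needsRecreation(long visibleTriples, long rawTriples, double threshold) {
        return deletedFraction(visibleTriples, rawTriples) >= threshold;
    }

    public static void main(String[] args) {
        long visible = 139_000L;  // "139k triples" (approximate, from the report above)
        long raw = 2_000_000L;    // "over 2 million triples on disk" (lower bound)
        double threshold = 0.5;   // the "say 50%" suggestion; an assumed default

        System.out.printf("deleted fraction: %.3f%n", deletedFraction(visible, raw));
        System.out.println("re-create store: " + needsRecreation(visible, raw, threshold));
    }
}
```

With the figures above, the deleted fraction comes out at roughly 93%, consistent with the "over 90%" estimate, so the check would call for re-creation. The counts themselves would have to come from whatever statistics the store exposes.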

mmisw commented 9 years ago

From caru...@gmail.com on September 22, 2010 22:04:16

Lowering from critical to high priority.
The complete re-creation of the store done today is expected to make the memory allocation problem occur much less frequently, but we should keep monitoring regularly and re-create the store again when necessary.

An ideal solution would be to upgrade to AG 4.0 (which is claimed to not suffer from these bugs), but this would require a 64-bit environment, either natively or virtualized.

Status: Diagnosed
Cc: jgrayb...@ucsd.edu
Labels: -Priority-Critical Priority-High

mmisw commented 9 years ago

From jgrayb...@ucsd.edu on September 23, 2010 09:14:32

Are you sure the hardware doesn't support 64-bit? (Because the OS is capable of 64-bit, right?)

mmisw commented 9 years ago

From caru...@gmail.com on October 19, 2010 19:29:26

Yes, MMIWEB is a 64-bit machine. I forgot to mention "Linux" in my previous comment: "... but this would require a Linux 64-bit environment ..." (the only AG 4.0 version available at the moment runs natively on Linux x86-64, so, for lack of a 64-bit Linux box, a Linux virtual machine would be required).

I downloaded the 2 GB AG virtual image and a 30-day evaluation version of VMware a few days ago, but only now got a chance to try the image, after enabling VNC access to the machine. The basic set-up looks good so far. The next step is to test this triple store with all ORR ontologies. If it proves acceptable in performance and stability (compared with AG 3.3), the following step would be to start the Linux virtual machine and the AG image at boot time.

mmisw commented 9 years ago

From caru...@gmail.com on April 02, 2012 14:38:43

AllegroGraph 4.4 is proving much more stable. Marking this as fixed.

MMI Ontology and Term URI Resolver. Version 2.0.23.beta (201204021355) ORR Portal 2.0.22.beta (201203311158)

Status: Fixed