Java OOM during recovery

GoogleCodeExporter commented 9 years ago

One of the NCs failed to join the cluster, and the recovery manager was 
launched. However, it fails with an OOM error:

Nov 19, 2014 4:04:40 PM edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager uncaughtException
SEVERE: Uncaught Exception from thread main message: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at edu.uci.ics.asterix.transaction.management.service.recovery.RecoveryManager.startRecovery(RecoveryManager.java:202)
    at edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint.start(NCApplicationEntryPoint.java:90)
    at edu.uci.ics.hyracks.control.nc.NodeControllerService.startApplication(NodeControllerService.java:314)
    at edu.uci.ics.hyracks.control.nc.NodeControllerService.start(NodeControllerService.java:257)
    at edu.uci.ics.hyracks.control.nc.NCDriver.main(NCDriver.java:44)
Nov 19, 2014 4:04:40 PM edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager stopAll
SEVERE: Attempting to stop edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@10dc942b
Nov 19, 2014 4:04:40 PM edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager stopAll
SEVERE: Lifecycle management not configurededu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@10dc942b

If you had data on the cluster, how do you recover when this failure occurs?

Original issue reported on code.google.com by ker...@gmail.com on 20 Nov 2014 at 1:03

GoogleCodeExporter commented 9 years ago

Try to increase the size of the heap, and see if you still see the OOM error.

Original comment by khfaraaz82 on 20 Nov 2014 at 1:08

GoogleCodeExporter commented 9 years ago

Did you try increasing the heap size and attempting to recover again? What is 
the heap size when this happens?

Original comment by ima...@uci.edu on 20 Nov 2014 at 1:08

GoogleCodeExporter commented 9 years ago

Restarting the cluster doesn't help, as one of the NCs will still try to recover 
and fail. The user gets the following message:
 Asterix Cluster is in UNUSABLE state.
 One or more Node Controllers have left or haven't joined yet.
 [AsterixException]

Original comment by ker...@gmail.com on 20 Nov 2014 at 1:09

GoogleCodeExporter commented 9 years ago

That is somewhat expected as there is not much one can do at this point if an 
NC with a partition is unavailable. 

P.S. What is the full title of this issue? "Java OOM during recovery" or 
similar?

Original comment by ima...@uci.edu on 20 Nov 2014 at 1:11

GoogleCodeExporter commented 9 years ago

I think this should be fixed. We shouldn't hit an OOM in any situation, 
i.e., under any heap size setting. The memory component or buffer cache 
should adjust automatically to the given heap limit.
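
To illustrate what sizing from the heap limit could look like, here is a minimal sketch. The class name HeapBudget, the method fromHeapLimit, and the fractions are all hypothetical and are not taken from AsterixDB; the only real API used is Runtime.maxMemory().

```java
/**
 * Hypothetical helper (not part of AsterixDB): derives component budgets
 * from the JVM heap limit instead of using fixed sizes.
 */
final class HeapBudget {
    final long bufferCacheBytes;
    final long memoryComponentBytes;

    private HeapBudget(long bufferCacheBytes, long memoryComponentBytes) {
        this.bufferCacheBytes = bufferCacheBytes;
        this.memoryComponentBytes = memoryComponentBytes;
    }

    /**
     * Splits maxHeapBytes by the given fractions. The fractions are
     * illustrative assumptions; their sum must leave room for everything else.
     */
    static HeapBudget fromHeapLimit(long maxHeapBytes,
                                    double bufferCacheFraction,
                                    double memoryComponentFraction) {
        if (maxHeapBytes <= 0) {
            throw new IllegalArgumentException("maxHeapBytes must be positive");
        }
        if (bufferCacheFraction < 0 || memoryComponentFraction < 0
                || bufferCacheFraction + memoryComponentFraction > 1.0) {
            throw new IllegalArgumentException("fractions must be >= 0 and sum to <= 1");
        }
        return new HeapBudget((long) (maxHeapBytes * bufferCacheFraction),
                              (long) (maxHeapBytes * memoryComponentFraction));
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will try to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        HeapBudget budget = fromHeapLimit(maxHeap, 0.25, 0.25);
        System.out.println("buffer cache bytes:     " + budget.bufferCacheBytes);
        System.out.println("memory component bytes: " + budget.memoryComponentBytes);
    }
}
```

With budgets derived this way, lowering -Xmx would shrink the components rather than leave them at sizes the heap cannot hold.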

Original comment by buyingyi@gmail.com on 20 Nov 2014 at 1:14

GoogleCodeExporter commented 9 years ago

6G for the NC and 2G for the CC; these values seem fine. 

    <property>
        <name>nc.java.opts</name>
        <value>-Xmx6144m</value>
    </property>

    <property>
        <name>cc.java.opts</name>
        <value>-Xmx2048m</value>
    </property>
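
To double-check that the NC JVM really runs with the configured limit, something like the following could be used. It is only a sketch: XmxCheck and parseXmxBytes are hypothetical names, and the parser handles only the plain k/m/g suffixes. Runtime.maxMemory() may report slightly less than -Xmx depending on the garbage collector.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares a configured -Xmx value with the running JVM. */
final class XmxCheck {
    // Matches e.g. -Xmx6144m, -Xmx2g, -Xmx1048576 (plain bytes).
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([kKmMgG]?)");

    /** Returns the -Xmx value in bytes, or empty if the option is absent. */
    static OptionalLong parseXmxBytes(String javaOpts) {
        Matcher m = XMX.matcher(javaOpts);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2).toLowerCase()) {
            case "k": return OptionalLong.of(value << 10);
            case "m": return OptionalLong.of(value << 20);
            case "g": return OptionalLong.of(value << 30);
            default:  return OptionalLong.of(value);
        }
    }

    public static void main(String[] args) {
        // The option string would normally come from nc.java.opts.
        String opts = args.length > 0 ? args[0] : "-Xmx6144m";
        OptionalLong configured = parseXmxBytes(opts);
        long actual = Runtime.getRuntime().maxMemory();
        System.out.println("configured -Xmx bytes: "
                + (configured.isPresent() ? configured.getAsLong() : "not set"));
        System.out.println("JVM max heap bytes:    " + actual);
    }
}
```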

Original comment by ker...@gmail.com on 20 Nov 2014 at 1:19

GoogleCodeExporter commented 9 years ago

Original comment by ker...@gmail.com on 20 Nov 2014 at 1:20

Changed title: Java OOM during recovery

GoogleCodeExporter commented 9 years ago

This should be fixed.
In the current implementation, this situation may occur when there are many 
entity-level commits in a job.
Non-sharp (or soft) checkpoints may reduce the chance of this OOM during recovery 
(periodic checkpointing is currently not enabled), but the recovery manager should 
handle this situation by being able to spill to disk when necessary.
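
A rough sketch of the spill idea, assuming the structure that grows is a set of committed entity keys collected while scanning the log. SpillableKeySet and its methods are hypothetical and do not reflect the actual RecoveryManager code; keys are plain strings without newlines, and lookups scan the spill files linearly, where a real implementation would want sorted or indexed runs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical set that writes its contents to temp files once it holds too many keys. */
final class SpillableKeySet implements AutoCloseable {
    private final int maxInMemory;
    private final Set<String> inMemory = new HashSet<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillableKeySet(int maxInMemory) {
        if (maxInMemory < 1) {
            throw new IllegalArgumentException("maxInMemory must be >= 1");
        }
        this.maxInMemory = maxInMemory;
    }

    /** Adds a key (must not contain a newline); spills when the threshold is reached. */
    void add(String key) {
        inMemory.add(key);
        if (inMemory.size() >= maxInMemory) {
            spill();
        }
    }

    /** Checks the in-memory set first, then scans each spill file. */
    boolean contains(String key) {
        if (inMemory.contains(key)) {
            return true;
        }
        for (Path file : spillFiles) {
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.equals(key)) {
                        return true;
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return false;
    }

    int spillCount() {
        return spillFiles.size();
    }

    /** Writes all in-memory keys to a new temp file, one per line, then clears memory. */
    private void spill() {
        try {
            Path file = Files.createTempFile("recovery-spill-", ".txt");
            try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (String key : inMemory) {
                    writer.write(key);
                    writer.newLine();
                }
            }
            spillFiles.add(file);
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the spill files. */
    @Override
    public void close() {
        try {
            for (Path file : spillFiles) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            spillFiles.clear();
            inMemory.clear();
        }
    }
}
```

The point is that heap usage is bounded by maxInMemory keys no matter how many entity-level commits the log contains, at the cost of slower lookups once spilling starts.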

Original comment by kiss...@gmail.com on 20 Nov 2014 at 1:28

namhnguyen / asterixdb

Java OOM during recovery #826