Managix can not start Asterix instance on cluster

GoogleCodeExporter commented 9 years ago

Using latest Asterix on cluster, Managix is unable to start an already existing 
(and loaded) instance.
CC and NC processes are up, but asterix is unusable. This issue is observed 
both on asterix cluster and rainbow cluster.
It seems that it could be a race condition in CC start-up and the time NCs join 
the cluster. On rainbow cluster, a cluster restart could solve the problem (at 
least start-up attempt reported to be successful after reboot). On Asterix 
cluster, even reboot could not solve the issue. 

On the logs front:

1) On NC logs, it seems they are stuck in start-up process:

Apr 21, 2014 3:58:43 PM edu.uci.ics.hyracks.control.nc.NodeControllerService 
start
INFO: Starting NodeControllerService
Apr 21, 2014 3:58:43 PM 
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller  TAKE NOTE: cs10_asterix-1

2) On CC logs, configurations params seem to be lost (the following messages 
keeps repeating for all NCs on all nodes):

4/04/21 15:58:44 INFO util.log: Started SelectChannelConnector@0.0.0.0:19003 
STARTING
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-9 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-8 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-7 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-6 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-5 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-4 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
WARNING: Configuration parameters for nodeId cs10_asterix-3 not found. The node 
has not joined yet or has left.
Apr 21, 2014 3:58:44 PM edu.uci.ics.asterix.om.util.AsterixClusterProperties 
getNumberOfIODevices
....

Original issue reported on code.google.com by pouria.p...@gmail.com on 21 Apr 2014 at 11:14

GoogleCodeExporter commented 9 years ago

Original comment by pouria.p...@gmail.com on 21 Apr 2014 at 11:15

GoogleCodeExporter commented 9 years ago

Changing log.level in "asterix-configuration.xml" from "WARNING" back to the 
default one "INFO" made start-up successful.  But starting asterix with 
"WARNING" log level always fails.

(Full CC/NC logs are attached)

Original comment by pouria.p...@gmail.com on 22 Apr 2014 at 1:03

Attachments:

GoogleCodeExporter commented 9 years ago

From the log reports, it appears that all NC JVMs are running but not all of 
them have registered.
An NC stuck/hung during startup would not register with the CC. This would 
result in an UNUSABLE state as we require all NCs to register before the 
AsterixDB instance can accept any DDL/DML. 

The root cause lies in NC entering a hung state in executing the start up 
logic. 
Stay tuned as I investigate this

Original comment by RamanGro...@gmail.com on 22 Apr 2014 at 9:14

GoogleCodeExporter commented 9 years ago

Okay - anxiously staying tuned. :-)

Original comment by dtab...@gmail.com on 22 Apr 2014 at 10:17

GoogleCodeExporter commented 9 years ago

Here is jstack, dumped after hanging, for CC, one of the NCs, and the NC on 
metadata node - (in case it helps).

Original comment by pouria.p...@gmail.com on 22 Apr 2014 at 11:52

Attachments:

GoogleCodeExporter commented 9 years ago

Young-Seouk,

As reported in the issue, the cluster enters an UNUSABLE state as all NCs have 
not registered with the CC. From the logs of these NCs, it appears that that 
the NCs are stuck...

Apr 21, 2014 4:20:22 PM edu.uci.ics.hyracks.control.nc.NCDriver main

SEVERE: Setting uncaught exception handler 
edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@3bcdae24
Apr 21, 2014 4:20:22 PM edu.uci.ics.hyracks.control.nc.NodeControllerService 
start
INFO: Starting NodeControllerService
Apr 21, 2014 4:20:22 PM 
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller  TAKE NOTE: cs10_asterix-1

 <----End of Logs ----> 

From the source code, a critical log message that is missing is the following:-

        // #. recover if the system is corrupted by checking system state.
        IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
        systemState = recoveryMgr.getSystemState();

        if (LOGGER.isLoggable(Level.INFO)) {
            LOGGER.info("System is in a state: " + systemState);
        }

We expect the system to be able to ascertain its "state". However the NC is 
unable to do so. 
I am assigning this issue to you to investigate any possible bug in determining 
the state of the system that could  leave the NC in a hung state.

Original comment by RamanGro...@gmail.com on 27 Apr 2014 at 5:15

GoogleCodeExporter commented 9 years ago

According to the attached log file, cs10_asterix-1.log and the source code of 
NCApplicationEntryPoint.start() method, NC didn't even pass the line, 
"runtimeContext.initialize()".
The reason is that once the runtimeContext.initialize() is called and returned 
successfully, the nc log must include messages related to the transaction log 
file initialization since runtimeContext.initialize() creates the 
TransactionSubsystem instance that creates LogManager instance that writes to 
the nc log file log messages that indicate where the available transaction log 
file is created.  

Thus, I reset the owner of this issue to Raman. 

------------- source code of NCApplicationEntryPoint.start() method 
---------------
public void start(INCApplicationContext ncAppCtx, String[] args) throws 
Exception {
        ncAppCtx.setThreadFactory(new AsterixThreadFactory(ncAppCtx.getLifeCycleComponentManager()));
        ncApplicationContext = ncAppCtx;
        nodeId = ncApplicationContext.getNodeId();
        if (LOGGER.isLoggable(Level.INFO)) {
            LOGGER.info("Starting Asterix node controller  TAKE NOTE: " + nodeId);
        }

        runtimeContext = new AsterixAppRuntimeContext(ncApplicationContext);
        AsterixMetadataProperties metadataProperties = ((IAsterixPropertiesProvider) runtimeContext)
                .getMetadataProperties();
        if (!metadataProperties.getNodeNames().contains(ncApplicationContext.getNodeId())) {
            if (LOGGER.isLoggable(Level.INFO)) {
                LOGGER.info("Substitute node joining : " + ncApplicationContext.getNodeId());
            }
            updateOnNodeJoin();
        }
        runtimeContext.initialize();
        ncApplicationContext.setApplicationObject(runtimeContext);

        // #. recover if the system is corrupted by checking system state.
        IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
        systemState = recoveryMgr.getSystemState();

        if (LOGGER.isLoggable(Level.INFO)) {
            LOGGER.info("System is in a state: " + systemState);
        }

Original comment by kiss...@gmail.com on 28 Apr 2014 at 3:11

GoogleCodeExporter commented 9 years ago

Young-Seok,

are you sure that those log messages should be there? Pouria mentioned in his 
second comment that this problem only appears, if the logging level is set to 
WARNING. So a number of log messages that are usually there might not be there. 
E.g. the logging in LogManager.initializeLogAnchor seems to happen at INFO 
level, so we would not see that.

Original comment by westm...@gmail.com on 29 Apr 2014 at 9:12

GoogleCodeExporter commented 9 years ago

If you take a look at the cs10_asterix-1.log, the log includes the following 
message which is written in INFO level. So, since it was INFO level, the log 
file should have the log file initialization messages. 

----------------------
Apr 21, 2014 4:20:22 PM 
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller  TAKE NOTE: cs10_asterix-1
----------------------

Original comment by kiss...@gmail.com on 29 Apr 2014 at 9:34

GoogleCodeExporter commented 9 years ago

YS,
As you pointed out that the method runtimeContext.initialize() did not return 
successfully (evident from absence of log messages), it becomes important to 
investigate why this method never returns. This method initializes the buffer 
cache, log manager and the transaction sub-system. 
The sequence of log messages we expect during a normal start up are as follows

=========================
edu.uci.ics.asterix.transaction.management.service.logging.LogManager
initializeLogAnchor
INFO: log file Id: 3, offset: 0
May 02, 2014 1:48:20 AM
edu.uci.ics.asterix.transaction.management.service.logging.LogManager
initializeLogManager
INFO: LogManager starts logging in LSN: 6442450944
May 02, 2014 1:48:20 AM
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
============================

Following the source code, the method (runtimeContext.initialize()) constructs 
a bunch of objects and invokes logManager.initialize. Subsequent to 
initialization, the log manager outputs the starting LSN.
We do not see this message. The steps prior to the initialization of LogManager 
basically construct POJOs and seemed harmless.  Please have a look

Original comment by RamanGro...@gmail.com on 2 May 2014 at 4:13

GoogleCodeExporter commented 9 years ago

I give this Raman back again as I mentioned in the meeting. :)

Original comment by kiss...@gmail.com on 2 May 2014 at 5:45

GoogleCodeExporter commented 9 years ago

YS, for tracking purpose, please update the bug  with your comments.
something that is more helpful than "what I mentioned in meeting" :)

Original comment by RamanGro...@gmail.com on 2 May 2014 at 5:56

GoogleCodeExporter commented 9 years ago

According to Raman's request, I checked source code related to the NC 
initialization and the log file that Pouria attached here. According to the log 
file, the system didn't even go through the transaction module initialization 
and must be failed somewhere in between the lines "        if 
(LOGGER.isLoggable(Level.INFO)) {
            LOGGER.info("Starting Asterix node controller  TAKE NOTE: " + nodeId);
        }" and "ncApplicationContext.setApplicationObject(runtimeContext);" of source code shown below. 

So, this situation didn't even involve recovery operation, which means that the 
recovery is not the cause of this issue situation.  Thus, I give this back to 
Raman.

Original comment by kiss...@gmail.com on 2 May 2014 at 6:34

GoogleCodeExporter commented 9 years ago

The source code is shown below. 

        if (LOGGER.isLoggable(Level.INFO)) {
            LOGGER.info("Starting Asterix node controller  TAKE NOTE: " + nodeId);
        }

        runtimeContext = new AsterixAppRuntimeContext(ncApplicationContext);
        AsterixMetadataProperties metadataProperties = ((IAsterixPropertiesProvider) runtimeContext)
                .getMetadataProperties();
        if (!metadataProperties.getNodeNames().contains(ncApplicationContext.getNodeId())) {
            if (LOGGER.isLoggable(Level.INFO)) {
                LOGGER.info("Substitute node joining : " + ncApplicationContext.getNodeId());
            }
            updateOnNodeJoin();
        }
        runtimeContext.initialize();
        ncApplicationContext.setApplicationObject(runtimeContext);

        // #. recover if the system is corrupted by checking system state.
        IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
        systemState = recoveryMgr.getSystemState();

Original comment by kiss...@gmail.com on 2 May 2014 at 6:35

GoogleCodeExporter commented 9 years ago

I keep hitting this issue again, which simply has blocked me completely to run 
performance tests. I have attached the logs, which this time show some 
exceptions. It seems that RegisterNodeWork() has some problems, although NC 
processes are up when verified via jps.

Original comment by pouria.p...@gmail.com on 19 May 2014 at 11:38

Attachments:

log_Mon_May_19_16:33:53_PDT_2014.zip

GoogleCodeExporter commented 9 years ago

On executing, CC executes the application (Asterix) startup logic which is the 
start() method call inside CCApplicationEntryPoint. 
It is essential that the start up logic is completed prior to CC accepting any 
other requests that includes node registration. Only when CC service has 
started (marked by log message "Started Cluster Controller Service" should any 
other requests be accepted. Currently this is not the case. 

Two cases have been observed because of this.
a) NPE: This occurs when CC's start up logic hasn't executed far enough to 
initialized the CCApplicationContext. Node registration requires this object to 
extract further information and hits a NPE which is communicated to back to NC 
and surfaces in NC logs. 

b) (Apparently) hung state with no exceptions: This occurs when CC's start up 
logic hasn't executed far enough to register cluster event listeners, but has 
gone past (a) to initialize the required objects. So you do not get any 
exceptions. However, because listener were not registered at the time of node 
registration, it misses the critical registration event and believes the 
node(s) have not joined in yet. 

Fix: (a) and (b) are just two different forms of 758. Till's fix to queue up 
all requests to CC until it is up (has completed application.start()) would fix 
758. Let us plan to get Till's fix merged to master (pls send a review 
request). 
I'd suggest lets push the change to Pouria's branch to unblock him.

Original comment by RamanGro...@gmail.com on 22 May 2014 at 9:22

GoogleCodeExporter commented 9 years ago

@Till/Pouria: Why does the NC--and not the CC--throw the NPE? Seems that only 
the CC could ever throw this exception, unless I'm missing something?

java.lang.NullPointerException
    at edu.uci.ics.hyracks.control.cc.work.RegisterNodeWork.doRun(RegisterNodeWork.java:74)
    at edu.uci.ics.hyracks.control.common.work.SynchronizableWork.run(SynchronizableWork.java:32)
    at edu.uci.ics.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:116)

Original comment by zheilb...@gmail.com on 24 May 2014 at 6:27

GoogleCodeExporter commented 9 years ago

Zach, although the logic, RegisterNodeWork is executed at CC, the exception is 
caught and transported back to the NC as a "registrationException". The 
following method is invoked.  
 private synchronized void setNodeRegistrationResult(NodeParameters parameters, Exception exception) {
        this.nodeParameters = parameters;
        this.registrationException = exception;
        this.registrationPending = false;
        notifyAll();
    }

It is just that the trace is printed at NC that makes that exception appear to 
have originated at NC.

Original comment by RamanGro...@gmail.com on 24 May 2014 at 1:43

GoogleCodeExporter commented 9 years ago

I suspected that was the case and even looked for it, but seems I missed this 
line of code:

try {
  // ...
} catch () {
  result = new CCNCFunctions.NodeRegistrationResult(null, e);
}

In any case, you're right.

Original comment by zheilb...@gmail.com on 24 May 2014 at 7:30

GoogleCodeExporter commented 9 years ago

A fix has been merged to master.

Please validate that the problem does not happen anymore.

Original comment by westm...@gmail.com on 27 May 2014 at 3:37

Changed state: Fixed

lwhay / asterixdb

Managix can not start Asterix instance on cluster #758