Closed GoogleCodeExporter closed 9 years ago
Original comment by pouria.p...@gmail.com
on 21 Apr 2014 at 11:15
Changing log.level in "asterix-configuration.xml" from "WARNING" back to the
default one "INFO" made start-up successful. But starting asterix with
"WARNING" log level always fails.
(Full CC/NC logs are attached)
Original comment by pouria.p...@gmail.com
on 22 Apr 2014 at 1:03
Attachments:
From the log reports, it appears that all NC JVMs are running but not all of
them have registered.
An NC stuck/hung during startup would not register with the CC. This would
result in an UNUSABLE state as we require all NCs to register before the
AsterixDB instance can accept any DDL/DML.
The root cause lies in NC entering a hung state in executing the start up
logic.
Stay tuned as I investigate this
Original comment by RamanGro...@gmail.com
on 22 Apr 2014 at 9:14
Okay - anxiously staying tuned. :-)
Original comment by dtab...@gmail.com
on 22 Apr 2014 at 10:17
Here is jstack, dumped after hanging, for CC, one of the NCs, and the NC on
metadata node - (in case it helps).
Original comment by pouria.p...@gmail.com
on 22 Apr 2014 at 11:52
Attachments:
Young-Seouk,
As reported in the issue, the cluster enters an UNUSABLE state as all NCs have
not registered with the CC. From the logs of these NCs, it appears that that
the NCs are stuck...
Apr 21, 2014 4:20:22 PM edu.uci.ics.hyracks.control.nc.NCDriver main
SEVERE: Setting uncaught exception handler
edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@3bcdae24
Apr 21, 2014 4:20:22 PM edu.uci.ics.hyracks.control.nc.NodeControllerService
start
INFO: Starting NodeControllerService
Apr 21, 2014 4:20:22 PM
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller TAKE NOTE: cs10_asterix-1
<----End of Logs ---->
From the source code, a critical log message that is missing is the following:-
// #. recover if the system is corrupted by checking system state.
IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
systemState = recoveryMgr.getSystemState();
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("System is in a state: " + systemState);
}
We expect the system to be able to ascertain its "state". However the NC is
unable to do so.
I am assigning this issue to you to investigate any possible bug in determining
the state of the system that could leave the NC in a hung state.
Original comment by RamanGro...@gmail.com
on 27 Apr 2014 at 5:15
According to the attached log file, cs10_asterix-1.log and the source code of
NCApplicationEntryPoint.start() method, NC didn't even pass the line,
"runtimeContext.initialize()".
The reason is that once the runtimeContext.initialize() is called and returned
successfully, the nc log must include messages related to the transaction log
file initialization since runtimeContext.initialize() creates the
TransactionSubsystem instance that creates LogManager instance that writes to
the nc log file log messages that indicate where the available transaction log
file is created.
Thus, I reset the owner of this issue to Raman.
------------- source code of NCApplicationEntryPoint.start() method
---------------
public void start(INCApplicationContext ncAppCtx, String[] args) throws
Exception {
ncAppCtx.setThreadFactory(new AsterixThreadFactory(ncAppCtx.getLifeCycleComponentManager()));
ncApplicationContext = ncAppCtx;
nodeId = ncApplicationContext.getNodeId();
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("Starting Asterix node controller TAKE NOTE: " + nodeId);
}
runtimeContext = new AsterixAppRuntimeContext(ncApplicationContext);
AsterixMetadataProperties metadataProperties = ((IAsterixPropertiesProvider) runtimeContext)
.getMetadataProperties();
if (!metadataProperties.getNodeNames().contains(ncApplicationContext.getNodeId())) {
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("Substitute node joining : " + ncApplicationContext.getNodeId());
}
updateOnNodeJoin();
}
runtimeContext.initialize();
ncApplicationContext.setApplicationObject(runtimeContext);
// #. recover if the system is corrupted by checking system state.
IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
systemState = recoveryMgr.getSystemState();
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("System is in a state: " + systemState);
}
Original comment by kiss...@gmail.com
on 28 Apr 2014 at 3:11
Young-Seok,
are you sure that those log messages should be there? Pouria mentioned in his
second comment that this problem only appears, if the logging level is set to
WARNING. So a number of log messages that are usually there might not be there.
E.g. the logging in LogManager.initializeLogAnchor seems to happen at INFO
level, so we would not see that.
Original comment by westm...@gmail.com
on 29 Apr 2014 at 9:12
If you take a look at the cs10_asterix-1.log, the log includes the following
message which is written in INFO level. So, since it was INFO level, the log
file should have the log file initialization messages.
----------------------
Apr 21, 2014 4:20:22 PM
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller TAKE NOTE: cs10_asterix-1
----------------------
Original comment by kiss...@gmail.com
on 29 Apr 2014 at 9:34
YS,
As you pointed out that the method runtimeContext.initialize() did not return
successfully (evident from absence of log messages), it becomes important to
investigate why this method never returns. This method initializes the buffer
cache, log manager and the transaction sub-system.
The sequence of log messages we expect during a normal start up are as follows
=========================
edu.uci.ics.asterix.transaction.management.service.logging.LogManager
initializeLogAnchor
INFO: log file Id: 3, offset: 0
May 02, 2014 1:48:20 AM
edu.uci.ics.asterix.transaction.management.service.logging.LogManager
initializeLogManager
INFO: LogManager starts logging in LSN: 6442450944
May 02, 2014 1:48:20 AM
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
============================
Following the source code, the method (runtimeContext.initialize()) constructs
a bunch of objects and invokes logManager.initialize. Subsequent to
initialization, the log manager outputs the starting LSN.
We do not see this message. The steps prior to the initialization of LogManager
basically construct POJOs and seemed harmless. Please have a look
Original comment by RamanGro...@gmail.com
on 2 May 2014 at 4:13
I give this Raman back again as I mentioned in the meeting. :)
Original comment by kiss...@gmail.com
on 2 May 2014 at 5:45
YS, for tracking purpose, please update the bug with your comments.
something that is more helpful than "what I mentioned in meeting" :)
Original comment by RamanGro...@gmail.com
on 2 May 2014 at 5:56
According to Raman's request, I checked source code related to the NC
initialization and the log file that Pouria attached here. According to the log
file, the system didn't even go through the transaction module initialization
and must be failed somewhere in between the lines " if
(LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("Starting Asterix node controller TAKE NOTE: " + nodeId);
}" and "ncApplicationContext.setApplicationObject(runtimeContext);" of source code shown below.
So, this situation didn't even involve recovery operation, which means that the
recovery is not the cause of this issue situation. Thus, I give this back to
Raman.
Original comment by kiss...@gmail.com
on 2 May 2014 at 6:34
The source code is shown below.
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("Starting Asterix node controller TAKE NOTE: " + nodeId);
}
runtimeContext = new AsterixAppRuntimeContext(ncApplicationContext);
AsterixMetadataProperties metadataProperties = ((IAsterixPropertiesProvider) runtimeContext)
.getMetadataProperties();
if (!metadataProperties.getNodeNames().contains(ncApplicationContext.getNodeId())) {
if (LOGGER.isLoggable(Level.INFO)) {
LOGGER.info("Substitute node joining : " + ncApplicationContext.getNodeId());
}
updateOnNodeJoin();
}
runtimeContext.initialize();
ncApplicationContext.setApplicationObject(runtimeContext);
// #. recover if the system is corrupted by checking system state.
IRecoveryManager recoveryMgr = runtimeContext.getTransactionSubsystem().getRecoveryManager();
systemState = recoveryMgr.getSystemState();
Original comment by kiss...@gmail.com
on 2 May 2014 at 6:35
I keep hitting this issue again, which simply has blocked me completely to run
performance tests. I have attached the logs, which this time show some
exceptions. It seems that RegisterNodeWork() has some problems, although NC
processes are up when verified via jps.
Original comment by pouria.p...@gmail.com
on 19 May 2014 at 11:38
Attachments:
On executing, CC executes the application (Asterix) startup logic which is the
start() method call inside CCApplicationEntryPoint.
It is essential that the start up logic is completed prior to CC accepting any
other requests that includes node registration. Only when CC service has
started (marked by log message "Started Cluster Controller Service" should any
other requests be accepted. Currently this is not the case.
Two cases have been observed because of this.
a) NPE: This occurs when CC's start up logic hasn't executed far enough to
initialized the CCApplicationContext. Node registration requires this object to
extract further information and hits a NPE which is communicated to back to NC
and surfaces in NC logs.
b) (Apparently) hung state with no exceptions: This occurs when CC's start up
logic hasn't executed far enough to register cluster event listeners, but has
gone past (a) to initialize the required objects. So you do not get any
exceptions. However, because listener were not registered at the time of node
registration, it misses the critical registration event and believes the
node(s) have not joined in yet.
Fix: (a) and (b) are just two different forms of 758. Till's fix to queue up
all requests to CC until it is up (has completed application.start()) would fix
758. Let us plan to get Till's fix merged to master (pls send a review
request).
I'd suggest lets push the change to Pouria's branch to unblock him.
Original comment by RamanGro...@gmail.com
on 22 May 2014 at 9:22
@Till/Pouria: Why does the NC--and not the CC--throw the NPE? Seems that only
the CC could ever throw this exception, unless I'm missing something?
java.lang.NullPointerException
at edu.uci.ics.hyracks.control.cc.work.RegisterNodeWork.doRun(RegisterNodeWork.java:74)
at edu.uci.ics.hyracks.control.common.work.SynchronizableWork.run(SynchronizableWork.java:32)
at edu.uci.ics.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:116)
Original comment by zheilb...@gmail.com
on 24 May 2014 at 6:27
Zach, although the logic, RegisterNodeWork is executed at CC, the exception is
caught and transported back to the NC as a "registrationException". The
following method is invoked.
private synchronized void setNodeRegistrationResult(NodeParameters parameters, Exception exception) {
this.nodeParameters = parameters;
this.registrationException = exception;
this.registrationPending = false;
notifyAll();
}
It is just that the trace is printed at NC that makes that exception appear to
have originated at NC.
Original comment by RamanGro...@gmail.com
on 24 May 2014 at 1:43
I suspected that was the case and even looked for it, but seems I missed this
line of code:
try {
// ...
} catch () {
result = new CCNCFunctions.NodeRegistrationResult(null, e);
}
In any case, you're right.
Original comment by zheilb...@gmail.com
on 24 May 2014 at 7:30
A fix has been merged to master.
Please validate that the problem does not happen anymore.
Original comment by westm...@gmail.com
on 27 May 2014 at 3:37
Original issue reported on code.google.com by
pouria.p...@gmail.com
on 21 Apr 2014 at 11:14