Some interesting observations from the log:
a) CC.log: "Address already in use"
Indicates that a port is already occupied.
b) CC.log: Starts with the exception listed in (a). The startup message "INFO: Starting ClusterControllerService:" is not found.
c) NC.log: java.lang.Exception: Node with this name already registered.
Indicates a previous successful attempt at registering the node.
Khurram: a few questions:
(1) Is this reproducible (I assume this is sporadic)?
(2) When creating the new instance, did we already have another MANAGIX_HOME set on your machine that had been used to launch an asterix instance earlier and was not shut down properly? I am looking for information that helps explain "Address already in use".
Original comment by RamanGro...@gmail.com
on 26 Nov 2014 at 4:35
If you take a closer look at the steps I listed to repro the problem, you will see there are two attempts where I try to create an instance with the same name (managix create -n inst1). The instance creation in my first attempt did not succeed, and my second attempt to create the instance with the same name succeeded. Note that I do configure and validate the instance between the two create commands.
Here is the output from jps, which tells us there is only one CC and one NC
running. I did not kill any other CC/NC process before I tried to create the
instance.
khurram@khurram:~$ jps
3792 CCDriver
8147 Jps
3156 QuorumPeerMain
3908 NCDriver
To answer your question about it being reproducible: I am trying the same steps now; we will know if the same problem shows up.
Original comment by khfaraaz82
on 26 Nov 2014 at 7:46
Khurram, I did notice that you issued two create instance commands.
However, you did not share the output from the first attempt (that is, the attempt prior to you configuring Managix). Please share that output with us.
"Here is the output from jps, which tells us there is only one CC and one NC
running. I did not kill any other CC/NC process before I tried to create the
instance."
Did you verify whether there were any processes lurking around prior to your second attempt?
"To answer your question about it being reproducible, I am trying the same
steps now, we will know if the same problem shows up."
By same steps, do you mean making two attempts to create an instance?
Original comment by RamanGro...@gmail.com
on 26 Nov 2014 at 7:51
I did stop and delete the instance, and tried to recreate the instance with the same name as before. After the stop+delete, I see the correct message: Asterix instance by name inst1 already exists. So what could have caused the NPE that we saw in my first attempt? (Like you said, this appears to be sporadic, but is there a way to handle this scenario so we avoid seeing the NPE?)
khurram@khurram:~/uci/asterixdb$ export MANAGIX_HOME="/home/khurram/uci/asterixdb/asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly"
khurram@khurram:~/uci/asterixdb$ export PATH=$PATH:$MANAGIX_HOME/bin
khurram@khurram:~/uci/asterixdb$ export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
khurram@khurram:~/uci/asterixdb$ export PATH=$JAVA_HOME/bin:$PATH
khurram@khurram:~/uci/asterixdb$ find . -name "local.xml"
./asterix-installer/src/main/resources/clusters/local/local.xml
./asterix-installer/target/classes/clusters/local/local.xml
./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/clusters/local/local.xml
khurram@khurram:~/uci/asterixdb$ ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/bin/managix create -n inst1 -c ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/clusters/local/local.xml
INFO: Name:inst1
Created:Tue Nov 25 23:55:48 PST 2014
Web-Url:http://127.0.0.1:19001
State:ACTIVE
khurram@khurram:~/uci/asterixdb$ ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/bin/managix configure
khurram@khurram:~/uci/asterixdb$ ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/bin/managix validate
INFO: Environment [OK]
INFO: Managix Configuration [OK]
khurram@khurram:~/uci/asterixdb$ ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/bin/managix create -n inst1 -c ./asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/clusters/local/local.xml
ERROR: Asterix instance by name inst1 already exists.
Original comment by khfaraaz82
on 26 Nov 2014 at 8:00
In the case where you faced the NPE, you made two attempts at creating the instance.
Attempt 1: managix create ....
(Since the output you shared did not contain any response to this command, I assume that this command was killed by a CTRL-C: please confirm.)
Next you configured Managix and made another attempt.
Attempt 2: managix create ....
This time you got an ACTIVE instance, but the Metadata query failed.
Please confirm if I am interpreting your actions (from the logs) correctly.
Original comment by RamanGro...@gmail.com
on 26 Nov 2014 at 8:14
Yes, in my first attempt to create the instance I remember having used CTRL-C (because I thought I was running the wrong command, which wasn't the case). Your assumption is right.
Original comment by khfaraaz82
on 26 Nov 2014 at 8:18
Thanks for confirming that my interpretation was correct.
Here is what exactly happened... (I was able to deterministically reproduce the scenario in consecutive attempts, which confirmed the hypothesis.)
Step 1: Submit the create command:
managix create -n <name of instance>
time ticking....
Step 2: CTRL-C
Abruptly end the Managix create command while it is midway through launching the processes (CC and NC).
At the time of hitting CTRL-C, the processes have already been launched, are occupying ports, and are writing logs.
NOTE: Before the abrupt termination the processes were up, but as Managix was terminated, it could not finish executing the create command and record with itself the fact that the instance had been created and was up and running.
Final state: Processes are running; the instance is up! But Managix doesn't know, as it was killed (CTRL-C) abruptly.
Step 3: Submit another create command.
Managix checks if the instance already exists. An uninterrupted create command puts all the instance information into Managix's Zookeeper, but since the first create command was itself killed abruptly, this critical step (of recording the information) could not complete. So Managix goes ahead and attempts to create another instance (a sketch of this check follows below).
Now, in the second attempt, the CC is launched (again), finds the port in use (from the previous attempt), and dies. The NC starts up, tries to register with the CC, and complains - "already registered" (remember, the previous NC did that too) - and this shows up as the exception in the NC log. Note that the second NC writes to the same log file as the previous NC.
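To make the bookkeeping gap concrete, here is a small self-contained sketch of the create-path guard (the class, field, and method names are my assumptions for illustration, not the actual Managix source):

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of Managix's guard against double-creation.
public class CreateGuardSketch {
    // Stand-in for the instance records Managix keeps in Zookeeper.
    private final Set<String> recordedInstances = new HashSet<>();

    public void create(String instanceName) throws Exception {
        // An uninterrupted create writes the instance record as its final
        // step; this lookup is the only guard against double-creation.
        if (recordedInstances.contains(instanceName)) {
            throw new Exception("Asterix instance by name " + instanceName + " already exists.");
        }
        // If the previous create was killed (CTRL-C) before the record was
        // written, control reaches here and a second CC/NC launch is
        // attempted against ports that are still occupied.
        System.out.println("launching CC and NC for " + instanceName);
        recordedInstances.add(instanceName); // recorded only on an uninterrupted run
    }
}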
I was able to reproduce the scenario.
Here are the exceptions... (which are exactly the same as reported in the
original description of the issue)
INFO: Registered Runtime Functions
Nov 26, 2014 1:59:05 PM edu.uci.ics.hyracks.control.cc.ClusterControllerService start
INFO: Started ClusterControllerService
java.net.BindException: Address already in use <========= Look here!
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:344)
at sun.nio.ch.Net.bind(Net.java:336)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
at edu.uci.ics.hyracks.ipc.impl.IPCConnectionManager.<init>(IPCConnectionManager.java:67)
at edu.uci.ics.hyracks.ipc.impl.IPCSystem.<init>(IPCSystem.java:40)
=======
Nov 26, 2014 2:04:40 PM edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository initialize
INFO: The resource id factory is intialized with the value: 16
Nov 26, 2014 2:04:40 PM edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository initialize
INFO: Completed the initialization of the local resource repository
java.lang.Exception: Node with this name already registered. <========== Look here!
at edu.uci.ics.hyracks.control.cc.work.RegisterNodeWork.doRun(RegisterNodeWork.java:58)
at edu.uci.ics.hyracks.control.common.work.SynchronizableWork.run(SynchronizableWork.java:32)
at edu.uci.ics.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:122)
REMOVED transaction context for JOB ID JID:3 at time 1416990888033
ALL THIS IS FINE, but WHY NPE?
To understand the reason for the NPE, let us go back to the exception:
INFO: Completed the initialization of the local resource repository
java.lang.Exception: Node with this name already registered.
When the NC from the second attempt correctly hits the above exception (node already registered), the registration work still goes ahead and incorrectly notifies the CC of the node's presence, providing its configuration.
Look inside RegisterNodeWork.java, method signature: protected void doRun() throws Exception.
The pseudocode there is shown below.
Map<String, String> ncConfiguration = null;
try {
    // try to register the NC
    // an exception is thrown here:
    // java.lang.Exception: Node with this name already registered.
    ...
    // never reached, so the configuration is never set
    ncConfiguration = ...; // initialize NC configuration
} catch (Exception e) {
    // take note of the exception
}
// executed unconditionally, even when registration failed above
ccs.getApplicationContext().notifyNodeJoin(id, ncConfiguration);
In the above line of code, the parameter ncConfiguration is NULL, as it could not be set because of the exception.
Now the CC is told of a node join, and it overwrites the NC configuration with NULL.
When the end user submits a query, the configuration parameters for the Metadata node (the only node in this case) are returned as NULL and boom!
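A tiny self-contained illustration of that failure mode (the map-based lookup is an assumption made for illustration; the actual Metadata code path differs):

import java.util.HashMap;
import java.util.Map;

public class NullConfigNpe {
    public static void main(String[] args) {
        // Per-node configurations, as the CC would track them.
        Map<String, Map<String, String>> configsByNode = new HashMap<>();
        // The failed registration still "joins" and stores a NULL config.
        configsByNode.put("nc1", null);
        // Any later parameter lookup dereferences the null map: NPE.
        String value = configsByNode.get("nc1").get("some.parameter");
        System.out.println(value);
    }
}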
FIX:
RegisterNodeWork should not notify the CC of a node join if the registration attempt has failed; a sketch of that guard is below.
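Here is a minimal sketch of that guard, written against the pseudocode above (the actual change to RegisterNodeWork may differ in detail):

Map<String, String> ncConfiguration = null;
boolean registered = false;
try {
    // try to register the NC; throws if a node with this name already exists
    ...
    ncConfiguration = ...; // initialize NC configuration
    registered = true;
} catch (Exception e) {
    // take note of the exception
}
if (registered) {
    // notify the CC only on a successful registration, so notifyNodeJoin
    // is never handed a NULL configuration
    ccs.getApplicationContext().notifyNodeJoin(id, ncConfiguration);
}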
But there is a deeper issue here.
What if the user had used a different asterix-configuration.xml in the second attempt? The asterix instance that is finally obtained is from the first attempt, which has the old configuration. Should we handle all scenarios where an end user hits CTRL-C and kills Managix abruptly?
Original comment by RamanGro...@gmail.com
on 26 Nov 2014 at 9:44
The answer to your question, "Should we handle all scenarios where an end-user hits CTRL-C and kills Managix abruptly?", is yes, we should.
Anytime the CC/NCs go down for whatever reason (power failure, CTRL-C from the user, or anything else), such scenarios should be handled.
Do we have any crash recovery tests, where we bring the CC/NC down, get them back up again, and try to query the Metadata datasets or any other internal datasets after the system has recovered from a crash?
Original comment by khfaraaz82
on 27 Nov 2014 at 1:50
"Anytime the CC/NC's go down due to what ever reasons (which could be, due to
power failure, CTRL-C from user or any other reasons), such scenarios should be
handled."
I am concerned the above case if very "different" from what is being discussed
in this issue. So let us not mix things.
An NC going down abruptly (CTRL-C or power failure) is handled as part of our recovery mechanism.
The CC is the single point of failure in AsterixDB. It going down is not handled currently.
Note that the above processes have a longer lifetime and are prone to failures, and thus need to be protected. So let us not confuse CC/NC recovery with this issue.
Managix being killed abruptly:
There are two cases:
(i) Power failure: In this case issue 830 does not arise, as the power failure would of course kill all processes and would not leave any live processes when a second attempt (after power is restored) is made to create the instance.
(ii) CTRL-C:
A CTRL-C after launching the create command, at a time when both processes (CC and NC) have launched, causes this issue. This is being fixed.
What I meant to ask was very different and I am afraid has not been interpreted
correctly. Let me rephrase...
Should we have crash/recovery support for Managix commands?
Original comment by RamanGro...@gmail.com
on 27 Nov 2014 at 3:05
Yes, crash/recovery support for Managix commands would help. That support will help with crash/recovery tests.
Original comment by khfaraaz82
on 28 Nov 2014 at 7:42
Raman, do you have the branch name where this is being fixed? Let me know and I can verify. Thanks.
Original comment by khfaraaz82
on 11 Dec 2014 at 3:50
Original issue reported on code.google.com by
khfaraaz82
on 26 Nov 2014 at 3:39