ilovesoup / asterixdb

Automatically exported from code.google.com/p/asterixdb

Need to clean up AsterixDB behavior for early adopters with MacBooks #375

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

 - Run AsterixDB on a Mac laptop using managix and the Web UI
 - Let it go to sleep overnight

What you get is one of:

(1) Trying to talk to it through the UI hangs, but stopping and restarting 
resolves it
- or -
(2) Stopping and restarting doesn't resolve it and you are hosed:
managix stop -n my_asterix
INFO: Stopped Asterix instance: my_asterix

==> managix start -n my_asterix
INFO: Name:my_asterix
Created:Sun Apr 21 00:45:11 PDT 2013
Web-Url:http://127.0.0.1:19001
State:UNUSABLE

WARNING!:Cluster Controller not running at master

Issues here are:
(1) We *will* have users doing this even though it's not our design point
(2) This makes for a clunky out of box experience, I worry
(3) Having UNUSABLE is bad and it's not clear how to work out of that one

Original issue reported on code.google.com by dtab...@gmail.com on 21 Apr 2013 at 2:56

GoogleCodeExporter commented 9 years ago
I tried to reproduce the scenario.
I started Asterix using Managix, opened the web interface, and ran a query. The 
instance was ACTIVE and the web interface was open when I did the following:
Attempt 1: Manually put the MacBook to sleep, 
then wake it the usual way. 

Attempt 2: Set the sleep timer to the minimum (1 min) and wait for the MacBook 
to go to sleep, 
then wake it as before.

In either case, I was not able to reproduce the scenario. The web-interface was 
accessible and I could continue executing queries. 

Mike, is there something I am missing? Alternatively, I could get access to your 
system when we next meet and try to understand the difference in behavior. 

Original comment by RamanGro...@gmail.com on 21 Apr 2013 at 5:31

GoogleCodeExporter commented 9 years ago
Here are the logs for case (1): stopping and restarting worked, but the UI hung without that.

Original comment by dtab...@gmail.com on 21 Apr 2013 at 9:30

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by dtab...@gmail.com on 21 Apr 2013 at 9:30

Attachments:

GoogleCodeExporter commented 9 years ago
Got the logs this time we hope - here is the new CC log

Original comment by dtab...@gmail.com on 22 Apr 2013 at 7:53

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 24 May 2013 at 8:38

GoogleCodeExporter commented 9 years ago
Here is the status AFAMIC when the system has changed networks:

==> managix describe -n my_asterix
INFO: Name:my_asterix
Created:Fri May 31 13:37:21 PDT 2013
Web-Url:http://127.0.0.1:19001
State:ACTIVE (Fri May 31 16:01:47 PDT 2013)

So - UNUSABLE is not the norm. :-)
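
For anyone scripting around this, here is a minimal sketch that pulls the State field out of describe-style output. It assumes the output format pasted above and that managix writes it to standard output. `instance_state` is a hypothetical helper name, not part of managix.

```shell
# Hypothetical helper (not part of managix): read `managix describe`-style
# output on stdin and print the value of the first "State:" line,
# e.g. ACTIVE, INACTIVE or UNUSABLE. Prints nothing if no State line exists.
instance_state() {
    sed -n 's/^State:\([A-Z]*\).*/\1/p' | head -n 1
}
```

Usage would be something like `managix describe -n my_asterix | instance_state`, which should print only the state name without the trailing timestamp.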

Original comment by dtab...@gmail.com on 31 May 2013 at 11:27

GoogleCodeExporter commented 9 years ago

Original comment by vinay...@gmail.com on 31 May 2013 at 11:40

GoogleCodeExporter commented 9 years ago
As discussed, marking it as 'Won't fix' but would re-open this if the issue is 
seen again. 

Original comment by RamanGro...@gmail.com on 16 Nov 2013 at 2:54

GoogleCodeExporter commented 9 years ago
I've seen this "UNUSABLE state" problem after rebooting my machine. The CC log 
does not have anything useful: its latest entry is from January 20th, which is 
also my last reboot date, and there is nothing from the following 7 days. 
Can you check whether you can reproduce this problem by rebooting your 
machine while the web interface is accessible?

Btw, running "managix shutdown" changed the status to "INACTIVE", after which I 
could use "managix start" to restart my instance. It might be a good idea to add 
this troubleshooting step to the FAQ.
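
As a possible FAQ entry, the sequence that worked for me can be sketched as below. This is only a sketch of what I ran, not an official recovery procedure. The `recover_instance` function and the `MANAGIX` override variable are hypothetical names introduced here so the steps can be dry-run; the only managix commands used are `shutdown` and `start -n <name>`.

```shell
# Hypothetical wrapper around the two commands that recovered my instance.
# MANAGIX is an assumption of this sketch (not a managix feature): set
# MANAGIX=echo to print the commands instead of running them.
recover_instance() {
    name="$1"
    if [ -z "$name" ]; then
        echo "usage: recover_instance <instance-name>" >&2
        return 2
    fi
    # Step 1: shut down the managix background service.
    "${MANAGIX:-managix}" shutdown || return 1
    # Step 2: start the named instance again.
    "${MANAGIX:-managix}" start -n "$name"
}
```

With `MANAGIX=echo` set, `recover_instance my_asterix` only prints the two commands, which is a cheap way to check the order before running it for real.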

Original comment by icetin...@gmail.com on 27 Jan 2014 at 11:44

GoogleCodeExporter commented 9 years ago

managix shutdown simply shuts down the back-end ZooKeeper service gracefully and 
is useful when one is not using AsterixDB and does not want any daemon management 
process to continue running in the background.  It does not touch an AsterixDB 
instance. 

After the shutdown command, the start command internally translates to a) start 
the back-end ZooKeeper, as it's not running, and b) start the Asterix instance. 
Step (a) allows ZooKeeper to run local recovery based on its own logs and 
ensures that all data is consistent. Once ZooKeeper's state has been recovered 
and an Asterix instance is started, no issues are observed in updating and 
reading the state maintained in ZooKeeper.

I will try to replicate what you observed. 

Original comment by ram...@uci.edu on 28 Jan 2014 at 5:16

GoogleCodeExporter commented 9 years ago
Hello, I tried to follow the steps on 
http://asterix.ics.uci.edu/documentation/install.html and after the step 
managix create -n my_asterix -c $MANAGIX_HOME/clusters/local/local.xml

I get:

INFO: Name:my_asterix
Created:Sat Apr 26 23:06:08 EDT 2014
Web-Url:http://127.0.0.1:19001
State:UNUSABLE

WARNING!:Cluster Controller not running at master

Original comment by getajo...@gmail.com on 27 Apr 2014 at 3:14

GoogleCodeExporter commented 9 years ago

Original comment by dtab...@gmail.com on 27 Apr 2014 at 4:15

GoogleCodeExporter commented 9 years ago
I assume the validation step (managix validate -c <path to cluster 
configuration xml>) was successful. 
I am missing some critical information needed to ascertain the reason for the 
failure. Can you please provide the logs? 
If you have not changed the auto-generated local.xml file, the logs 
can be found at
$MANAGIX_HOME/clusters/local/working_dir/logs/ 

It would be helpful if we can set up a Skype session (skype id: raman-grover). 
It should not take long to look at the environment and figure out the cause. 
Please let me know your availability (morning or evening slots are preferred as 
I am currently in the Indian Time Zone (PDT + 12:30)).  

Original comment by ram...@uci.edu on 27 Apr 2014 at 5:02

GoogleCodeExporter commented 9 years ago
I feel like the answer may be trivial (hopefully).  Unfortunately I'm a lowly 
CS student who doesn't understand much about the problem or AsterixDB in 
general.

Full context: I was first having some issues with my Java version (as things 
were defaulting to the Apple Java 1.6 version).  After I got that fixed I was 
getting this error.  I ended up deleting everything in my asterix_mgmt 
directory and retrying with a fresh download, so the logs will not give the 
full story.

Anyway, I'm available most times tomorrow for a Skype call.  I'm on the Eastern 
Time Zone (PDT + 3:00)

Original comment by getajo...@gmail.com on 27 Apr 2014 at 3:30

Attachments:

GoogleCodeExporter commented 9 years ago
From your CC logs:
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:414)
    at sun.nio.ch.Net.bind(Net.java:406)

The above error suggests that a process is already occupying 
port 1098. 

From your NC logs:
INFO: Completed sharp checkpoint.
java.lang.Exception: Node with this name already registered.
    at edu.uci.ics.hyracks.control.cc.work.RegisterNodeWork.doRun(RegisterNodeWork.java:58)
    at edu.uci.ics.hyracks.control.common.work.SynchronizableWork.run(SynchronizableWork.java:32)
    at edu.uci.ics.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:116)

Your NC logs show that the NC was able to connect to a CC process, but 
the CC had already received a ping from an NC with the same name. 

Looking at the logs from the CC and NC, it seems that you already 
have processes running. These were daemons started as part of an initial 
attempt to create an Asterix instance. Did you stop the previous instance 
before wiping out MANAGIX_HOME and attempting a re-install? 
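
To make the two symptoms above easier to spot in future logs, here is a rough sketch that scans log text for the two exception messages quoted in this comment. `classify_log` and the labels it prints are hypothetical names made up for this example, and it only knows about these two messages.

```shell
# Hypothetical helper: read a CC or NC log on stdin and print a short label
# for each of the two failure signatures quoted in this comment.
# Returns 0 if at least one signature was found, 1 otherwise.
classify_log() {
    log=$(cat)
    found=1
    case "$log" in
        *"java.net.BindException: Address already in use"*)
            echo "port-in-use"       # another process holds the CC port
            found=0 ;;
    esac
    case "$log" in
        *"Node with this name already registered"*)
            echo "duplicate-node"    # an NC with the same name already registered
            found=0 ;;
    esac
    return "$found"
}
```

It would be run as `classify_log < path/to/some.log`, where the path is a placeholder for a file from your logs directory. If it reports `port-in-use`, then on a Mac `lsof -nP -iTCP:1098 -sTCP:LISTEN` should show which process is holding the port.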

Executing jps at the command prompt will list the running Java processes 
for your account. 
The list should not include CCDriver or NCDriver before you attempt to create an 
instance. Please confirm.
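
If it helps, the check can be sketched as a small filter over jps output. This assumes the default jps format of one `<pid> <main class name>` pair per line. `leftover_driver_pids` is a hypothetical helper name, not something shipped with AsterixDB.

```shell
# Hypothetical helper: read `jps` output on stdin and print the process id
# of every CCDriver or NCDriver entry, one per line.
# Prints nothing when the system is clean.
leftover_driver_pids() {
    awk '$2 == "CCDriver" || $2 == "NCDriver" { print $1 }'
}
```

Running `jps | leftover_driver_pids` should print nothing before you create an instance. Any process id it does print belongs to a leftover daemon from an earlier attempt.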

Original comment by RamanGro...@gmail.com on 27 Apr 2014 at 4:02

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
My memory may be failing me.  If I remember correctly, I attempted to managix 
stop my_asterix before reinstalling, but the system wouldn't allow me to 
because it was UNUSABLE.  However, that may have been the message when I tried 
to start?  Anyway...

Now that I'm at the command prompt again, I can officially managix stop 
my_asterix.  After doing so, jps lists CCDriver (but not NCDriver).

Original comment by getajo...@gmail.com on 27 Apr 2014 at 4:35

GoogleCodeExporter commented 9 years ago
managix stop -n <name of instance> is a valid command in the UNUSABLE state. 
It transforms the instance to INACTIVE state after terminating whichever 
daemons are alive.

As you reported, jps is showing a CCDriver process. Please terminate 
it using kill -9 <process id>, 
where the process id is shown in the output of the jps command. 

Once you have a clean system (jps does not show CCDriver/NCDriver), please 
go ahead and retry. 
Also if possible, ping me on skype (raman-grover) and I can have a live 
(support) session.  

Original comment by RamanGro...@gmail.com on 27 Apr 2014 at 4:52

GoogleCodeExporter commented 9 years ago
I can get on Skype in about an hour.  In the meantime:

working_dir$ jps
2488 Jps

working_dir$ managix create -n my_asterix -c 
/Users/cameronbasham/s/databases/o/asterix-mgmt/clusters/local/local.xml 
INFO: Name:my_asterix
Created:Sun Apr 27 13:37:04 EDT 2014
Web-Url:http://127.0.0.1:19001
State:UNUSABLE

WARNING!:Cluster Controller not running at master
Node Controller not running at the following nodes
127.0.0.1

Original comment by getajo...@gmail.com on 27 Apr 2014 at 5:39

Attachments:

GoogleCodeExporter commented 9 years ago
From the logs, I probably know what's happening here. I'll wait for you to be 
online...

Original comment by RamanGro...@gmail.com on 27 Apr 2014 at 6:49

GoogleCodeExporter commented 9 years ago
I thought I added you, but I've yet to get a reply.  I'm getajob92 on skype.

Original comment by getajo...@gmail.com on 27 Apr 2014 at 6:52