stardog-union / stardog-graviton

Stardog Graviton has been deprecated. Please see our helm charts for similar functionality on Kubernetes.
http://www.stardog.com
Apache License 2.0

Healthcheck failure, Stardog never converges #69

Open earthquakesan opened 6 years ago

earthquakesan commented 6 years ago

I have run:

stardog-graviton launch stardog-cloud

After the volumes and VMs were initialized, the launch timed out waiting for Stardog to come up.

When I run the status command:

 % stardog-graviton status stardog-cloud
The instance is not healthy
Stardog is available here: http://bla.amazonaws.com:5821
Stardog is internally available here: http://internal-bla.amazonaws.com:5821
ssh is available here: bla.amazonaws.com
Failed: Failed do the post Get http://bla.amazonaws.com:5821/admin/cluster: dial tcp 52.29.xxx.xxx:5821: i/o timeout
Please check the log files:
    /home/ivan/.graviton/logs/graviton.log
    /home/ivan/.graviton/deployments/stardog-cloud/logs/graviton.log

I can connect to the instance via SSH. The 3 ZK, 3 Stardog, and 1 bastion nodes are all running and visible in the AWS console.
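The `dial tcp ... i/o timeout` in the status output means the TCP connection to port 5821 never completed, which can be checked independently of graviton. The following is a minimal sketch of such a check; the function name `port_is_open` and the 5-second default timeout are illustrative assumptions, not part of stardog-graviton.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not part of stardog-graviton.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Calling `port_is_open("bla.amazonaws.com", 5821)` with the real load balancer hostname in place of the placeholder would show whether the port is reachable from the local machine. A `False` result only says the connection failed; it does not distinguish a closed firewall from a Stardog process that never started.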

pdmars commented 6 years ago

This can be caused by a number of different issues. Usually the first thing to check is the Stardog logs themselves, to see whether the nodes are unable to start for some reason. Are you able to get the logs of the Stardog nodes (stardog-graviton logs) to see if there are any errors in them?

earthquakesan commented 6 years ago

@pdmars it gives me a logs/ folder with an empty graviton.log file.

earthquakesan commented 6 years ago

I have opened another issue for the logs command.

I have ssh'ed into one of the Stardog nodes and looked at the logs directly. There is a problem with the license (I have passed it on to the responsible colleagues). It would be good if graviton could catch and report this directly instead of only doing healthchecks, e.g. by parsing the logs and telling the user that the cluster could not come up because the license has expired.
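As a rough illustration of the log parsing suggested above, the sketch below filters log lines that mention a license. It assumes the relevant Stardog log lines contain the word "license" (or "licence"); the actual log format is not shown in this thread, so the pattern and the function name `find_license_lines` are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: license problems are logged with the word "license"/"licence".
LICENSE_PATTERN = re.compile(r"licen[sc]e", re.IGNORECASE)


def find_license_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a license, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if LICENSE_PATTERN.search(line)]
```

The function could be applied to an open log file copied from a node, for example `find_license_lines(open("stardog.log"))`, where the file name is a placeholder. A real implementation in graviton would likely need more specific patterns to avoid flagging harmless lines.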

earthquakesan commented 6 years ago

@pdmars hi Paul!

I've got a new license, but Graviton still fails to start up properly. The command I use is:

stardog-graviton launch --volume-size 25 --env="STARDOG_JAVA_ARGS=\"-Xms12g -Xmx12g -XX:MaxDirectMemorySize=12g\"" --sd-instance-type="r4.large" stardog-cloud

The stdout is:

Creating the new deployment stardog-cloud
- Initializing terraform......
- Calling out to terraform to create the volumes...
- Calling out to terraform to stop builder instances...
Successfully created the volumes.
- Initializing terraform...
- Creating the instance VMs......
Successfully created the instance.
Waiting for stardog to come up...
The instance is healthy
\ Opening the firewall......
Successfully opened up the instance.
Failed: Timed out waiting for all the cluster nodes
Please check the log files:
    /home/ivan/.graviton/logs/graviton.log
    /home/ivan/.graviton/deployments/stardog-cloud/logs/graviton.log

As far as I can see from the logs, Stardog is up and running. What could be the problem with the launch command, and how can it be solved?

earthquakesan commented 6 years ago

OK, the license I have right now is not valid for the cluster version, so two out of the three Stardog nodes will not come up. That's the reason for the script failure.

P.S. Again, I had to ssh into all the instances and look through the logs to see what was going on. Please propagate all the logs to the user when the launch times out. That would have saved me a lot of time.

pdmars commented 6 years ago

Good to hear you figured out the issue. We'll work on improving the error message and propagating the reason; however, depending on the nature of the failure, that can be a bit tricky. If you are able to ssh into all of the nodes without providing a password, are you able to use the logs command to gather up all the Stardog logs?