paz-sh / paz

An open-source, in-house service platform with a PaaS-like workflow, built on Docker, CoreOS, Etcd and Fleet. This repository houses the documentation and installation scripts.
http://paz.sh

installation #2

Closed pgte closed 9 years ago

pgte commented 9 years ago

I'm going to do an installation and follow the docs to the letter instead of guessing, so that onboarding new developers gets easier.

pgte commented 9 years ago

Ran into this problem when doing ./scripts/install-vagrant.sh:

Installing Paz on Vagrant
Please install etcdctl. Aborting.
lukebond commented 9 years ago

Perfect, will update the README.
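
Something like this would go in the README's prerequisites. A sketch, with package names assumed per platform:

$ brew install etcd      # macOS via Homebrew; the etcd package ships etcdctl
$ sudo pacman -S etcd    # Arch Linux

Alternatively, etcdctl binaries can be downloaded from the etcd releases page on GitHub.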

pgte commented 9 years ago

More progress: it now reports 2 failed units. Here is the tail of the output:

Starting paz runlevel 1 units
+ fleetctl -strict-host-key-checking=false start unitfiles/1/paz-orchestrator-announce.service unitfiles/1/paz-orchestrator.service unitfiles/1/paz-scheduler-announce.service unitfiles/1/paz-scheduler.service unitfiles/1/paz-service-directory-announce.service unitfiles/1/paz-service-directory.service
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
Unit paz-service-directory.service launched
Unit paz-orchestrator.service launched
Unit paz-scheduler.service launched on bef73231.../172.17.8.101
Unit paz-scheduler-announce.service launched on bef73231.../172.17.8.101
Unit paz-orchestrator-announce.service launched on 09938dfe.../172.17.8.102
Unit paz-service-directory-announce.service launched on f37795e5.../172.17.8.103
+ echo Successfully started all runlevel 1 paz units on the cluster with Fleet
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 2 | Active: 2 | Failed: 2...
Failed unit detected

Any hints on how to debug this?

lukebond commented 9 years ago

Some debugging tips:

Which units are failing?

$ fleetctl --endpoint=http://172.17.8.101:4001 list-units
UNIT                                    MACHINE                     ACTIVE      SUB
paz-orchestrator-announce.service       4e4038bb.../172.17.8.103    inactive    dead
paz-orchestrator.service                4e4038bb.../172.17.8.103    failed      failed
paz-scheduler-announce.service          7a70d1e8.../172.17.8.101    inactive    dead
paz-scheduler.service                   7a70d1e8.../172.17.8.101    failed      failed
paz-service-directory-announce.service  43049642.../172.17.8.102    inactive    dead
paz-service-directory.service           43049642.../172.17.8.102    failed      failed

Viewing the logs of a failed service:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator

(add -f to follow logs)

SSH into the machine:

$ cd coreos-vagrant
$ vagrant ssh core-0[1,2,3]

View system logs (after SSHing):

$ journalctl
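
View logs for a single unit (after SSHing):

$ journalctl -u paz-orchestrator.service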

I'm getting the same issue you are at the moment, so I'll be spending time debugging it this weekend. Using the alpha channel of CoreOS means things sometimes change between releases.

lukebond commented 9 years ago

Another tip: When viewing the journal for a service, if you see an HTTP 403 from Docker then check your quay.io credential environment variables as described in the README.

lukebond commented 9 years ago

@pgte try again now, it's working for me after making a few fixes.

pgte commented 9 years ago

A bit more progress, but it's still failing for me. From the log it looks like I may need access to some quay.io repos:

$ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 10:36:02 UTC, end at Wed 2015-02-11 10:37:56 UTC. --
Feb 11 10:36:38 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 10:36:38 core-02 docker[993]: WARNING: Invalid auth configuration file
Feb 11 10:36:41 core-02 docker[993]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 10:36:43 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 10:36:43 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 10:36:43 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 10:36:43 core-02 docker[993]: time="2015-02-11T10:36:43Z" level="fatal" msg="HTTP code: 403"
lukebond commented 9 years ago

A 403 suggests missing or incorrect quay.io credentials. The installation section of the README has a recent addition stating that the installer can now read credentials from your ~/.dockercfg file. Do docker login https://quay.io, enter your quay.io credentials, and then try the installation again. It should take your creds from ~/.dockercfg and put them on each VM.
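
For example (the prompt output below is illustrative; Docker at the time stored credentials in ~/.dockercfg):

$ docker login https://quay.io
Username: <your quay.io username>
Password:
Email: <your email>
Login Succeeded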

pgte commented 9 years ago

Downloaded .dockercfg from quay.io and installed it in ~/.dockercfg.

→ cat /Users/pedroteixeira/.dockercfg
{
 "quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}

Looks OK. But now, when I run the installation script, I get:

→ scripts/install-vagrant.sh
Installing Paz on Vagrant
Attempt to autoload Docker config from /Users/pedroteixeira/.dockercfg FAILED
You must set the $DOCKER_AUTH environment variable
lukebond commented 9 years ago

The registry key "quay.io" needs to be "https://quay.io" at the moment. I'll open an issue for this, as it's too brittle; it should work with or without the protocol.
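
In other words, your ~/.dockercfg should look like this for now:

{
 "https://quay.io": {
  "auth": "XXX",
  "email": "i@pgte.me"
 }
}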

lukebond commented 9 years ago

Created issue #7 for this.

pgte commented 9 years ago

That fixed the reading of the file. Also, I was getting the 403 because I didn't belong to the quay.io org (GitHub org membership doesn't carry over to quay.io). Perhaps document this fact somewhere?

pgte commented 9 years ago

Hmmm... now I get a 500. Here is the log for the orchestrator:

→ fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal paz-orchestrator
####################################################################
WARNING: fleetctl (0.8.3) is older than the latest registered
version of fleet found in the cluster (0.9.0). You are strongly
recommended to upgrade fleetctl to prevent incompatibility issues.
####################################################################
-- Logs begin at Wed 2015-02-11 13:00:00 UTC, end at Wed 2015-02-11 13:01:43 UTC. --
Feb 11 13:00:41 core-02 docker[1054]: time="2015-02-11T13:00:41Z" level="fatal" msg="HTTP code: 500"
Feb 11 13:00:41 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:41 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:41 core-02 systemd[1]: Starting paz-orchestrator: Main API for all paz services and monitor of services in etcd....
Feb 11 13:00:45 core-02 docker[1105]: Pulling repository quay.io/yldio/paz-orchestrator
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service: control process exited, code=exited status=1
Feb 11 13:00:46 core-02 systemd[1]: Failed to start paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 11 13:00:46 core-02 systemd[1]: Unit paz-orchestrator.service entered failed state.
Feb 11 13:00:46 core-02 systemd[1]: paz-orchestrator.service failed.
Feb 11 13:00:46 core-02 docker[1105]: time="2015-02-11T13:00:46Z" level="fatal" msg="HTTP code: 500"
lukebond commented 9 years ago

Hmm, not very enlightening. Could you post some logs from the host around that time using journalctl, please?
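
Something like this should capture the relevant window (timestamps taken from your log above):

$ cd coreos-vagrant
$ vagrant ssh core-02
$ journalctl --since "2015-02-11 13:00:00" --until "2015-02-11 13:05:00"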

lukebond commented 9 years ago

Any luck with this, @pgte? Can you confirm whether you were running the integration test script or install-vagrant?

Confirmed working on ArchLinux \o/

No9 commented 9 years ago

Had a dive into this over the weekend. I ran into an issue where I was getting timeouts when logging into the quay.io server when running:

$ sudo docker login https://quay.io

FATA[0036] Error response from daemon: v1 ping attempt failed with error: Get https://quay.io/v1/ping: dial tcp: i/o timeout

The number in FATA[0036] may vary. quay.io confirmed this as a problem on their side with Route 53.

The workaround was to put an entry into /etc/hosts after finding out where quay.io resolved to. N.B. ping is blocked, so I used wget:

$ wget quay.io
--2015-03-02 23:57:25--  http://quay.io/
Resolving quay.io (quay.io)... 184.73.156.14, 50.17.243.21, 54.243.34.28, ...
Connecting to quay.io (quay.io)|184.73.156.14|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://quay.io/ [following]

So I put the entry

184.73.156.14 quay.io

into my /etc/hosts file, and login worked fine.
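
For anyone else doing this, one way to append the entry (assuming sudo access):

$ echo "184.73.156.14 quay.io" | sudo tee -a /etc/hosts

Bear in mind the IP is just whatever quay.io resolved to at the time, so treat this as a temporary workaround.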

tomgco commented 9 years ago

Hey @No9, thanks for having a look and reporting this. It shouldn't be necessary to log into quay.io any more; however, we are also looking to deploy to https://registry.hub.docker.com (issue #23).

I tried to replicate your login issue, but it was successful for me. If anyone else runs into this, we can add a notice to the README.

No9 commented 9 years ago

Thanks @tomgco. FYI, I think this is the line that was printing the message if you weren't logged into quay.io: https://github.com/yldio/paz/blob/master/scripts/helpers.sh#L9

lukebond commented 9 years ago

Looks like it's time to just remove all that Docker auth stuff from the installation process. It's probably silently working for those of us who still have the credentials in our ~/.dockercfg and failing for those who don't. It's no longer needed since the Docker repos are now public and won't become private again.

Created #27

twilson63 commented 9 years ago

Thanks for Paz; looking forward to playing with it. I tried to install via Vagrant:

How long should Paz take to install via the Vagrant install script?

Starting paz runlevel 1 units
Unit paz-scheduler.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator-announce.service launched on 23965b52.../172.17.8.103
Unit paz-service-directory.service launched on f441edc7.../172.17.8.101
Unit paz-service-directory-announce.service launched on f441edc7.../172.17.8.101
Unit paz-scheduler-announce.service launched on 257b40cd.../172.17.8.102
Unit paz-orchestrator.service launched on 23965b52.../172.17.8.103
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 6 | Active: 0 | Failed: 0...

Any ideas, what I might be doing wrong?

lukebond commented 9 years ago

@twilson63 thanks for taking it for a spin!

There is no error in what you're seeing here, but the next step will take a while. It has started the units on the cluster, but "starting" involves pulling the Docker images before running them. The base images (usually Ubuntu) are quite big. If the units are evenly distributed across the cluster by Fleet, each host in your cluster will be pulling the same base images, which isn't ideal and takes time.
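
If you want to watch progress or warm the cache, you can SSH into a host and pull an image manually, e.g. (image name taken from the logs earlier in this thread):

$ cd coreos-vagrant
$ vagrant ssh core-01
$ docker pull quay.io/yldio/paz-orchestrator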

The Activating/Active/Failed line comes from running grep and awk over the output of fleetctl list-units against your cluster. Once all the units say "active" it will be finished.
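
For reference, the polling is roughly equivalent to this shell sketch (illustrative only, not the actual installer script; the endpoint is assumed to be your first host):

ENDPOINT=http://172.17.8.101:4001
while true; do
  # skip the header row of fleetctl's table output
  UNITS=$(fleetctl --endpoint=$ENDPOINT list-units | tail -n +2)
  # the ACTIVE state is the third column
  ACTIVATING=$(echo "$UNITS" | awk '$3 == "activating"' | wc -l)
  ACTIVE=$(echo "$UNITS" | awk '$3 == "active"' | wc -l)
  FAILED=$(echo "$UNITS" | awk '$3 == "failed"' | wc -l)
  echo "Activating: $ACTIVATING | Active: $ACTIVE | Failed: $FAILED..."
  [ "$FAILED" -gt 0 ] && { echo "Failed unit detected"; break; }
  [ "$ACTIVATING" -eq 0 ] && break
  sleep 5
done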

If anything goes wrong at this point please use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 list-units to see what has failed, and use fleetctl -strict-host-key-checking=false --endpoint=http://172.17.8.101:4001 journal <SERVICENAME> to see the logs for a given service.

twilson63 commented 9 years ago

Great! I think everything is running, but I can't seem to access any of the IP addresses. I have very limited experience with Vagrant; once everything is up, should I be able to access the web service by opening a browser at http://172.17.8.101/?

Thanks for the help!

lukebond commented 9 years ago

If you've done the /etc/hosts step you should be able to hit the Web UI at http://paz-web.paz

The services are all exposed on random ports by Docker, so there's nothing on port 80 but HAProxy, which is configured to check for the service you want (the prefix in front of .paz) and forward the request on to the right service. (If you're interested, it also does a similar thing internally, forwarding purely by service name, e.g. "paz-scheduler".) Since "paz-web.paz" doesn't route anywhere on the internet, you need the /etc/hosts hack. I appreciate that none of this is obvious at the moment given the current state of the docs.
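
For example, an entry like this in /etc/hosts (the IP is one of your cluster hosts; check the README for the canonical list of paz hostnames):

172.17.8.101 paz-web.paz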

twilson63 commented 9 years ago

Cool, I think I fubared something. I will try again.

lukebond commented 9 years ago

A lot has changed since this issue was opened, and it now spans several different issues from a few different people. Going to close it; please open new issues with any updates. Thanks all for the contributions!