paz-sh / paz

An open-source, in-house service platform with a PaaS-like workflow, built on Docker, CoreOS, Etcd and Fleet. This repository houses the documentation and installation scripts.
http://paz.sh
Other
1.08k stars 56 forks source link

Slow announcer for orchestrator #15

Open tomgco opened 9 years ago

tomgco commented 9 years ago
$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator-announce.service
Feb 26 17:38:59 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:00 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:01 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:02 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:03 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:04 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:05 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:06 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:07 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:08 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory

However this timed out and then:

Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service start-pre operation timed out. Terminating.
Feb 26 17:41:31 core-01 systemd[1]: Failed to start paz-orchestrator announce.
Feb 26 17:41:31 core-01 systemd[1]: Unit paz-orchestrator-announce.service entered failed state.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service failed.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service holdoff time over, scheduling restart.
Feb 26 17:41:31 core-01 systemd[1]: Stopping paz-orchestrator announce...
Feb 26 17:41:31 core-01 systemd[1]: Starting paz-orchestrator announce...
Feb 26 17:41:31 core-01 sh[26833]: Waiting for 49153/tcp...
Feb 26 17:41:31 core-01 systemd[1]: Started paz-orchestrator announce.
Feb 26 17:41:31 core-01 sh[26857]: Connected to 172.17.8.101:49153/tcp and 172.17.8.101:49154, publishing to etcd..

However the service seemed to be up and running:

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator.service
-- Logs begin at Thu 2015-02-26 16:40:27 UTC, end at Thu 2015-02-26 17:44:04 UTC. --
Feb 26 17:05:35 core-01 systemd[1]: Started paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 26 17:05:37 core-01 bash[16821]: {}
Feb 26 17:05:37 core-01 bash[16821]: { disabled: 'true',
Feb 26 17:05:37 core-01 bash[16821]: provider: 'dnsimple',
Feb 26 17:05:37 core-01 bash[16821]: email: 'test@example.com',
Feb 26 17:05:37 core-01 bash[16821]: apiKey: '312487532487',
Feb 26 17:05:37 core-01 bash[16821]: domain: 'paz' }
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"Starting server","time":"2015-02-26T17:05:37.887Z","src":{"file":"/usr/src/app/server.js","line":205},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"paz-orchestrator is now running on port 9000","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":194},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"svcdir-url":"http://paz-service-directory.paz","msg":"","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":195},"v":0}
tomgco commented 9 years ago

Here is the start of the announcer:

Feb 26 16:41:31 core-01 systemd[1]: Starting paz-orchestrator announce...
Feb 26 17:05:36 core-01 sh[1195]: Waiting for 49153 HostIp:0.0.0.0/tcp...
Feb 26 17:05:36 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory

So it means that the following code is causing the above error message: me!

  until netstat -lnt | grep :$port >/dev/null; \
      do sleep 1; \
  done"
tomgco commented 9 years ago

So all I can find is that we have a discrepancy when trying to sed for the bound ports when using docker inspect.

How it should look:

Waiting for 49153/tcp...

What it is currently:

Waiting for 49153 HostIp:0.0.0.0/tcp...

My guess is something funky is going on with $port / a race condition in the bootstrapping of the container. https://github.com/yldio/paz/blob/d53997470d5263a2334a16b3835380a8d849dd22/unitfiles/1/paz-orchestrator-announce.service#L19-L21

  port=$(docker inspect -f '{{ index .NetworkSettings.Ports \"9000/tcp\"}}' paz-orchestrator \
    | sed 's/.*Port://' \
    | sed 's/].+*//'); \

Maybe @sublimino might be able to offer some input into robustifying the shell code to bail out if this condition occurs.

lukebond commented 9 years ago

@tomgco thanks for all the details! i'll be taking a look at this on the weekend.

lukebond commented 9 years ago

I've seen this before when units fail to start up correctly, probably, as you identify @tomgco, when things don't start in the right order. This is an issue in and of itself, but one I haven't any insight into atm.

In the meantime, if this bash code can be made more robust then perhaps we may not see it again.

Got any ideas, @sublimino?

sublimino commented 9 years ago

I think adding dependencies to the units and leveraging systemd's native ordering is the first port of call, and I'll have a glance at the BASH scripts when I'm at a computer tomorrow. On 7 Mar 2015 19:39, "Luke Bond" notifications@github.com wrote:

I've seen this before when units fail to start up correctly, probably, as you identify @tomgco https://github.com/tomgco, when things don't start in the right order. This is an issue in and of itself, but one I haven't any insight into atm.

In the meantime, if this bash code can be made more robust then perhaps we may not see it again.

Got any ideas, @sublimino https://github.com/sublimino?

— Reply to this email directly or view it on GitHub https://github.com/paz-sh/paz/issues/15#issuecomment-77706016.

lukebond commented 9 years ago

:+1:

lukebond commented 9 years ago

above @sublimino is referring to #30 btw

tomgco commented 9 years ago

This cropped up again when provisioning a machine on Digital Ocean on paz-orchestrator-announce.service, maybe something is missing from the unit file?

tomgco commented 9 years ago

A fleetctl stop and start on paz-orchestrator-announce.service fixed this.

lukebond commented 9 years ago

Yeah I've seen this I've seen this quite a bit this weekend :/