Closed: ryansch closed this issue 8 years ago
Also the ecs-agent service isn't even created under v0.5.0 for some reason. I've given up on troubleshooting after the level of brokenness I've encountered. I've managed to boot my cluster once out of 5 attempts. It works every time on v0.4.5.
I'm pretty sure the issue here is the io.rancher.os.after=wait-for-network label. The wait-for-network service was removed in v0.5.0 and, from what I recall, adding a dependency on a non-existent service can have effects similar to what you've described. Would you mind trying again after changing the labels to io.rancher.os.after=network?
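For illustration, the suggested change might look roughly like the fragment below. The service name and image are placeholders, not taken from the templates in this thread; only the label value is the point here.

```yaml
#cloud-config
rancher:
  services:
    example-service:              # hypothetical service name
      image: example/image:tag    # hypothetical image
      labels:
        # Old value, which points at a service that no longer exists in v0.5.0:
        # io.rancher.os.after: wait-for-network
        io.rancher.os.after: network
```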
Based on the conversation in #1104 I've moved all of my services to user docker. I've also stopped relying on subnets by setting up instance-local dns. Additionally I found a couple subtle bugs in my service setup that v0.4.5 just ignored but v0.5.0 choked on. As of this morning I have a fully booted cluster on v0.5.0.
@joshwget Since I'm running in user docker I removed the dependency on wait-for-network.
For posterity this is what I ended up with:
master:
#cloud-config
write_files:
  - path: /opt/rancher/bin/start.sh
    permissions: "0700"
    owner: root:root
    content: |
      #!/bin/bash
      INSTANCE_IP=$(wget -O - -q http://169.254.169.254/latest/meta-data/local-ipv4)
      iptables -t nat -A PREROUTING -p tcp -d 127.0.0.1 --dport 8400 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A PREROUTING -p tcp -d 127.0.0.1 --dport 8500 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A OUTPUT -o lo -p tcp -m tcp --dport 8400 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A OUTPUT -o lo -p tcp -m tcp --dport 8500 -j DNAT --to $${INSTANCE_IP}
rancher:
  cloud_init:
    datasources:
      - ec2
  registry_auths:
    https://index.docker.io/v1/:
      auth: ${docker_auth}
  services:
    consul-master:
      image: consul:v0.6.4
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: agent -server -ui -dc=${consul_dc} -bootstrap-expect=${consul_bootstrap_count}
      environment:
        CONSUL_BIND_INTERFACE: eth0
        CONSUL_CLIENT_INTERFACE: eth0
        CONSUL_LOCAL_CONFIG: '{"skip_leave_on_interrupt": true}'
    node-heartbeat:
      image: outstand/node_heartbeat:0.1.1
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: start -b ${heartbeat_bucket}
    consul_bridge:
      image: outstand/consul_bridge:0.1.5
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: start -b ${heartbeat_bucket} -n consul-master -a
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    consul_stockpile:
      image: outstand/consul_stockpile:0.1.2
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: start -b ${stockpile_bucket} -n ${stockpile_backup_name}
    schmooze:
      image: outstand/schmooze:latest
      command: dns Name=dns Driver=bridge CheckDuplicate:=true IPAM:='{"Config":[{"Subnet":"10.10.10.0/24","IPRange":"10.10.10.1/30"}]}' Options:='{"com.docker.network.bridge.enable_icc":"true","com.docker.network.bridge.enable_ip_masquerade":"true","com.docker.network.bridge.name":"dns"}'
      net: host
      cap_add:
        - NET_ADMIN
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
      labels:
        io.rancher.os.detach: "false"
    dns:
      image: outstand/selfish-dns:latest
      restart: always
      command: -A /consul/INSTANCE_IP
      net: dns
      cap_add:
        - NET_ADMIN
      labels:
        io.rancher.os.after: schmooze
client:
#cloud-config
write_files:
  - path: /opt/rancher/bin/start.sh
    permissions: "0700"
    owner: root:root
    content: |
      #!/bin/bash
      INSTANCE_IP=$(wget -O - -q http://169.254.169.254/latest/meta-data/local-ipv4)
      iptables -t nat -A PREROUTING -p tcp -d 127.0.0.1 --dport 8400 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A PREROUTING -p tcp -d 127.0.0.1 --dport 8500 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A OUTPUT -o lo -p tcp -m tcp --dport 8400 -j DNAT --to $${INSTANCE_IP}
      iptables -t nat -A OUTPUT -o lo -p tcp -m tcp --dport 8500 -j DNAT --to $${INSTANCE_IP}
rancher:
  cloud_init:
    datasources:
      - ec2
  registry_auths:
    https://index.docker.io/v1/:
      auth: ${docker_auth}
  services:
    consul-client:
      image: consul:v0.6.4
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: agent -ui -dc=${consul_dc}
      environment:
        CONSUL_BIND_INTERFACE: eth0
        CONSUL_CLIENT_INTERFACE: eth0
        CONSUL_LOCAL_CONFIG: '{"leave_on_terminate": true}'
    consul_bridge:
      image: outstand/consul_bridge:0.1.5
      restart: always
      labels:
        - io.rancher.os.remove=false
      net: host
      command: start -b ${heartbeat_bucket} -n consul-client
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    registrator:
      image: outstand/registrator:v7
      labels:
        - io.rancher.os.after=consul-client
      restart: always
      volumes:
        - /var/run/docker.sock:/tmp/docker.sock
      command: consul://127.0.0.1:8500
      net: host
      environment:
        TAG_AWS: 'true'
    schmooze:
      image: outstand/schmooze:latest
      command: dns Name=dns Driver=bridge CheckDuplicate:=true IPAM:='{"Config":[{"Subnet":"10.10.10.0/24","IPRange":"10.10.10.1/30"}]}' Options:='{"com.docker.network.bridge.enable_icc":"true","com.docker.network.bridge.enable_ip_masquerade":"true","com.docker.network.bridge.name":"dns"}'
      net: host
      cap_add:
        - NET_ADMIN
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
      labels:
        io.rancher.os.detach: "false"
    dns:
      image: outstand/selfish-dns:latest
      restart: always
      command: -A /consul/INSTANCE_IP
      net: dns
      cap_add:
        - NET_ADMIN
      labels:
        io.rancher.os.after: schmooze
    wait-for-consul:
      image: outstand/wait-for-consul:latest
      labels:
        io.rancher.os.after: consul-client,dns
        io.rancher.os.detach: "false"
      dns: 10.10.10.2
      environment:
        CONSUL_HOST: consul
    ecs-agent:
      image: amazon/amazon-ecs-agent:v1.10.0
      labels:
        io.rancher.os.after: wait-for-consul
      restart: always
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
        - /var/log/ecs/:/log
        - /var/lib/ecs/data:/data
        - /sys/fs/cgroup:/sys/fs/cgroup:ro
        - /var/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro
      ports:
        - "127.0.0.1:51678:51678"
      environment:
        - ECS_LOGFILE=/log/ecs-agent.log
        - ECS_LOGLEVEL=info
        - ECS_DATADIR=/data
        - ECS_RESERVED_MEMORY=400
        - ECS_*
        - AWS_*
  environment:
    ECS_CLUSTER: ${ecs_cluster}
    ECS_ENGINE_AUTH_TYPE: dockercfg
    ECS_ENGINE_AUTH_DATA: '{"https://index.docker.io/v1/":{"auth":"${docker_auth}","email":"${docker_email}"}}'
The schmooze service sets up a docker network so that the dns container will have the ip 10.10.10.2. This is a workaround until rancher supports the compose v2 format.
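As a rough sketch of what the workaround stands in for, a compose v2 file could declare the same network and pin the dns container's address directly. This is untested on RancherOS and only mirrors the subnet, IP range, and bridge options passed to schmooze above; the fixed ipv4_address is an assumption about how the 10.10.10.2 address would be expressed in that format.

```yaml
version: "2"
services:
  dns:
    image: outstand/selfish-dns:latest
    restart: always
    command: -A /consul/INSTANCE_IP
    cap_add:
      - NET_ADMIN
    networks:
      dns:
        ipv4_address: 10.10.10.2   # assumed: pin the address instead of relying on allocation order
networks:
  dns:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.enable_icc: "true"
      com.docker.network.bridge.enable_ip_masquerade: "true"
      com.docker.network.bridge.name: dns
    ipam:
      config:
        - subnet: 10.10.10.0/24
          ip_range: 10.10.10.1/30
```

With a declared network like this, the separate schmooze service and the io.rancher.os.after: schmooze ordering label would presumably no longer be needed.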
RancherOS Version: (ros os version) v0.5.0
Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.) AWS
I'm using an auto-scaling group to launch these instances so I'm sure they're starting with the same config. With v0.4.5 I have all of my services start every time. With 0.5.0 I'll have services randomly fail to be created, fail to restart, and fail to start (from created) if I specify before/after with service labels.
Here are the templates I'm using for my master and client nodes: