gravitational / gravity

Kubernetes application deployments for restricted, regulated, or remote environments
Apache License 2.0

[BUG] Gravity installation fails while populating docker registry #462

Closed (itay closed this issue 5 years ago)

itay commented 5 years ago

Describe the bug

Periodically, gravity install fails in the "Populate Docker registry" step. Here is sample CLI output:

sudo ./gravity install --token=fIJyJQHiaC --advertise-addr=172.31.100.139 --cloud-provider=generic --config=cluster-config.yaml --pod-network-cidr="172.23.0.0/16" --service-cidr="172.34.0.0/16"
Tue Jun 25 04:46:15 UTC Starting installer
Tue Jun 25 04:46:15 UTC Preparing for installation...
Tue Jun 25 04:46:19 UTC Installing application odas:1.4.0-rc.2-gravity
Tue Jun 25 04:46:19 UTC Starting non-interactive install
Tue Jun 25 04:46:19 UTC Auto-loaded kernel module: br_netfilter
Tue Jun 25 04:46:19 UTC Auto-loaded kernel module: iptable_nat
Tue Jun 25 04:46:19 UTC Auto-loaded kernel module: iptable_filter
Tue Jun 25 04:46:19 UTC Auto-loaded kernel module: ebtables
Tue Jun 25 04:46:19 UTC Auto-loaded kernel module: overlay
Tue Jun 25 04:46:19 UTC Auto-set kernel parameter: net.ipv4.ip_forward=1
Tue Jun 25 04:46:19 UTC Auto-set kernel parameter: net.bridge.bridge-nf-call-iptables=1
Tue Jun 25 04:46:20 UTC Successfully added "master" node on 172.31.100.139
Tue Jun 25 04:46:20 UTC All agents have connected!
Tue Jun 25 04:46:20 UTC Starting the installation
Tue Jun 25 04:46:21 UTC Operation has been created
Tue Jun 25 04:46:22 UTC Execute preflight checks
Tue Jun 25 04:46:24 UTC Configure packages for all nodes
Tue Jun 25 04:46:27 UTC Bootstrap master node ip-172-31-100-139.us-west-2.compute.internal
Tue Jun 25 04:46:31 UTC Pull packages on master node ip-172-31-100-139.us-west-2.compute.internal
Tue Jun 25 04:47:06 UTC Install system package teleport:3.0.5 on master node ip-172-31-100-139.us-west-2.compute.internal
Tue Jun 25 04:47:08 UTC Install system package odas-planet:1.4.0-rc.2-gravity on master node ip-172-31-100-139.us-west-2.compute.internal
Tue Jun 25 04:47:23 UTC Wait for kubernetes to become available
Tue Jun 25 04:47:40 UTC Bootstrap Kubernetes roles and PSPs
Tue Jun 25 04:47:42 UTC Configure CoreDNS
Tue Jun 25 04:47:43 UTC Create user-supplied Kubernetes resources
Tue Jun 25 04:47:45 UTC Populate Docker registry on master node ip-172-31-100-139.us-west-2.compute.internal
Tue Jun 25 04:47:46 UTC Operation failure: failed to connect to registry at "127.0.0.1:5000", failed to execute phase "/export/ip-172-31-100-139.us-west-2.compute.internal"
Tue Jun 25 04:47:46 UTC Installation failed in 1m26.39374357s, check /var/log/gravity-install.log and /var/log/gravity-system.log for details

This is running on an EC2 instance using the stock Amazon Linux 2 AMI, with a 120 GB disk. Here is the cluster-config.yaml file:

kind: ClusterConfiguration
version: v1
spec:
  global:
    # port range to reserve for services with NodePort visibility
    serviceNodePortRange: "1025-65535"
---
kind: user
version: v2
metadata:
  name: "admin"
spec:
  type: "admin"
  password: "fIJyJQHiaC"
  roles: ["@teleadmin"]
---
kind: RuntimeEnvironment
version: v1
spec:
  data:
    KUBE_KUBELET_FLAGS: "--sync-frequency=10s"

I've included the log files for gravity-install, gravity-system and journalctl from within the Planet container, as well as the one from the actual system (journalctl_system.log).

To Reproduce

Nothing special. I run the exact same steps every time (this is more or less scripted for us), and it succeeds 19 times out of 20. Typically I can just re-run the setup on the same machine (after running ./gravity leave --force to clean up), and it works.
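The manual retry described above could be automated along these lines. This is only a sketch of the workaround, not an official Gravity procedure: `install_with_retry` and its parameters are hypothetical names, and the commands in the usage comment are copied from this report.

```python
import subprocess


def install_with_retry(install_cmd, cleanup_cmd, attempts=3, run=subprocess.run):
    """Run install_cmd; after each failure run cleanup_cmd and try again.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. `run` is injectable so the logic can be tested
    without touching the system.
    """
    for attempt in range(1, attempts + 1):
        if run(install_cmd).returncode == 0:
            return attempt
        # Clean up the failed install so the next attempt (or a later
        # manual run) starts from a clean machine.
        run(cleanup_cmd)
    return 0


# Example usage (hypothetical wrapper around the commands from this report):
# install_with_retry(
#     ["sudo", "./gravity", "install", "--config=cluster-config.yaml"],
#     ["sudo", "./gravity", "leave", "--force"],
# )
```

Retrying only hides the intermittent failure; it does not address the underlying cause.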

Expected behavior

It should work every time.

Logs

gravity-install.log gravity-system.log journalctl.log journalctl_system.log

knisbet commented 5 years ago

https://github.com/gravitational/planet/pull/433 should address this issue.

It seems the installer can race the startup of the registry, and sometimes the registry doesn't start immediately. On my system this happened when docker took a long time to start. As a workaround, I removed the registry's dependency on docker starting, although this should also be addressed in the installer itself.
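To illustrate what addressing the race on the installer side might look like, here is a minimal sketch that polls the registry address from the failure message ("127.0.0.1:5000") until it accepts TCP connections or a deadline passes. This is not Gravity's actual code: the function name, timeout, and interval are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_registry(host="127.0.0.1", port=5000, timeout=60.0, interval=1.0):
    """Poll a TCP endpoint until it accepts connections or the timeout expires.

    Returns True once a connection succeeds, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful TCP connect only shows that something is listening;
            # it does not prove the registry is ready to serve requests.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: give up at the deadline,
            # otherwise wait and try again.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would run this check before the "Populate Docker registry" step and fail only if it returns False, rather than failing on the first refused connection.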

itay commented 5 years ago

That’s great. Do you think it’ll be ported to the 5.5.x line as well?

knisbet commented 5 years ago

Yea, since it looks like it's happening on older versions, I'll backport it.

itay commented 5 years ago

Any updates on this? We run into this pretty deterministically on larger instances with more CPUs.

knisbet commented 5 years ago

The workaround has been backported to the active maintenance branches, so it's just a matter of checking in with the team. I can try and push out the release tomorrow if there's nothing else pending.

itay commented 5 years ago

Thanks - it seems like the only commit pending in the 5.5.x branch. If there's a published release of Planet somewhere with it, I am happy to give it a shot and see if it fixes the problem prior to cutting the release.

knisbet commented 5 years ago

No, I don't think planet has been tagged yet, but the workaround appears to be working on the 6.0 branch. It is possible to build planet yourself and include it in an application using the custom base image support in gravity: https://gravitational.com/gravity/docs/pack/#user-defined-base-image

knisbet commented 5 years ago

@itay sorry for the delay, I got kind of swamped yesterday. I've released 5.6.5 / 5.5.13 which have the workaround for this issue.

itay commented 5 years ago

Excellent - we will test it. Thank you so much!