metal-stack / mini-lab

a small, virtual setup to locally run the metal-stack

Periodically failing to restart mini-lab #51

Closed GrigoriyMikhalkin closed 3 years ago

GrigoriyMikhalkin commented 3 years ago

Some time after the first start of the mini-lab, I fail to restart it. I get very similar errors, but at different stages (so far I have seen errors at deploy-partition | TASK [metal-roles/partition/roles/docker-on-cumulus : ensure dependencies are installed], deploy-partition | TASK [ansible-common/roles/systemd-docker-service : pre-pull docker image], and deploy-partition | TASK [metal-roles/partition/roles/metal-core : wait for metal-core to listen on port]). Here is the last error I got:

deploy-control-plane | fatal: [localhost]: FAILED! => changed=true 
deploy-control-plane |   cmd:
deploy-control-plane |   - helm
deploy-control-plane |   - upgrade
deploy-control-plane |   - --install
deploy-control-plane |   - --namespace
deploy-control-plane |   - metal-control-plane
deploy-control-plane |   - --debug
deploy-control-plane |   - --set
deploy-control-plane |   - helm_chart.config_hash=7fc19e1bc1a3ee41f622c3de7bc98ee33756844e
deploy-control-plane |   - -f
deploy-control-plane |   - metal-values.j2
deploy-control-plane |   - --repo
deploy-control-plane |   - https://helm.metal-stack.io
deploy-control-plane |   - --version
deploy-control-plane |   - 0.2.1
deploy-control-plane |   - --wait
deploy-control-plane |   - --timeout
deploy-control-plane |   - 600s
deploy-control-plane |   - metal-control-plane
deploy-control-plane |   - metal-control-plane
deploy-control-plane |   delta: '0:10:02.713685'
deploy-control-plane |   end: '2020-12-09 08:47:29.432729'
deploy-control-plane |   msg: non-zero return code
deploy-control-plane |   rc: 1
deploy-control-plane |   start: '2020-12-09 08:37:26.719044'
deploy-control-plane |   stderr: |-
deploy-control-plane |     history.go:53: [debug] getting history for release metal-control-plane
deploy-control-plane |     install.go:172: [debug] Original chart version: "0.2.1"
deploy-control-plane |     install.go:189: [debug] CHART PATH: /root/.cache/helm/repository/metal-control-plane-0.2.1.tgz
deploy-control-plane |   
deploy-control-plane |     client.go:255: [debug] Starting delete for "metal-api-initdb" Job
deploy-control-plane |     client.go:284: [debug] jobs.batch "metal-api-initdb" not found
deploy-control-plane |     client.go:109: [debug] creating 1 resource(s)
deploy-control-plane |     client.go:464: [debug] Watching for changes to Job metal-api-initdb with timeout of 10m0s
deploy-control-plane |     client.go:492: [debug] Add/Modify event for metal-api-initdb: ADDED
deploy-control-plane |     client.go:531: [debug] metal-api-initdb: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
deploy-control-plane |     client.go:492: [debug] Add/Modify event for metal-api-initdb: MODIFIED
deploy-control-plane |     client.go:531: [debug] metal-api-initdb: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
deploy-control-plane |     Error: failed pre-install: timed out waiting for the condition
deploy-control-plane |     helm.go:81: [debug] failed pre-install: timed out waiting for the condition
deploy-control-plane |   stderr_lines: <omitted>
deploy-control-plane |   stdout: Release "metal-control-plane" does not exist. Installing it now.
deploy-control-plane |   stdout_lines: <omitted>
deploy-control-plane | 
deploy-control-plane | PLAY RECAP *********************************************************************
deploy-control-plane | localhost                  : ok=24   changed=11   unreachable=0    failed=1    skipped=8    rescued=0    ignored=0 

I'm using the mini-lab on the master branch with only one change: metal_stack_release_version is set to develop. The only thing that reliably helps is pruning everything (networks, build cache, containers, images) from Docker.
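For reference, a full Docker prune can be done roughly like this (a sketch; the exact commands used here may have differed):

    # removes stopped containers, unused networks, unused images and the build cache
    docker system prune --all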

OS: Ubuntu 20.04
Vagrant: 2.2.9
Docker:

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:01:20 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.9
  GitCommit:        ea765aba0d05254012b0b9e595e995c09186427f
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

cc @Gerrit91, @LimKianAn

Gerrit91 commented 3 years ago

Hi there,

I am having difficulties reproducing the behavior, and I also haven't worked much with the mini-lab recently, so I need some more information from you. I tried out the following without any issues (condensed into a command sketch after the list):

  1. Run make two times in a row (idempotence check)
  2. Run make machine and make firewall, both machines came up properly
  3. Wait a couple of minutes
  4. Run make again
  5. Run make delete-machine01 and make delete-machine02
  6. Both machines end up waiting again (PXE boot of one machine was slow though)

Encountered no issues to this point.
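As a sequence of commands, that is roughly (a sketch using the make targets named above):

    make                  # first run
    make                  # second run, idempotence check
    make machine
    make firewall
    # wait a couple of minutes
    make
    make delete-machine01
    make delete-machine02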

From the line

deploy-control-plane | client.go:531: [debug] metal-api-initdb: Jobs active: 1, jobs failed: 0, jobs succeeded: 0

I would assume that there is one pod in the metal-control-plane namespace that is hanging. Could you please check whether a pod is in CrashLoopBackOff when your problem occurs? I can imagine that it's the metal-api-createmasterdata-update job. Did you subsequently apply any changes to the OS images or anything similar?
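A quick way to check this (a sketch; the pod name is a placeholder):

    # list pods in the control plane namespace and look for CrashLoopBackOff or stuck jobs
    kubectl get pods -n metal-control-plane
    # inspect a suspicious pod in more detail
    kubectl describe pod <pod-name> -n metal-control-plane
    kubectl logs <pod-name> -n metal-control-plane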

Also, we merged a few changes to the mini-lab last Friday, which fix the CI and release pipeline. Please try out the latest HEAD as well.

Gerrit91 commented 3 years ago

Are you still having the issue, or can this be closed?

LimKianAn commented 3 years ago

I still run into this issue. Could we have this part installed locally and served by a local server?

Gerrit91 commented 3 years ago

Thanks for the quick showcase, @GrigoriyMikhalkin. We found out that the cause of this issue is a slow internet connection. During the deployment, many Docker images are pulled, and hooks like api-initdb can time out before essential components like the databases have come up.

One possible mitigation could be to pull some images onto the local machine before the deployment and then kind load them into the kind cluster (this would prevent you from pulling the images again and again in case you often delete the kind cluster through make cleanup; see the sketch below). Most of the pods use the IfNotPresent pull policy, so those images won't be pulled again.
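A sketch of that approach (the image name is a placeholder and the kind cluster name depends on your setup; adjust both accordingly):

    # pull an image once onto the host
    docker pull <registry>/<image>:<tag>
    # load it into the running kind cluster so pods with pull policy IfNotPresent reuse it
    kind load docker-image <registry>/<image>:<tag> --name <kind-cluster-name>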

It would be good if we could let the user set the ImagePullPolicy globally in our deployments. For most development scenarios Always would be preferable; for production use cases IfNotPresent is the better choice.

I filed issues for that:

You could also consider spinning up a local registry and using a replacement for the image vector, but that's probably quite some work and I am not sure whether it would be intuitive or worth the effort.
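Spinning up the registry itself is a one-liner (a sketch; rewriting the deployment's image references to point at it is the actual work):

    # run a local Docker registry on port 5000
    docker run -d --restart=always -p 5000:5000 --name registry registry:2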

I will close this issue, as I think the mini-lab comes up properly if you just wait long enough for the images to download, and potentially retry running make control-plane once you see pods running in the kind cluster, for example as sketched below.
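A possible retry sequence (a sketch; make control-plane is the mini-lab target mentioned above):

    # watch until the pods in the kind cluster are running
    kubectl get pods -A --watch
    # then retry the control plane deployment
    make control-plane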