Closed: GrigoriyMikhalkin closed this issue 3 years ago.
Hi there,
I am having difficulties reproducing the behavior, and I also haven't worked much with the mini-lab lately, so I need some more information from you. I tried out the following without any issues:
- `make`
- `make` again (idempotence check)
- `make machine` and `make firewall`, both machines came up properly
- `make delete-machine01` and `make delete-machine02`
Encountered no issues to this point.
From the line

```
deploy-control-plane | client.go:531: [debug] metal-api-initdb: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
```
I would assume that there is one pod hanging in the `metal-control-plane` namespace. Could you please check whether a pod is in CrashLoopBackOff when your problem occurs? I can imagine that it's the `metal-api-createmasterdata-update` job. Did you subsequently apply any changes to the OS images or anything like that?
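A quick way to check for such a hanging pod (a sketch; the namespace comes from the log line above, and the job name is the one suspected in this comment):

```shell
# List all pods in the metal-control-plane namespace; look for
# CrashLoopBackOff or Error in the STATUS column.
kubectl get pods -n metal-control-plane

# If the suspected job is the culprit, its pod logs usually show why:
kubectl logs -n metal-control-plane job/metal-api-createmasterdata-update
```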
Also, we merged a small set of changes to the mini-lab last Friday, which fix the CI and release pipeline. Please try out the latest HEAD as well.
Still having the issue or can this be closed?
I still run into this issue. Can we have this part locally installed and provided by a local server?
Thanks for the quick showcase, @GrigoriyMikhalkin. We found out that the cause of this issue is a slow internet connection. During the deployment, many Docker images are pulled, and hooks like `api-initdb` can time out before essentials like the databases have come up.
One possible mitigation could be to pull some images onto the local machine before the deployment and then `kind load` them into the kind cluster (this would prevent you from pulling the images again and again in case you delete the kind cluster often through `make cleanup`). Most of the pods use the `IfNotPresent` pull policy, so those images won't be pulled again.
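Such a pre-pull could look like this (a sketch; the image name and the kind cluster name are placeholders, not the actual mini-lab image list):

```shell
# Pull the image once onto the host so it survives cluster re-creation.
docker pull some-registry.example.com/metal-stack/metal-api:latest

# Load the locally cached image into the kind cluster; check the actual
# cluster name with `kind get clusters`.
kind load docker-image some-registry.example.com/metal-stack/metal-api:latest \
  --name metal-control-plane
```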
It would be good if we could let the user set the `imagePullPolicy` globally in our deployments. For most development scenarios `Always` would be preferable; for production use-cases `IfNotPresent` is the better choice.
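Until such a global switch exists, the policy of a single deployment can be overridden manually (illustrative only; the deployment name and container index are assumptions for the example):

```shell
# Patch the first container of an assumed metal-api deployment to
# IfNotPresent so the cached image is reused on restarts.
kubectl patch deployment metal-api -n metal-control-plane --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "IfNotPresent"}]'
```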
I filed issues for that:
You could also consider spinning up a local registry and using a replacement for the image vector, but that's probably quite some work, and I am not sure whether it's intuitive or worth the effort.
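For reference, spinning up such a registry is only one command; rewriting the image vector to point at `localhost:5000` is the part that takes the actual work (the image name below is a placeholder):

```shell
# Run the official registry image locally.
docker run -d --restart=always -p 5000:5000 --name local-registry registry:2

# Re-tag and push an image into the local registry.
docker tag some-registry.example.com/metal-stack/metal-api:latest \
  localhost:5000/metal-api:latest
docker push localhost:5000/metal-api:latest
```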
I will close this issue, as I think the mini-lab comes up properly if you just wait long enough for the images to download, and potentially retry running `make control-plane` once you see pods running in the kind cluster.
Some time after the first start of the `mini-lab`, I'm failing to restart it. I get a very similar error, but at different stages. So far I have seen errors at:

- `deploy-partition | TASK [metal-roles/partition/roles/docker-on-cumulus : ensure dependencies are installed]`
- `deploy-partition | TASK [ansible-common/roles/systemd-docker-service : pre-pull docker image]`
- `deploy-partition | TASK [metal-roles/partition/roles/metal-core : wait for metal-core to listen on port]`

Here is the last error that I got:

I'm using `mini-lab` on the master branch with only one change: `metal_stack_release_version` set to `develop`. The only thing that reliably helps is pruning everything (networks, build cache, containers, images) from Docker.

OS: Ubuntu 20.04
Vagrant: 2.2.9
Docker:
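The pruning mentioned above can be done in one step (warning: this removes all unused Docker data on the host, not just mini-lab artifacts):

```shell
# Removes stopped containers, unused networks, all unused images,
# and the build cache; add --volumes to also remove unused volumes.
docker system prune -a -f
```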
cc @Gerrit91, @LimKianAn