vexxhost / atmosphere

Simple & easy private cloud platform featuring VMs, Kubernetes & bare-metal

CLI occasionally fails #1291

Closed by mnaser 2 months ago

mnaser commented 3 months ago
failed: [instance] (item={'role': 'member', 'implies': 'load-balancer_member'}) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o posix\nsource /etc/profile.d/atmosphere.sh\nopenstack implied role create  --implied-role load-balancer_member  member\n", "delta": "0:00:00.315367", "end": "2024-05-31 23:25:51.282635", "failed_when_result": true, "item": {"implies": "load-balancer_member", "role": "member"}, "msg": "non-zero return code", "rc": 1, "start": "2024-05-31 23:25:50.967268", "stderr": "time=\"2024-05-31T23:25:51Z\" level=fatal msg=\"container \\\"5975025a5f62c76e1f95be14ee258795344637bd0c08a386d35a41cb212ee6a3\\\" in namespace \\\"k8s.io\\\": not found\"", "stderr_lines": ["time=\"2024-05-31T23:25:51Z\" level=fatal msg=\"container \\\"5975025a5f62c76e1f95be14ee258795344637bd0c08a386d35a41cb212ee6a3\\\" in namespace \\\"k8s.io\\\": not found\""], "stdout": "", "stdout_lines": []}

I also noticed this when using the CLI on my side. Do you have any idea what could be causing this, @fitbeard?

fitbeard commented 3 months ago

We set the namespace here: https://github.com/vexxhost/atmosphere/blob/main/roles/openstack_cli/meta/main.yml#L35 nerdctl expects this namespace, and if Kubernetes has not been bootstrapped, the namespace does not exist. The namespace can also be left unset, but then the image layers that containerd pulled for Kubernetes will not be reused, because they live in a different namespace. Does that answer your question, @mnaser?

fitbeard commented 3 months ago

Personally, I have never had a similar issue.

mnaser commented 3 months ago

@fitbeard interesting, I see that error and on the next run/refresh it'll be fine, so I wonder what's causing it. It doesn't always happen, but it seems to come up every now and then. 🤔

mnaser commented 3 months ago

CI randomly failed on this again here too:

https://ci.atmosphere.dev/t/atmosphere/build/a6d054b47ff74ed2baf329ff627d9be8

fitbeard commented 3 months ago

It's interesting that the mentioned container is missing. Maybe the containerd socket is being killed earlier?

mnaser commented 3 months ago

Collecting more here:

https://ci.atmosphere.dev/t/atmosphere/build/054c124a30c6459199de2aaf2bb7bafa

We do grab the kubelet logs. I wonder if kubelet is somehow racing and killing the container at the same time or something.

mnaser commented 2 months ago

This is happening more and more. I'm going to have to introduce retries on the Ansible tasks to keep the whole CI job from failing. More examples:

https://ci.atmosphere.dev/t/atmosphere/build/2069321a2af54c7ebf6b6cc5d6493474
https://ci.atmosphere.dev/t/atmosphere/build/ef2d1b175d294bf08573329ac814c14f
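
For reference, a rough sketch of what retrying such a task could look like, using Ansible's standard `retries`, `delay`, and `until` keywords. This is not the actual Atmosphere task: the task name, the `_implied_role` variable, the retry count, and the delay are all made up for illustration, and the command is copied from the failure output above. It also assumes a non-zero return code is always worth retrying, which may not match the real `failed_when` logic in the role.

```yaml
# Hypothetical sketch only; names and values are assumptions, not the real role.
- name: Create implied role, retrying on transient CLI failures
  ansible.builtin.shell: |
    set -o posix
    source /etc/profile.d/atmosphere.sh
    openstack implied role create --implied-role {{ item.implies }} {{ item.role }}
  args:
    executable: /bin/bash
  loop:
    - role: member
      implies: load-balancer_member
  register: _implied_role          # hypothetical variable name
  retries: 5                       # assumed value; tune to how flaky the CLI is
  delay: 2                         # seconds to wait between attempts (assumed)
  until: _implied_role.rc == 0     # retry each item until the command succeeds
```

Retries like this only hide the "container not found" error so the CI job can pass. They do not explain why the container disappears in the first place.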