kurokobo / awx-on-k3s

An example implementation of AWX on a single-node K3s using AWX Operator, with an easy-to-use, simplified configuration and full ownership of data and passwords.
MIT License

can't take backup #342

Closed: fahadshery closed this issue 5 months ago

fahadshery commented 5 months ago

I deployed it exactly the same way as in your guide... nothing changed...

I just tried to follow your backup guide, but I am getting the error below and no backup is generated:

{"level":"info","ts":"2024-04-13T17:41:52Z","logger":"logging_event_handler","msg":"[playbook task start]","name":"awx","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"playbook_on_task_start","job":"4431001951063708771","EventData.Name":"Verify imagePullSecrets"}
--------------------------- Ansible Task StdOut -------------------------------

TASK [Verify imagePullSecrets] *************************************************
task path: /opt/ansible/playbooks/awx.yml:10

-------------------------------------------------------------------------------
E0413 17:42:03.622973       7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Get "https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/awx/leases/awx-operator": context deadline exceeded
I0413 17:42:04.132972       7 leaderelection.go:285] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"info","ts":"2024-04-13T17:42:05Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-04-13T17:42:06Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-04-13T17:42:07Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxbackup-controller"}
{"level":"info","ts":"2024-04-13T17:42:08Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxrestore-controller"}
{"level":"info","ts":"2024-04-13T17:42:08Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxmeshingress-controller"}
{"level":"info","ts":"2024-04-13T17:42:09Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2024-04-13T17:42:08Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awx-controller"}
{"level":"info","ts":"2024-04-13T17:42:09Z","msg":"All workers finished","controller":"awxrestore-controller"}
{"level":"info","ts":"2024-04-13T17:42:10Z","msg":"All workers finished","controller":"awxmeshingress-controller"}
{"level":"info","ts":"2024-04-13T17:42:13Z","msg":"Stopping and waiting for webhooks"}
{"level":"error","ts":"2024-04-13T17:42:05Z","logger":"cmd","msg":"Proxy or operator exited with error.","error":"leader election lost","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.run\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:261\ngithub.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.NewCmd.func1\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:81\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tansible-operator-plugins/cmd/ansible-operator/main.go:40\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.12/x64/src/runtime/proc.go:250"}
kurokobo commented 5 months ago

@fahadshery According to your logs, this seems to be not an Operator issue, but a K3s issue. Your Operator suddenly shut down for some reason, e.g. high load on the K3s host. Is your cluster healthy? Could you please provide the output of the following commands?

kubectl -n awx get deployment,pod
kubectl -n awx describe pod <operator pod name>
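
If the describe output shows restarts or OOM kills, node-level pressure is also worth checking. A few generic diagnostics (kubectl top relies on the metrics-server, which K3s bundles by default):

kubectl -n awx get events --sort-by=.lastTimestamp
kubectl top node
kubectl -n awx top pod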
fahadshery commented 5 months ago

You were right. I increased the resources a little, it's no longer crashing, and the backup was created successfully following your guide...

thank you so much for your hard work!
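
For anyone hitting the same leader-election timeouts, one way to raise the operator's resource limits is a minimal sketch like the following. The deployment and container names assume a default awx-operator install and may differ in your cluster; the values are illustrative, not recommendations:

# bump CPU/memory limits on the operator's manager container
kubectl -n awx set resources deployment awx-operator-controller-manager \
  -c awx-manager --limits=cpu=2000m,memory=2Gi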