Closed by nicosalvadore 1 month ago
My plan was to delete the resources and recreate them while keeping the data stored in /data/postgres-15.
I tried to delete the k8s resources by running `kubectl delete ns awx`, but the CLI hung and the resources (mainly pods) were stuck in a `Terminating` state.
So I rolled back to a previous VM snapshot and then tried `kubectl delete -k base`. Some resources were deleted, but the command still hung. After a Ctrl-C, I could still see the pods stuck in a `Terminating` state.
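In hindsight, a quick way to see exactly which pods were stuck would have been to filter the pod listing by status. A small sketch (the `stuck_pods` helper and the `awx` namespace usage are illustrative, not something from the original session):

```shell
# Filter `kubectl get pods --no-headers` output down to pods whose
# STATUS column (field 3) reads "Terminating", printing their names.
stuck_pods() {
  awk '$3 == "Terminating" { print $1 }'
}

# Usage against a live cluster:
#   kubectl -n awx get pods --no-headers | stuck_pods
# A pod usually hangs in Terminating because a finalizer is still set;
# inspecting it often explains the hang:
#   kubectl -n awx get pod <pod-name> -o jsonpath='{.metadata.finalizers}'
```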
So I ran the following commands:
```shell
kubectl delete pod/awx-task-9b6dcc459-4sfbm --grace-period=0 --force --namespace awx
kubectl delete pod/awx-web-66cfcc4f8c-nhg9k --grace-period=0 --force --namespace awx
kubectl delete pod/awx-postgres-15-0 --grace-period=0 --force --namespace awx
kubectl delete --grace-period=0 --force --namespace awx pod/awx-operator-controller-manager-7bd778dbbc-cnt2q
kubectl -n awx delete replicaset.apps/awx-operator-controller-manager-775bd7b75d
kubectl -n awx delete replicaset.apps/awx-operator-controller-manager-9874d5cfc
```
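For future reference, the one-by-one pod deletions above can be generated mechanically instead of typed by hand. A hedged sketch (the `force_delete_cmds` helper is my own illustration, and note that `--grace-period=0 --force` only removes the API object, it does not guarantee the container processes on the node are actually gone):

```shell
# Emit a force-delete command for every pod reported as Terminating
# by `kubectl get pods --no-headers` (STATUS is the third column).
force_delete_cmds() {
  awk '$3 == "Terminating" {
    print "kubectl -n awx delete pod " $1 " --grace-period=0 --force"
  }'
}

# Review the generated commands first, then pipe them to sh:
#   kubectl -n awx get pods --no-headers | force_delete_cmds
#   kubectl -n awx get pods --no-headers | force_delete_cmds | sh
```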
And then `kubectl apply -k base`.
A few seconds later, all expected pods were running and the AWX UI+API was up.
To be honest I'm still not sure what happened, but it looks like it's solved.
Any idea? Thanks!
@nicosalvadore Thanks for the report and for digging deeper into the details, it really helps me understand the situation better.
From what you shared, it sounds more like the DB pod was frozen in a 'Terminating' state rather than simply failing to start.
I think the Veeam backup was probably taken while the VM was running, so it was likely only crash-consistent. That could have led to data inconsistencies after the restore, since the integrity of K3s's internal data (etcd) wasn't guaranteed.
So, I agree that forcefully deleting the resources stuck in 'Terminating' and redeploying is definitely the right approach.
Just one thing to double-check: are the credentials stored in AWX working correctly?
If `kubectl delete ns awx` ended up deleting the `awx-secret-key` secret in the `awx` namespace, it might have been recreated by the AWX Operator, which could mean you can no longer decrypt any sensitive info like credentials.
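If you want to double-check, one rough way is to compare the secret's `creationTimestamp` against when you originally deployed AWX; a recent timestamp would suggest the operator recreated it. A sketch (the helper and the cutoff date are placeholders, not values from your cluster):

```shell
# ISO-8601 UTC timestamps sort correctly as plain strings, so a
# lexicographic comparison is enough to decide "older than cutoff".
older_than() {
  [[ "$1" < "$2" ]]
}

# Usage against a live cluster:
#   ts=$(kubectl -n awx get secret awx-secret-key \
#          -o jsonpath='{.metadata.creationTimestamp}')
#   older_than "$ts" "2024-01-01T00:00:00Z" \
#     && echo "secret predates the incident" \
#     || echo "secret looks recreated"
```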
Hi @kurokobo!
Thanks a lot for your answer. You might be right about the frozen-while-terminating state, though I'm still wondering why. You're right that the Veeam backup was taken while the VM was running, which could have caused the issue. But it's strange that the same issue occurred after migrating/converting the VM to Nutanix's hypervisor, because in that case VM snapshots are used for the conversion, and I've often used snapshots on this k3s VM when doing AWX Operator upgrades.
It's possible that taking the backup itself caused the issue on the live VM, and that the services were down from that moment on, and thus also down while migrating from vSphere to Nutanix. I admit I didn't check whether AWX was up before starting the migration process. So it might just be bad luck, who knows...
The credentials are working correctly, yes! I believe it's because I defined my own in `kustomization.yaml`.
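For anyone reading along, pinning your own secret key with a kustomize `secretGenerator` looks roughly like this. A sketch only: the placeholder value and exact layout follow the kustomize docs, not my actual file, and `disableNameSuffixHash` is there so the secret keeps a stable name the AWX spec can reference.

```shell
# Write a minimal kustomization.yaml fragment that pins awx-secret-key.
# The AWX Operator expects the key inside the secret to be "secret_key".
cat > kustomization.yaml <<'EOF'
generatorOptions:
  disableNameSuffixHash: true
secretGenerator:
  - name: awx-secret-key
    type: Opaque
    literals:
      - secret_key=REPLACE_WITH_A_LONG_RANDOM_STRING
EOF
```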
Nevertheless, this issue has been a good learning exercise on AWX and k8s 😛
@nicosalvadore Thanks for updating!
It’s definitely a strange situation, but if you keep forcefully powering off the virtual machine, there might be times when it breaks and times when it doesn’t, so if bad luck strikes, something like this could happen. Nobody really knows...
> The credentials are working correctly, yes! I believe it's because I defined my own in `kustomization.yaml`.

Great! I'm relieved to hear that.

> Nevertheless, this issue has been a good learning exercise on AWX and k8s 😛
Troubleshooting is always a great source of learning, especially when we can take our time with it. I also use the tons of questions I get from everyone as a way to learn myself, so thank you for sharing your trouble with me 😃
I’ll close this issue, but feel free to reach out if you need anything else.
Environment
Description
My AWX deployment had been running without issue for months, including upgrades via the AWX Operator.
Yesterday, I tried to migrate the VM from VMware vSphere to a Nutanix AHV (KVM) cluster using their tool Nutanix Move. The VM was migrated in a few minutes, but I noticed later that the AWX services were not coming back up. So I started the VM back on vSphere, with no luck either, and then restored a Veeam backup taken the night before; it is still not working. I restored from the backup because I knew Nutanix Move installs drivers during the migration, so it might have been the cause of the issue, but it doesn't look like it is.
Observations
`404 page not found`.

Logs
Files
I'm stuck in my troubleshooting, not knowing why the database is not made available to the other pods/containers. Thanks in advance for your help!