postgres db seems unavailable

nicosalvadore commented 1 month ago

Environment

k3s version v1.28.5+k3s1 (5b2d1271)
go version go1.20.12
Rocky Linux 9.3 (4 vCPU, 8 GB RAM), as a VM on vSphere
AWX Operator 2.17.0

Description

My AWX deployment had been running without issues for months, including through upgrades with the AWX Operator.

Yesterday, I tried to migrate the VM from VMware vSphere to a Nutanix AHV (KVM) cluster with Nutanix's migration tool, Nutanix Move. The VM was migrated in a few minutes, but I noticed later that the AWX services were not coming back up. I started the VM back up on vSphere, but had no luck there either, so I restored it from a Veeam backup taken the night before. It's still not working. I restored from the backup because I knew Nutanix Move installs drivers in the VM, which might have been the cause of the issue, but it doesn't look like it is.

Observations

When loading the AWX web page, I get a 404 page not found.
The postgres container is not starting as far as I understand.

Logs

$ kubectl -n awx get pods,deployments,statefulsets,services
NAME                                                   READY   STATUS             RESTARTS        AGE
pod/awx-operator-controller-manager-7bd778dbbc-cnt2q   2/2     Terminating        0               113d
pod/awx-task-9b6dcc459-4sfbm                           4/4     Terminating        0               113d
pod/awx-web-66cfcc4f8c-nhg9k                           3/3     Terminating        0               113d
pod/awx-postgres-15-0                                  1/1     Terminating        0               152d
pod/awx-task-9b6dcc459-6cp74                           0/4     Init:0/3           1               14h
pod/awx-operator-controller-manager-7bd778dbbc-gx86h   2/2     Running            2 (38m ago)     14h
pod/awx-web-66cfcc4f8c-mvqpz                           2/3     CrashLoopBackOff   168 (99s ago)   14h

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-task                          0/1     1            0           236d
deployment.apps/awx-operator-controller-manager   1/1     1            1           237d
deployment.apps/awx-web                           0/1     1            0           236d

NAME                               READY   AGE
statefulset.apps/awx-postgres-15   0/1     152d

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.248.176   <none>        8443/TCP   237d
service/awx-postgres-15                                   ClusterIP   None            <none>        5432/TCP   152d
service/awx-service                                       ClusterIP   10.43.188.212   <none>        80/TCP     236d

$ kubectl -n awx logs awx-web-66cfcc4f8c-mvqpz -c awx-web
[...]
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/connection.py", line 728, in connect
    attempts = conninfo_attempts(params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/_conninfo_attempts.py", line 45, in conninfo_attempts
    raise e.OperationalError(str(last_exc))
psycopg.OperationalError: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
             ^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/__init__.py", line 161, in manage
    if (connection.pg_version // 10000) < 12:
        ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/connection.py", line 15, in __getattr__
    return getattr(self._connections[self._alias], item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/functional.py", line 57, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
                                         ^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/postgresql/base.py", line 436, in pg_version
    with self.temporary_connection():
  File "/usr/lib64/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 705, in temporary_connection
    with self.cursor() as cursor:
         ^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 330, in cursor
    return self._cursor()
           ^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 306, in _cursor
    self.ensure_connection()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 288, in ensure_connection
    with self.wrap_database_errors:
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/connection.py", line 728, in connect
    attempts = conninfo_attempts(params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/_conninfo_attempts.py", line 45, in conninfo_attempts
    raise e.OperationalError(str(last_exc))
django.db.utils.OperationalError: [Errno -2] Name or service not known
[...]

$ kubectl -n awx describe pod awx-task-9b6dcc459-6cp74
[...]
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  SandboxChanged  42m   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          42m   kubelet  Container image "quay.io/ansible/awx:24.4.0" already present on machine
  Normal  Created         42m   kubelet  Created container init-database
  Normal  Started         42m   kubelet  Started container init-database

$ kubectl -n awx logs awx-task-9b6dcc459-6cp74 -c init-database
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3
[wait-for-migrations] Waiting 2 seconds before next attempt
[wait-for-migrations] Attempt 4
[wait-for-migrations] Waiting 4 seconds before next attempt
[wait-for-migrations] Attempt 5
[wait-for-migrations] Waiting 8 seconds before next attempt
[wait-for-migrations] Attempt 6
[wait-for-migrations] Waiting 16 seconds before next attempt
[wait-for-migrations] Attempt 7
[wait-for-migrations] Waiting 30 seconds before next attempt
[...]

$ kubectl -n awx describe pod awx-postgres-15-0
Name:                      awx-postgres-15-0
Namespace:                 awx
Priority:                  0
Service Account:           default
Node:                      awx-prod-02/192.168.103.19
Start Time:                Fri, 19 Apr 2024 17:02:41 +0200
Labels:                    app.kubernetes.io/component=database
                           app.kubernetes.io/instance=postgres-15-awx
                           app.kubernetes.io/managed-by=awx-operator
                           app.kubernetes.io/name=postgres-15
                           app.kubernetes.io/part-of=awx
                           apps.kubernetes.io/pod-index=0
                           controller-revision-hash=awx-postgres-15-7cfb7786c4
                           statefulset.kubernetes.io/pod-name=awx-postgres-15-0
Annotations:               <none>
Status:                    Terminating (lasts 14h)
Termination Grace Period:  30s
IP:                        10.42.0.89
IPs:
  IP:           10.42.0.89
Controlled By:  StatefulSet/awx-postgres-15
Containers:
  postgres:
    Container ID:   containerd://42592c9d777a0e684e1b85d633ca2c6613017a0ca77dd19fd26f373a4312129b
    Image:          quay.io/sclorg/postgresql-15-c9s:latest
    Image ID:       quay.io/sclorg/postgresql-15-c9s@sha256:a785f34226ea06196f2ddc6253fbc70328f77412965b11e1f336fbda7ca21340
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 19 Apr 2024 17:03:17 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      POSTGRESQL_DATABASE:        <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_USER:            <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_PASSWORD:        <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_DB:                <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/pgsql/data/userdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/pgsql/data from postgres-15 (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jv7sp (ro)
Conditions:
  Type               Status
  Initialized        True
  Ready              False
  ContainersReady    True
  PodScheduled       True
  DisruptionTarget   True
Volumes:
  postgres-15:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-15-awx-postgres-15-0
    ReadOnly:   false
  kube-api-access-jv7sp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

$ firewall-cmd --state
not running

Files

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # These parameters are designed for use with:
  # - AWX Operator: 2.17.0
  #   https://github.com/ansible/awx-operator/blob/2.17.0/README.md

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_hosts:
    - hostname: awx.domain.net
      tls_secret: awx-secret-tls

  postgres_configuration_secret: awx-postgres-configuration

  postgres_data_volume_init: true
  postgres_storage_class: awx-postgres-volume
  postgres_storage_requirements:
    requests:
      storage: 8Gi

  projects_persistence: true
  projects_existing_claim: awx-projects-claim

  web_replicas: 1
  task_replicas: 1

  web_resource_requirements: {}
  task_resource_requirements: {}
  ee_resource_requirements: {}
  init_container_resource_requirements: {}
  postgres_resource_requirements: {}
  redis_resource_requirements: {}
  rsyslog_resource_requirements: {}

  # Uncomment to reveal "censored" logs
  #no_log: false

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: awx

generatorOptions:
  disableNameSuffixHash: true

secretGenerator:
  - name: awx-secret-tls
    type: kubernetes.io/tls
    files:
      - tls.crt
      - tls.key

  - name: awx-postgres-configuration
    type: Opaque
    literals:
      - host=awx-postgres-15
      - port=5432
      - database=awx
      - username=awx
      - password=my-secret-postgress-password
      - type=managed

  - name: awx-admin-password
    type: Opaque
    literals:
      - password=my-secret-awx-password

  # If you want to specify SECRET_KEY for your AWX manually, uncomment following lines and change the value.
  # Refer AAC documentation for detail about SECRET_KEY.
  # https://docs.ansible.com/automation-controller/latest/html/administration/secret_handling.html
  - name: awx-secret-key
    type: Opaque
    literals:
      - secret_key=my-secret-awx-key

resources:
  - pv.yaml
  - pvc.yaml
  - awx.yaml

I'm stuck in my troubleshooting steps, not knowing why the database is not made available to other pods/containers. Thanks in advance for your help !

nicosalvadore commented 1 month ago

My plan was to delete the resources and recreate them while keeping the data stored in /data/postgres-15.

I tried to delete the k8s resources by running kubectl delete ns awx, but the CLI hung and resources (mainly pods) were stuck in a Terminating state. So I rolled back to a previous VM snapshot, and then tried kubectl delete -k base. Some resources were deleted, but the command still hung. After a Ctrl-C, I could still see the pods stuck in a Terminating state.

So I ran the following commands :

kubectl delete pod/awx-task-9b6dcc459-4sfbm --grace-period=0 --force --namespace awx
kubectl delete pod/awx-web-66cfcc4f8c-nhg9k --grace-period=0 --force --namespace awx
kubectl delete pod/awx-postgres-15-0 --grace-period=0 --force --namespace awx
kubectl delete --grace-period=0 --force --namespace awx pod/awx-operator-controller-manager-7bd778dbbc-cnt2q
kubectl -n awx delete replicaset.apps/awx-operator-controller-manager-775bd7b75d
kubectl -n awx delete replicaset.apps/awx-operator-controller-manager-9874d5cfc

And then kubectl apply -k base. A few seconds later, all expected pods were running and the AWX UI and API were up.
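
For anyone who ends up in the same state, the manual steps above could be sketched roughly as follows. This is only an illustration and was not part of what I actually ran: stuck_pods and force_delete_cmds are hypothetical helper names, they read the table printed by kubectl get pods on stdin, and they only print the force-delete commands for review instead of executing them, since force deletion skips the normal cleanup.

```shell
# Hypothetical helpers: they read `kubectl get pods` table output on stdin
# and never call kubectl themselves.

# Print the name of every pod whose STATUS column is "Terminating".
# A leading "pod/" prefix (shown when several resource kinds are listed) is stripped.
stuck_pods() {
  awk '$3 == "Terminating" { sub(/^pod\//, "", $1); print $1 }'
}

# Turn those names into force-delete commands for review (nothing is executed).
force_delete_cmds() {
  ns="$1"
  stuck_pods | while read -r pod; do
    printf 'kubectl delete pod %s --grace-period=0 --force --namespace %s\n' "$pod" "$ns"
  done
}
```

Usage would be something like kubectl -n awx get pods | force_delete_cmds awx, followed by reading the printed commands before running any of them.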

To be honest I'm still not sure what happened, but it looks like it's solved.

Any idea ? Thanks !

kurokobo commented 1 month ago

@nicosalvadore Thanks for the report and for digging deeper into the details, it really helps me understand the situation better.

From what you shared, it seems more like the DB is frozen in a 'Terminating' state rather than just not starting up.
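
One possible way to connect this to the "Name or service not known" error in the awx-web log, offered as a reading of the posted output rather than a confirmed diagnosis: awx-postgres-15 is a headless Service (CLUSTER-IP is None), and for headless Services the cluster DNS normally publishes records only for pods that are Ready. The describe output shows the pod as Terminating with the Ready condition False, so the name would have nothing to resolve to. A minimal sketch of that check, where pod_state is a hypothetical helper that reads kubectl describe pod output on stdin:

```shell
# Hypothetical helper: summarize the pod-level status and the Ready condition
# from `kubectl describe pod` output read on stdin.
# The per-container "Ready:" line (with a colon) is deliberately not matched;
# only the "Ready" row of the Conditions table is.
pod_state() {
  awk '
    $1 == "Status:" { print "status=" $2 }
    $1 == "Ready"   { print "ready=" $2 }
  '
}
```

For example, kubectl -n awx describe pod awx-postgres-15-0 | pod_state would be expected to print status=Terminating and ready=False for the output in this issue, and kubectl -n awx get endpoints awx-postgres-15 should then show whether the Service has any addresses at all.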

I think the backup by Veeam was probably taken while the VM was running. That would make the backup only crash-consistent, which might have led to data inconsistencies after restoring, since the integrity of the internal data for K3s (etcd) wasn't guaranteed.

So, I agree that forcefully deleting the resources stuck in 'Terminating' and redeploying is definitely the right approach.

Just one thing to double-check: are the credentials stored in AWX working correctly? If kubectl delete ns awx ended up deleting the awx-secret-key secret in the awx namespace, it might have been recreated by the AWX Operator, which could mean you can’t decrypt any sensitive info like credentials anymore.
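
A rough way to run that check, assuming the secret is named awx-secret-key and stores its value under the secret_key key as in the kustomization.yaml you posted. secret_key_matches is a hypothetical helper, not something the AWX Operator provides: it reads the base64-encoded value on stdin and compares the decoded result with the value you expect.

```shell
# Hypothetical helper: succeeds (exit status 0) when the base64-encoded value
# read from stdin decodes to the expected plain-text value passed as $1.
secret_key_matches() {
  expected="$1"
  actual=$(base64 -d)
  [ "$actual" = "$expected" ]
}
```

Usage would look like kubectl -n awx get secret awx-secret-key -o jsonpath='{.data.secret_key}' | secret_key_matches 'my-secret-awx-key' && echo unchanged, with the placeholder replaced by your real key. A mismatch would suggest the secret was regenerated.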

nicosalvadore commented 1 month ago

Hi @kurokobo !

Thanks a lot for your answer. You might be correct about the pod being frozen while terminating. I'm still wondering why, though. You're right that the Veeam backup was taken while the VM was running, which could have caused the issue. But it's kind of strange that the same issue occurred after migrating/converting the VM to Nutanix's hypervisor, because in that case VM snapshots are used to convert the VM, and I have often used snapshots on this k3s VM when doing AWX Operator upgrades.

It's possible that taking the backup itself caused the issue on the live VM, and that the services were down from that moment on, and thus were down too while migrating from vSphere to Nutanix. I admit I didn't check if AWX was up before starting the migration process. So it might just be bad luck, who knows...

The credentials are working correctly, yes ! I believe it's because I defined my own in kustomization.yaml.

Nevertheless, this issue has been a good learning exercise on AWX and k8s 😛

kurokobo commented 1 month ago

@nicosalvadore Thanks for updating!

It’s definitely a strange situation, but if you keep forcefully powering off the virtual machine, there might be times when it breaks and times when it doesn’t, so if bad luck strikes, something like this could happen. Nobody really knows...

> The credentials are working correctly, yes ! I believe it's because I defined my own in kustomization.yaml.

Great! I'm relieved to hear that.

> Nevertheless, this issue has been a good learning exercise on AWX and k8s 😛

Troubleshooting is always a great source of learning, especially when we can take our time with it. I also use the tons of questions I get from everyone as a way to learn myself, so thank you for sharing your trouble with me 😃

I’ll close this issue, but feel free to reach out if you need anything else.

kurokobo / awx-on-k3s