kurokobo / awx-on-k3s

An example implementation of AWX on single node K3s using AWX Operator, with easy-to-use simplified configuration with ownership of data and passwords.
MIT License

POD AWX-postgres-15-0 going in Init:CrashLoopBackOff during installation #343

Closed garet80 closed 2 months ago

garet80 commented 2 months ago

Environment

k3s version v1.29.3+k3s1 (8aecc26b)
go version go1.21.8

Description

Hi guys, nice to meet you.

I have a problem with a fresh installation on a Red Hat 8.9 virtual machine.

When I deploy AWX, the pod awx-postgres-15-0 goes into Init:CrashLoopBackOff.

From what I can see in the logs of the awx-operator-controller-manager deployment, once the database task starts it returns the following in a loop until it fails:

--------------------------- Ansible Task StdOut -------------------------------

TASK [installer : Wait for Database to initialize if managed DB] ***************
task path: /opt/ansible/roles/installer/tasks/database_configuration.yml:240

-------------------------------------------------------------------------------
{"level":"info","ts":"2024-04-15T08:26:44Z","logger":"logging_event_handler","msg":"[playbook task start]","name":"awx","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"playbook_on_task_start","job":"6107868404438235086","EventData.Name":"installer : Wait for Database to initialize if managed DB"}
{"level":"info","ts":"2024-04-15T08:26:44Z","logger":"proxy","msg":"cache miss: /v1, Kind=PodList err-Index with name field:status.phase does not exist"}

Before the pod crashed, I found this warning in the pod description:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m38s  default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

I followed the guide and have not made any particular customizations.

I have already tried these commands:

kubectl patch pv <PV Name> -p '{"spec":{"claimRef": null}}'
# Delete the PV
kubectl delete pv <PV Name>

# Recreate the PV
kubectl apply -k base

as described in https://github.com/kurokobo/awx-on-k3s/blob/main/tips/troubleshooting.md, but the error remains the same.

An old version of AWX (17.1.0) is already installed and running on Docker on this server; I don't think this should be a problem for the deployment.

$ docker ps
CONTAINER ID   IMAGE                                       COMMAND                  CREATED         STATUS      PORTS                                       NAMES
80e0cce75404   quay.io/purestorage/pure-exporter:1.2.5-a   "gunicorn pure_expor…"   23 months ago   Up 4 days   0.0.0.0:9491->9491/tcp, :::9491->9491/tcp   pure-exporter
0a40ca11753e   ansible/awx:17.1.0                          "/usr/bin/tini -- /u…"   3 years ago     Up 4 days   8052/tcp                                    awx_task
150070d7647c   ansible/awx:17.1.0                          "/usr/bin/tini -- /b…"   3 years ago     Up 4 days   0.0.0.0:8180->8052/tcp, :::8180->8052/tcp   awx_web
39d835691e62   postgres:12                                 "docker-entrypoint.s…"   3 years ago     Up 4 days   5432/tcp                                    awx_postgres
29f1d34fb398   f9b990972689                                "docker-entrypoint.s…"   3 years ago     Up 4 days   6379/tcp                                    awx_redis
$

Do you have any suggestion about this error?

Thanks Regards

Steps to Reproduce

kubectl apply -k base

Logs

$ kubectl -n awx get pod
NAME                                              READY   STATUS                  RESTARTS          AGE
awx-operator-controller-manager-9874d5cfc-g9dr6   2/2     Running                 0                 2d12h
awx-postgres-15-0                                 0/1     Init:CrashLoopBackOff   718 (4m21s ago)   2d12h
$

$ kubectl -n awx get all
NAME                                                  READY   STATUS                  RESTARTS          AGE
pod/awx-operator-controller-manager-9874d5cfc-g9dr6   2/2     Running                 0                 2d13h
pod/awx-postgres-15-0                                 0/1     Init:CrashLoopBackOff   721 (4m46s ago)   2d13h

NAME                                                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.196.49   <none>        8443/TCP   2d13h
service/awx-postgres-15                                   ClusterIP   None           <none>        5432/TCP   2d13h

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           2d13h

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-9874d5cfc   1         1         1       2d13h

NAME                               READY   AGE
statefulset.apps/awx-postgres-15   0/1     2d13h
$

$ kubectl -n awx describe pod awx-postgres-15-0

Name:             awx-postgres-15-0
Namespace:        awx
Priority:         0
Service Account:  default
Node:             vm-ansible-run-01/172.26.2.118
Start Time:       Fri, 12 Apr 2024 21:42:57 +0200
Labels:           app.kubernetes.io/component=database
                  app.kubernetes.io/instance=postgres-15-awx
                  app.kubernetes.io/managed-by=awx-operator
                  app.kubernetes.io/name=postgres-15
                  app.kubernetes.io/part-of=awx
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=awx-postgres-15-5d69bb47df
                  statefulset.kubernetes.io/pod-name=awx-postgres-15-0
Annotations:      <none>
Status:           Pending
IP:               10.42.0.22
IPs:
  IP:           10.42.0.22
Controlled By:  StatefulSet/awx-postgres-15
Init Containers:
  init:
    Container ID:  containerd://560b08123f0be1ad141ef5fb16ccccd427ca79e1caa7cf6bb986a4ba556a35be
    Image:         quay.io/sclorg/postgresql-15-c9s:latest
    Image ID:      quay.io/sclorg/postgresql-15-c9s@sha256:b8b927c2c5b67299cd7d840a8c885638dcb09650057a50e610e177692a1ce9b7
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      chown 26:0 /var/lib/pgsql/data
      chmod 700 /var/lib/pgsql/data

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Mon, 15 Apr 2024 10:34:51 +0200
      Finished:     Mon, 15 Apr 2024 10:34:51 +0200
    Ready:          False
    Restart Count:  718
    Environment:    <none>
    Mounts:
      /var/lib/pgsql/data from postgres-15 (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ddjbx (ro)
Containers:
  postgres:
    Container ID:
    Image:          quay.io/sclorg/postgresql-15-c9s:latest
    Image ID:
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      POSTGRESQL_DATABASE:        <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_USER:            <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_PASSWORD:        <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_DB:                <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/pgsql/data/userdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/pgsql/data from postgres-15 (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ddjbx (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  postgres-15:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-15-awx-postgres-15-0
    ReadOnly:   false
  kube-api-access-ddjbx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Warning  BackOff  104s (x16838 over 2d12h)  kubelet  Back-off restarting failed container init in pod awx-postgres-15-0_awx(6d782c57-57d8-4639-960e-1d36d68aaae9)

$ kubectl -n awx logs -f awx-postgres-15-0 -c postgres
Error from server (BadRequest): container "postgres" in pod "awx-postgres-15-0" is waiting to start: PodInitializing
$

$ kubectl -n awx logs -f deployments/awx-operator-controller-manager
[...]
FAILED - RETRYING: [localhost]: Wait for Database to initialize if managed DB (1 retries left).\nfatal: [localhost]: FAILED! => {\"api_found\": true, \"attempts\": 60, \"changed\": false, \"resources\": []}\n\r\nPLAY RECAP *********************************************************************\r\nlocalhost                  : ok=49   changed=0    unreachable=0    failed=1    skipped=28   rescued=0    ignored=0   \n","job":"6107868404438235086","name":"awx","namespace":"awx","error":"exit status 2","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/ansible/runner.(*runner).Run.func1\n\tansible-operator-plugins/internal/ansible/runner/runner.go:269"}
----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx/awx) -----

PLAY RECAP *********************************************************************
localhost                  : ok=49   changed=0    unreachable=0    failed=1    skipped=28   rescued=0    ignored=0

----------
{"level":"error","ts":"2024-04-15T08:32:30Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"ff844827-ee6f-4f41-98a7-a263d8fe6380","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

--------------------------- Ansible Task StdOut -------------------------------

TASK [installer : Wait for Database to initialize if managed DB] ***************
task path: /opt/ansible/roles/installer/tasks/database_configuration.yml:240

-------------------------------------------------------------------------------
{"level":"info","ts":"2024-04-15T08:26:44Z","logger":"logging_event_handler","msg":"[playbook task start]","name":"awx","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"playbook_on_task_start","job":"6107868404438235086","EventData.Name":"installer : Wait for Database to initialize if managed DB"}
{"level":"info","ts":"2024-04-15T08:26:44Z","logger":"proxy","msg":"cache miss: /v1, Kind=PodList err-Index with name field:status.phase does not exist"}

POD DESCRIBE BEFORE CRASH:

$ kubectl get pods -n awx
NAME                                              READY   STATUS    RESTARTS   AGE
awx-operator-controller-manager-9874d5cfc-hhcwg   2/2     Running   0          8m38s
awx-postgres-15-0                                 0/1     Pending   0          5m33s
$

$ kubectl get pv
NAME                     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS          VOLUMEATTRIBUTESCLASS   REASON   AGE
awx-projects-volume      2Gi        RWO            Retain           Bound    awx/awx-projects-claim              awx-projects-volume   <unset>                          34s
awx-postgres-15-volume   8Gi        RWO            Retain           Bound    awx/postgres-15-awx-postgres-15-0   awx-postgres-volume   <unset>                          34s

$ kubectl -n awx describe pod awx-postgres-15
Name:             awx-postgres-15-0
Namespace:        awx
Priority:         0
Service Account:  default
Node:             vm-ansible-run-01/172.26.2.118
Start Time:       Mon, 15 Apr 2024 11:32:16 +0200
Labels:           app.kubernetes.io/component=database
                  app.kubernetes.io/instance=postgres-15-awx
                  app.kubernetes.io/managed-by=awx-operator
                  app.kubernetes.io/name=postgres-15
                  app.kubernetes.io/part-of=awx
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=awx-postgres-15-5d69bb47df
                  statefulset.kubernetes.io/pod-name=awx-postgres-15-0
Annotations:      <none>
Status:           Pending
IP:               10.42.0.30
IPs:
  IP:           10.42.0.30
Controlled By:  StatefulSet/awx-postgres-15
Init Containers:
  init:
    Container ID:  containerd://050c77c5bf40efb167acc09187773b82e340932f14bc6f307cf236509bd618c6
    Image:         quay.io/sclorg/postgresql-15-c9s:latest
    Image ID:      quay.io/sclorg/postgresql-15-c9s@sha256:b8b927c2c5b67299cd7d840a8c885638dcb09650057a50e610e177692a1ce9b7
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      chown 26:0 /var/lib/pgsql/data
      chmod 700 /var/lib/pgsql/data

    State:          Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Mon, 15 Apr 2024 11:32:33 +0200
      Finished:     Mon, 15 Apr 2024 11:32:33 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Mon, 15 Apr 2024 11:32:18 +0200
      Finished:     Mon, 15 Apr 2024 11:32:18 +0200
    Ready:          False
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /var/lib/pgsql/data from postgres-15 (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6xglc (ro)
Containers:
  postgres:
    Container ID:
    Image:          quay.io/sclorg/postgresql-15-c9s:latest
    Image ID:
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      POSTGRESQL_DATABASE:        <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_USER:            <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_PASSWORD:        <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_DB:                <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/pgsql/data/userdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/pgsql/data from postgres-15 (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6xglc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  postgres-15:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-15-awx-postgres-15-0
    ReadOnly:   false
  kube-api-access-6xglc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  24s               default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         22s               default-scheduler  Successfully assigned awx/awx-postgres-15-0 to vm-ansible-run-01
  Normal   Pulled            5s (x3 over 22s)  kubelet            Container image "quay.io/sclorg/postgresql-15-c9s:latest" already present on machine
  Normal   Created           5s (x3 over 22s)  kubelet            Created container init
  Normal   Started           5s (x3 over 21s)  kubelet            Started container init
  Warning  BackOff           4s (x2 over 19s)  kubelet            Back-off restarting failed container init in pod awx-postgres-15-0_awx(330eb03d-c9a6-43aa-9fd4-10e829d256a4)
$

Files

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
...

kurokobo commented 2 months ago

@garet80 Hi, the status Init:CrashLoopBackOff means that the init container is failing. Could you please gather the logs from the init container?

kubectl -n awx logs awx-postgres-15-0 -c init

F.Y.I., the Warning message you found (FailedScheduling) is resolved in the next line (Scheduled). It is not the cause of your issue.

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  24s               default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         22s               default-scheduler  Successfully assigned awx/awx-postgres-15-0 to vm-ansible-run-01

garet80 commented 2 months ago

@kurokobo thanks for the answer and the information about the error.

This is the output of the command you requested:

$ kubectl -n awx logs awx-postgres-15-0 -c init
Fatal glibc error: CPU does not support x86-64-v2
$

The VM was created on VMware 7.

Edit:

Searching the web, I think I have found the solution:

https://community.veeam.com/kasten-k10-support-92/fatal-glibc-error-cpu-does-not-support-x86-64-v2-4936?postid=41619#post41619

"If you are running on ESXi, ensure the cluster EVC mode is ‘Haswell’ or higher. EVC settings lower than Haswell disable chipset functions that is required by x86-64-v2"

If we can't change the EVC mode, is there some other way to deploy AWX (especially the postgres pod)?

Thanks Regards

kurokobo commented 2 months ago

@garet80

Thanks for providing the logs! As you found, this is caused by the chipset functions available on your vCPUs.

If we can't change the EVC mode, is there some other way to deploy AWX (especially the postgres pod)?

Unfortunately, no. x86-64-v2 support is mandatory for recent AWX and PostgreSQL images, because these container images are built on CentOS Stream 9, which requires x86-64-v2 support: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/9.0_release_notes/architectures
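
For anyone who wants to check a host or VM before deploying, here is a minimal sketch that compares the CPU flags in /proc/cpuinfo against the x86-64-v2 feature set. It assumes a Linux host; the names `V2_FLAGS` and `missing_v2_flags` are made up for this example and are not part of AWX, K3s, or the guide.

```shell
#!/bin/sh
# Flags that /proc/cpuinfo reports for the x86-64-v2 feature set
# (pni is the kernel's name for SSE3).
V2_FLAGS="cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3"

# missing_v2_flags "<space-separated cpu flags>"
# Prints, one per line, each required flag that is absent from the given list.
missing_v2_flags() {
    for flag in $V2_FLAGS; do
        case " $1 " in
            *" $flag "*) ;;                 # flag present, nothing to report
            *) printf '%s\n' "$flag" ;;     # flag missing
        esac
    done
}

# Read the flags of the first CPU entry (if any) and report the result.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2) || true
    missing=$(missing_v2_flags "$cpu_flags" | tr '\n' ' ')
    if [ -z "$missing" ]; then
        echo "x86-64-v2: all required flags present"
    else
        echo "x86-64-v2: missing flags: $missing"
    fi
fi
```

If any flag is reported missing inside the VM, the vCPU is probably being masked below x86-64-v2 (for example by a low EVC mode), and the postgres init container would be expected to fail with the glibc error shown above.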

adel-barout commented 2 months ago

I am installing AWX for the first time and I have the same issue.

(two screenshots attached)

kurokobo commented 2 months ago

@adel-barout That is not the same issue; please read our comments 😞 Also, please follow my guide. You may have skipped some steps, since the steps in my guide are designed to avoid any Permission denied errors.

adel-barout commented 2 months ago

@kurokobo, thank you very much for the quick response. It works now.

garet80 commented 2 months ago

@kurokobo I have changed the compatibility of the VM and now everything works fine. Thanks for the help.

I'm going to use this space for another question before closing it:

Is it possible to bind a directory outside the AWX pod in which to put the playbooks, and tell AWX to use them? (In the old version it was possible to use a directory outside the Docker container by changing the installer YAML; for example, I set /var/lib/awx/projects/ as the directory.)

Thanks regards

kurokobo commented 2 months ago

@garet80 For that purpose, my guide is already designed to mount /data/projects on your host as /var/lib/awx/projects in AWX. If you've followed my guide, you probably already have that.

Refer to: https://github.com/kurokobo/awx-on-k3s/blob/main/tips/manual-project.md
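
As a rough sketch of what placing a playbook in that host directory could look like: the function name `create_manual_project`, the project name `demo`, and the playbook `hello.yml` are made up for illustration; only /data/projects and /var/lib/awx/projects come from the guide.

```shell
#!/bin/sh
# create_manual_project ROOT NAME
# Creates ROOT/NAME containing a minimal playbook. ROOT is expected to be the
# host directory that is mounted as /var/lib/awx/projects in AWX
# (/data/projects in the guide). NAME and hello.yml are illustrative only.
create_manual_project() {
    root=$1
    name=$2
    mkdir -p "$root/$name" || return 1
    cat > "$root/$name/hello.yml" <<'EOF'
---
- name: Hello from a manual project
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Print a message
      ansible.builtin.debug:
        msg: Hello
EOF
}
```

On the host, this could be called as `create_manual_project /data/projects demo` by a user allowed to write there. The `demo` directory should then be selectable as the Playbook Directory of a project with the Manual source control type, provided the files are readable by the user AWX runs as.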

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 4 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been open 2 weeks with no activity.