kurokobo / awx-on-k3s

An example implementation of AWX on a single-node K3s cluster using AWX Operator, with an easy-to-use, simplified configuration that keeps ownership of data and passwords.

error on deployment: Reconciler error #391

Closed: madsholme closed this issue 2 days ago

madsholme commented 5 days ago

Environment

k3s version v1.29.6+k3s2 (b4b156d9)
go version go1.21.11

cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Description

I followed the guide in this repo, and at the end I got:

----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx/awx) -----

PLAY RECAP *********************************************************************
localhost                  : ok=67   changed=0    unreachable=0    failed=1    skipped=70   rescued=0    ignored=0

----------
{"level":"error","ts":"2024-09-22T08:31:11Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"1aa486ba-1f7f-489a-9714-d95a57d9ab06","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

I tried rebuilding the machine and running everything again, and it ends with the same error. Any idea what's missing?

Steps to Reproduce

  1. git clone https://github.com/kurokobo/awx-on-k3s.git
  2. cd awx-on-k3s
  3. git checkout 2.19.1
  4. kubectl apply -k operator
  5. AWX_HOST="awx.home.local"
  6. openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -out ./base/tls.crt -keyout ./base/tls.key -subj "/CN=${AWX_HOST}/O=${AWX_HOST}" -addext "subjectAltName = DNS:${AWX_HOST}"
  7. edit base/awx.yaml and base/kustomization.yaml (a sketch of these edits follows the list)
  8. sudo mkdir -p /data/postgres-15
  9. sudo mkdir -p /data/projects
  10. sudo chown 1000:0 /data/projects
  11. kubectl apply -k base
  12. kubectl -n awx logs -f deployments/awx-operator-controller-manager
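
For reference, the edits in step 7 were along these lines (excerpts only; field names follow this repo's base manifests, and the password values shown are placeholders rather than the ones actually used):

# base/awx.yaml (excerpt): hostname the ingress should serve
spec:
  hostname: awx.home.local

# base/kustomization.yaml (excerpt): password literals in the secretGenerator;
# the values here are placeholders, not the real ones
secretGenerator:
  - name: awx-postgres-configuration
    type: Opaque
    literals:
      - password=<db-password>
  - name: awx-admin-password
    type: Opaque
    literals:
      - password=<admin-password>
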
root@ubuntutest01:~/awx-on-k3s# kubectl -n awx get awx,all,ingress,secrets
NAME                      AGE
awx.awx.ansible.com/awx   64m

NAME                                                   READY   STATUS             RESTARTS         AGE
pod/awx-operator-controller-manager-745b55d94b-mkk5b   2/2     Running            0                64m
pod/awx-postgres-15-0                                  1/1     Running            0                64m
pod/awx-task-654b46cf66-87s96                          0/4     Init:0/3           0                63m
pod/awx-web-66f7b8bcf6-pcn92                           2/3     CrashLoopBackOff   10 (2m33s ago)   63m

NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.43.2.3       <none>        8443/TCP   64m
service/awx-postgres-15                                   ClusterIP   None            <none>        5432/TCP   64m
service/awx-service                                       ClusterIP   10.43.172.186   <none>        80/TCP     63m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           64m
deployment.apps/awx-task                          0/1     1            0           63m
deployment.apps/awx-web                           0/1     1            0           63m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-745b55d94b   1         1         1       64m
replicaset.apps/awx-task-654b46cf66                          1         1         0       63m
replicaset.apps/awx-web-66f7b8bcf6                           1         1         0       63m

NAME                               READY   AGE
statefulset.apps/awx-postgres-15   1/1     64m

NAME                                    CLASS     HOSTS            ADDRESS         PORTS     AGE
ingress.networking.k8s.io/awx-ingress   traefik   awx.home.local   192.168.0.164   80, 443   63m

NAME                                  TYPE                DATA   AGE
secret/awx-admin-password             Opaque              1      64m
secret/awx-app-credentials            Opaque              3      63m
secret/awx-broadcast-websocket        Opaque              1      64m
secret/awx-postgres-configuration     Opaque              6      64m
secret/awx-receptor-ca                kubernetes.io/tls   2      64m
secret/awx-receptor-work-signing      Opaque              2      63m
secret/awx-secret-key                 Opaque              1      64m
secret/awx-secret-tls                 kubernetes.io/tls   2      64m
secret/redhat-operators-pull-secret   Opaque              1      64m
kurokobo commented 5 days ago

@madsholme Hi,

pod/awx-web-66f7b8bcf6-pcn92 2/3 CrashLoopBackOff 10 (2m33s ago) 63m

The typical cause in this situation is IPv6-related, but logs are needed to make a determination. Could you please share the logs from the web pod?

kubectl -n awx logs deployment/awx-web -c awx-web
kubectl -n awx logs deployment/awx-web -c awx-rsyslog
kubectl -n awx logs deployment/awx-web -c redis
madsholme commented 5 days ago

@kurokobo Hi, thanks for the answer.

Sure, here are the logs:

ubuntutest01:~/awx-on-k3s# kubectl -n awx logs deployment/awx-web -c awx-web
2024-09-22 10:53:39,568 INFO RPC interface 'supervisor' initialized
2024-09-22 10:53:39,568 INFO RPC interface 'supervisor' initialized
2024-09-22 10:53:39,568 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-09-22 10:53:39,568 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-09-22 10:53:39,568 INFO supervisord started with pid 7
2024-09-22 10:53:39,568 INFO supervisord started with pid 7
2024-09-22 10:53:40,571 INFO spawned: 'superwatcher' with pid 13
2024-09-22 10:53:40,571 INFO spawned: 'superwatcher' with pid 13
2024-09-22 10:53:40,572 INFO spawned: 'nginx' with pid 14
2024-09-22 10:53:40,572 INFO spawned: 'nginx' with pid 14
2024-09-22 10:53:40,573 INFO spawned: 'uwsgi' with pid 15
2024-09-22 10:53:40,573 INFO spawned: 'uwsgi' with pid 15
2024-09-22 10:53:40,574 INFO spawned: 'daphne' with pid 16
2024-09-22 10:53:40,574 INFO spawned: 'daphne' with pid 16
2024-09-22 10:53:40,576 INFO spawned: 'awx-cache-clear' with pid 17
2024-09-22 10:53:40,576 INFO spawned: 'awx-cache-clear' with pid 17
2024-09-22 10:53:40,577 INFO spawned: 'ws-heartbeat' with pid 18
2024-09-22 10:53:40,577 INFO spawned: 'ws-heartbeat' with pid 18
READY
[uWSGI] getting INI configuration from /etc/tower/uwsgi.ini
*** Starting uWSGI 2.0.24 (64bit) on [Sun Sep 22 10:53:40 2024] ***
compiled with version: 11.4.1 20231218 (Red Hat 11.4.1-3) on 02 July 2024 20:15:47
os: Linux-6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
nodename: awx-web-66f7b8bcf6-pcn92
machine: x86_64
clock source: unix
detected number of CPU cores: 16
current working directory: /var/lib/awx
detected binary path: /var/lib/awx/venv/awx/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 127.0.0.1:8050 fd 3
Python version: 3.11.9 (main, Jun 11 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x79616e431cf8
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 609552 bytes (595 KB) for 5 cores
*** Operational MODE: preforking ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 15)
spawned uWSGI worker 1 (pid: 20, cores: 1)
spawned uWSGI worker 2 (pid: 21, cores: 1)
spawned uWSGI worker 3 (pid: 22, cores: 1)
spawned uWSGI worker 4 (pid: 23, cores: 1)
spawned uWSGI worker 5 (pid: 24, cores: 1)
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
2024-09-22 10:53:41,616 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-09-22 10:53:41,616 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-09-22 10:54:18,142 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:18,142 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:35,392 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:35,392 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:35,879 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:35,879 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:36,409 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:36,409 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:36,840 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-22 10:54:36,840 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
ubuntutest01:~/awx-on-k3s# kubectl -n awx logs deployment/awx-web -c awx-rsyslog
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3
[wait-for-migrations] Waiting 2 seconds before next attempt
[wait-for-migrations] Attempt 4
[wait-for-migrations] Waiting 4 seconds before next attempt
[wait-for-migrations] Attempt 5
[wait-for-migrations] Waiting 8 seconds before next attempt
[wait-for-migrations] Attempt 6
[wait-for-migrations] Waiting 16 seconds before next attempt
[wait-for-migrations] Attempt 7
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 8
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 9
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 10
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 11
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 12
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 13
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 14
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 15
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 16
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 17
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 18
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 19
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 20
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 21
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 22
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 23
[wait-for-migrations] Waiting 30 seconds before next attempt
[wait-for-migrations] Attempt 24
[wait-for-migrations] Waiting 30 seconds before next attempt
root@ubuntutest01:~/awx-on-k3s# kubectl -n awx logs deployment/awx-web -c redis
1:C 22 Sep 2024 07:37:32.490 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 22 Sep 2024 07:37:32.490 * Redis version=7.4.0, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 22 Sep 2024 07:37:32.490 * Configuration loaded
1:M 22 Sep 2024 07:37:32.490 * monotonic clock: POSIX clock_gettime
1:M 22 Sep 2024 07:37:32.491 * Running mode=standalone, port=0.
1:M 22 Sep 2024 07:37:32.491 * Server initialized
1:M 22 Sep 2024 07:37:32.491 * Ready to accept connections unix
root@ubuntutest01:~/awx-on-k3s#

Thanks for taking the time.

Also, I don't know if it's useful, but here is the ip a output:

ubuntutest01:~/awx-on-k3s# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:00:aa:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.164/24 metric 100 brd 192.168.0.255 scope global dynamic eth0
       valid_lft 167294sec preferred_lft 167294sec
    inet6 fe80::215:5dff:fe00:aa00/64 scope link
       valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether ba:7b:65:a3:0a:ca brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::b87b:65ff:fea3:aca/64 scope link
       valid_lft forever preferred_lft forever
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 5a:65:7f:1a:85:03 brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::5865:7fff:fe1a:8503/64 scope link
       valid_lft forever preferred_lft forever
6: veth65cee607@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether 86:d1:57:e1:bd:a6 brd ff:ff:ff:ff:ff:ff link-netns cni-b1108563-76c1-6239-83ff-39fa3d34d600
    inet6 fe80::84d1:57ff:fee1:bda6/64 scope link
       valid_lft forever preferred_lft forever
7: veth9da25ed9@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether ba:97:6e:db:88:60 brd ff:ff:ff:ff:ff:ff link-netns cni-1e99bd93-e750-beaa-081b-9200a01ddc3c
    inet6 fe80::b897:6eff:fedb:8860/64 scope link
       valid_lft forever preferred_lft forever
8: veth1194eb79@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether 4a:70:99:73:40:0d brd ff:ff:ff:ff:ff:ff link-netns cni-da9d8e9e-f007-7811-8510-301eae653385
    inet6 fe80::4870:99ff:fe73:400d/64 scope link
       valid_lft forever preferred_lft forever
10: veth2b1ced5b@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether fa:80:64:e1:28:45 brd ff:ff:ff:ff:ff:ff link-netns cni-d7d2f88e-2ed4-916d-7072-a3e80f15fda9
    inet6 fe80::6c40:6cff:fe0d:b194/64 scope link
       valid_lft forever preferred_lft forever
11: veth56849f55@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether de:35:95:19:5d:af brd ff:ff:ff:ff:ff:ff link-netns cni-b40a3648-bfc7-6f96-a240-a51314b5fd37
    inet6 fe80::dc35:95ff:fe19:5daf/64 scope link
       valid_lft forever preferred_lft forever
12: veth8c1d698f@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether a6:8c:90:77:bb:ef brd ff:ff:ff:ff:ff:ff link-netns cni-b9917d8a-bcba-7ff6-c3a9-61d62f215558
    inet6 fe80::a48c:90ff:fe77:bbef/64 scope link
       valid_lft forever preferred_lft forever
13: veth467f23e8@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether e6:a9:d6:8e:11:1b brd ff:ff:ff:ff:ff:ff link-netns cni-fe656587-3829-9a57-52cc-f43c5c580392
    inet6 fe80::e4a9:d6ff:fe8e:111b/64 scope link
       valid_lft forever preferred_lft forever
14: vethee130e3f@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether fa:b0:bf:09:ff:df brd ff:ff:ff:ff:ff:ff link-netns cni-33094bb1-84d3-bd1b-4d77-6c48289f91a5
    inet6 fe80::f8b0:bfff:fe09:ffdf/64 scope link
       valid_lft forever preferred_lft forever
15: veth8dd697de@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether 06:f0:97:98:ac:0d brd ff:ff:ff:ff:ff:ff link-netns cni-6189ce5f-3c9a-ef23-5ea3-7163ebc2e7b3
    inet6 fe80::4f0:97ff:fe98:ac0d/64 scope link
       valid_lft forever preferred_lft forever
kurokobo commented 5 days ago

@madsholme Thanks for the quick update. It doesn't seem to be related to the IPv6 issue, so we need to do a bit more detailed investigation. First of all, could you please gather the complete logs from the Operator and attach them as a text file?
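
For example, a command along these lines should capture the complete log to a file for attaching (the container name awx-manager is an assumption based on the default operator deployment; adjust if yours differs):

# capture the full operator log to a file
kubectl -n awx logs deployments/awx-operator-controller-manager -c awx-manager > awx-operator.log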

madsholme commented 5 days ago

@kurokobo Thanks for the fast reply.

Yes, I hope these are the ones:

ubuntutest01:~/awx-on-k3s# ls /var/log/containers/
awx-operator-controller-manager-745b55d94b-mkk5b_awx_awx-manager-78d6baed6839dcec099bfaea10180dec2f92fcfef57b87fe6873902e2009e398.log
awx-operator-controller-manager-745b55d94b-mkk5b_awx_kube-rbac-proxy-0fb979aff8b9d7c795d60adf7e1e59102319f66554aec2d9aeee9130927d1d89.log
awx-postgres-15-0_awx_init-521d58f076c925ce7860c76c6704bc68066ee7771978c158c5cbbb47b62bbe80.log
awx-postgres-15-0_awx_postgres-f4ba4a9d1b25ccbc897f51bdd214e84e5b3b7a91a6f9f531125e487db1b34ae8.log
awx-task-654b46cf66-87s96_awx_init-database-961e71779ff7c2834df33d5d554c86bf5476b5c1fe29662c7351fed64696b69a.log
awx-web-66f7b8bcf6-pcn92_awx_awx-rsyslog-19563b206a4215aef2e9fd3055a1e44b5ad1959c0ff427d4eaf06e81e0d58b51.log
awx-web-66f7b8bcf6-pcn92_awx_awx-web-3adb426c8f7820bed03ebd7d3e1b17101568058cbfb220972e16f1518707b9a5.log
awx-web-66f7b8bcf6-pcn92_awx_init-projects-22d308dd35580ff4a64491acc8e1323daef488f6fa37bc465b1a7d09e7c5004d.log
awx-web-66f7b8bcf6-pcn92_awx_redis-2f6ce2e68429136acea9b20f14642997346d0a16666a9467b5d362a7abed75a3.log
coredns-6799fbcd5-4r25t_kube-system_coredns-0cb80abf7b8b1db8f6f54f6611692d55aba9ebdd7297e41d7cd25e08c1cf6607.log
helm-install-traefik-9qzvg_kube-system_helm-fdafe362de0ff4c36a53f1af9fceb064c0bc12868ba388044ba1a32922447c89.log
helm-install-traefik-crd-vn6d5_kube-system_helm-791fd9184f3bc07b73b9ee0c7a308d6d9751bfebc29319fc2ad9c3194666b090.log
local-path-provisioner-6f5d79df6-qlnpb_kube-system_local-path-provisioner-3d4da4ce61a9595a695fcbdcd0f1d453bd87d3c9f94ff602d5416956f8830d03.log
metrics-server-54fd9b65b-d7sns_kube-system_metrics-server-dd36120aa6205050710ae601210169910fdf5f0abca6736fa3614f9cbdd53463.log
svclb-traefik-6fd58450-2d7zj_kube-system_lb-tcp-443-bac887b6e70dcfbcfc75dab687bc1a348878c505ac2f38b7325be2e5d5054539.log
svclb-traefik-6fd58450-2d7zj_kube-system_lb-tcp-80-6b4407375034b4d434102eff2d32d4ceb6820ff285326e42b6f2ac2ed0a806e4.log


kurokobo commented 5 days ago

@madsholme Thanks for the logs.

According to the logs, the task "Check for pending migrations" has failed. Typically this is caused by issues with the connection to the database. For example, have you modified the DB password in base/kustomization.yaml after an initial deployment attempt? The data in /data/postgres-15 might still be initialized with the old password and may no longer match the latest base/kustomization.yaml.

If possible, could you delete all the data and try the deployment again?

# follow the steps in awx-on-k3s directory
cd awx-on-k3s

# remove awx related resources and the awx namespace
kubectl -n awx delete pvc postgres-15-awx-postgres-15-0 --wait=false
kubectl delete -k base
kubectl delete ns awx

# remove actual files for db
sudo rm -rf /data/postgres-15

# redeploy awx operator
kubectl apply -k operator

# redeploy
sudo mkdir -p /data/postgres-15
kubectl apply -k base
kubectl -n awx logs -f deployments/awx-operator-controller-manager
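
If you want to double-check the mismatch theory before wiping the data, a minimal sketch (assuming the password key shown in your awx-postgres-configuration secret) is to compare what the generated secret holds with base/kustomization.yaml:

# print the DB password the deployment is currently configured with; if the data
# under /data/postgres-15 was initialized with an older value, authentication fails
kubectl -n awx get secret awx-postgres-configuration -o jsonpath='{.data.password}' | base64 -d; echo
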
madsholme commented 4 days ago

@kurokobo

I just restored a state from before the AWX deployment; the only thing I changed was hostname: awx.home.local in base/awx.yaml. All passwords are at their defaults now.

And I still get:

PLAY RECAP *********************************************************************
localhost                  : ok=67   changed=0    unreachable=0    failed=1    skipped=70   rescued=0    ignored=0

----------
{"level":"error","ts":"2024-09-22T18:33:28Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"5febe5cb-f05a-4c14-828e-d59b8379a9a5","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
kurokobo commented 4 days ago

@madsholme Thanks for the update. Unfortunately, I can't determine the cause immediately. I tried creating a similar environment as a new virtual machine on vSphere, but I couldn't reproduce the issue.

$ uname -a
Linux ubuntu 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

In my environment, within a few seconds after the Pod starts, the Web Pod logs record messages like the following (the lines marked with ✅), which are not present in your logs.

$ kubectl -n awx logs deployment/awx-web -c awx-web
2024-09-23 12:49:04,363 INFO RPC interface 'supervisor' initialized
2024-09-23 12:49:04,363 INFO RPC interface 'supervisor' initialized
2024-09-23 12:49:04,363 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-09-23 12:49:04,363 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-09-23 12:49:04,363 INFO supervisord started with pid 7
2024-09-23 12:49:04,363 INFO supervisord started with pid 7
2024-09-23 12:49:05,366 INFO spawned: 'superwatcher' with pid 13
2024-09-23 12:49:05,366 INFO spawned: 'superwatcher' with pid 13
2024-09-23 12:49:05,369 INFO spawned: 'nginx' with pid 14
2024-09-23 12:49:05,369 INFO spawned: 'nginx' with pid 14
2024-09-23 12:49:05,373 INFO spawned: 'uwsgi' with pid 15
2024-09-23 12:49:05,373 INFO spawned: 'uwsgi' with pid 15
2024-09-23 12:49:05,376 INFO spawned: 'daphne' with pid 16
2024-09-23 12:49:05,376 INFO spawned: 'daphne' with pid 16
2024-09-23 12:49:05,381 INFO spawned: 'awx-cache-clear' with pid 17
2024-09-23 12:49:05,381 INFO spawned: 'awx-cache-clear' with pid 17
2024-09-23 12:49:05,386 INFO spawned: 'ws-heartbeat' with pid 18
2024-09-23 12:49:05,386 INFO spawned: 'ws-heartbeat' with pid 18
READY
[uWSGI] getting INI configuration from /etc/tower/uwsgi.ini
*** Starting uWSGI 2.0.24 (64bit) on [Mon Sep 23 12:49:05 2024] ***
compiled with version: 11.4.1 20231218 (Red Hat 11.4.1-3) on 02 July 2024 20:15:47
os: Linux-6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024
nodename: awx-web-767c9754c6-9rwc8
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /var/lib/awx
detected binary path: /var/lib/awx/venv/awx/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 127.0.0.1:8050 fd 3
Python version: 3.11.9 (main, Jun 11 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x78a455947cf8
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 609552 bytes (595 KB) for 5 cores
*** Operational MODE: preforking ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 15)
spawned uWSGI worker 1 (pid: 20, cores: 1)
spawned uWSGI worker 2 (pid: 21, cores: 1)
spawned uWSGI worker 3 (pid: 22, cores: 1)
spawned uWSGI worker 4 (pid: 23, cores: 1)
spawned uWSGI worker 5 (pid: 24, cores: 1)
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
mounting awx.wsgi:application on /
2024-09-23 12:49:06,480 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-09-23 12:49:06,480 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
✅ 2024-09-23 12:49:10,213 INFO     [-] daphne.cli Starting server at tcp:port=8051:interface=127.0.0.1
✅ 2024-09-23 12:49:10,213 INFO     Starting server at tcp:port=8051:interface=127.0.0.1
✅ 2024-09-23 12:49:10,246 INFO     [-] daphne.server HTTP/2 support not enabled (install the http2 and tls Twisted extras)
✅ 2024-09-23 12:49:10,246 INFO     HTTP/2 support not enabled (install the http2 and tls Twisted extras)
✅ 2024-09-23 12:49:10,246 INFO     [-] daphne.server Configuring endpoint tcp:port=8051:interface=127.0.0.1
✅ 2024-09-23 12:49:10,246 INFO     Configuring endpoint tcp:port=8051:interface=127.0.0.1
✅ 2024-09-23 12:49:10,246 INFO     [-] daphne.server Listening on TCP address 127.0.0.1:8051
✅ 2024-09-23 12:49:10,246 INFO     Listening on TCP address 127.0.0.1:8051
✅ WSGI app 0 (mountpoint='/') ready in 7 seconds on interpreter 0x78a455947cf8 pid: 23 (default app)
✅ WSGI app 0 (mountpoint='/') ready in 7 seconds on interpreter 0x78a455947cf8 pid: 20 (default app)
✅ WSGI app 0 (mountpoint='/') ready in 7 seconds on interpreter 0x78a455947cf8 pid: 21 (default app)
✅ WSGI app 0 (mountpoint='/') ready in 7 seconds on interpreter 0x78a455947cf8 pid: 24 (default app)
✅ WSGI app 0 (mountpoint='/') ready in 7 seconds on interpreter 0x78a455947cf8 pid: 22 (default app)
2024-09-23 12:49:36,034 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,034 INFO success: nginx entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,034 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,034 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,034 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,034 INFO success: daphne entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,035 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,035 INFO success: awx-cache-clear entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,035 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-09-23 12:49:36,035 INFO success: ws-heartbeat entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)

In your environment, there seems to be a problem with the connection from the Web Pod to the DB, or the Web Pod is not functioning properly for some other reason, and this is causing the Operator's subsequent tasks to fail.
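
As a minimal sketch of probing this from inside the Web Pod (assuming awx-manage and Django's standard showmigrations command are available in the awx-web container):

# list migration state from the web container; a hang or connection error here
# points at the Web Pod -> DB link rather than at AWX itself
kubectl -n awx exec deployment/awx-web -c awx-web -- awx-manage showmigrations main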

I apologize for not being able to help more, but if you can continue to troubleshoot, I would like you to try the following:

Also, I would like to test a bit more on my side to see if I can reproduce the issue. Could you please provide a bit more information about your environment, such as the virtualization platform, instance sizes, etc.?

madsholme commented 4 days ago

@kurokobo

You have gone above and beyond; big thanks for taking the time. It was just a standard Ubuntu 24 ISO; I installed it by clicking Next all the way through and only selected the SSH server option. I even tried rebuilding it, with the same error.

It's on Hyper-V, and the specs are 40 GB of disk, 5 GB of RAM, and 16 CPU cores (I think that's just what Hyper-V defaulted to). I'm only using Hyper-V as a test before I deploy it in a production environment.

I tried twice, but today I started it and just let it sit all day, and now it seems to be working? Suddenly it just said:

localhost : ok=90 changed=0 unreachable=0 failed=0 skipped=83 rescued=0 ignored=1

I could see it kept retrying to bring up the database. I didn't even change anything... weird.

Is there an easy way to add an execution node? I need to execute jobs in different VLANs, so I was hoping to replace Rundeck with this. From what I understand, I need to create a different mesh ingress for each VLAN and add execution nodes to those? I'm not asking for a guide, just a point in the right direction if possible.

kurokobo commented 3 days ago

@madsholme Thanks for the update. I just tried installing AWX on Ubuntu 24.04 on Hyper-V on Windows 11, but it completed without any errors, so I still can't reproduce your issue in my lab.

localhost : ok=90 changed=0 unreachable=0 failed=0 skipped=83 rescued=0 ignored=1

The operator's tasks seem to have completed with failed=0, but is AWX actually running now? Is the database still not coming up? If you can deploy CentOS on Hyper-V, it might be worth trying that.

Is there an easy way to add an execution node?

As long as you prepare a VM from the RHEL family, it can be installed with a playbook, so it's not a very complicated procedure.
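
A rough sketch of that flow, assuming the filenames the AWX install bundle ships with per the execution node documentation:

# add the instance in the AWX UI, download its install bundle, unpack it, then
# run the bundled playbook against the new execution node from a control machine
cd awx-exec-node-bundle        # hypothetical name of the unpacked bundle directory
ansible-playbook -i inventory.yml install_receptor.yml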

From what I understand, I need to create a different mesh ingress for each VLAN and add execution nodes to those?

Whether Mesh Ingress is necessary depends on the direction of the TCP connection allowed by the firewall along the path. If AWX can connect outbound to the execution node, Mesh Ingress is not needed.

[Diagram: AWX connects outbound to the execution node; no Mesh Ingress needed]

If the connection can only be made inbound to AWX (i.e., from the execution node side), then Mesh Ingress is necessary.

[Diagram: the execution node connects inbound to AWX; Mesh Ingress required]

This is in Japanese, but I have written an explanation including the above diagram on my blog. I’m not sure how well machine translation will work, but it might be helpful: https://blog.kurokobo.com/archives/5141

madsholme commented 2 days ago

@kurokobo

One thing I did change: Hyper-V had created the machine with dynamic memory by default. I undid that and gave it a fixed 5 GB. Maybe that could be the cause?

But in any case, everything is working now, so it's probably something odd on my end; no need to waste more of your time looking into it.

Also, thanks for the links and the explanation. I should be able to go without a mesh ingress then, since I can allow traffic from AWX to the node on a specific port. The auto-translation of the article is good, so it's a good resource as well. Thanks :)

kurokobo commented 2 days ago

Okay, thanks for the update! Glad to know my blog can help you 😃 I'm closing this issue; have fun with AWX 👍