AWX process slow to start when running a template

09cicada commented 2 months ago

Environment

K3S version: v1.25.4+k3s1 (0dc63334)

OS: Rocky Linux 8.7
Kubernetes/K3s: 1.25
AWX Operator: 2.2.1

Description

Hello Mr Kurokobo. I have a question about troubleshooting slow startup when launching a job/template. When I first deployed AWX, synchronizing projects and running templates was noticeably faster than it has been recently. When a template is run, a pod/automation-job container is created, and it sometimes takes 2 to 4 minutes before any output appears.

When this happens, I run kubectl get all -n awx and see the pod in the ContainerCreating state for long periods. For example:

pod/automation-job-3583-shdjp 0/1 ContainerCreating

Do you know how I can troubleshoot this specific issue? I looked at your troubleshooting guide but did not see anything specific to it. If I missed it, I apologize in advance.

Thank you

kurokobo commented 2 months ago

Hi, could you please gather the Events section from the kubectl describe output for your automation job pod while the issue is occurring?

kubectl -n awx describe pod automation-job-3583-shdjp

You should see some events at the bottom of the output, like this:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  37s   default-scheduler  Successfully assigned awx/automation-job-1-vc4c4 to kuro-c9s01.krkb.lab
  Normal  Pulling    37s   kubelet            Pulling image "quay.io/ansible/awx-ee:latest"
  Normal  Pulled     8s    kubelet            Successfully pulled image "quay.io/ansible/awx-ee:latest" in 29.325s (29.326s including waiting)
  Normal  Created    8s    kubelet            Created container worker
  Normal  Started    8s    kubelet            Started container worker

I suspect that your pod spends most of that time pulling the image from the container registry. In the example above, pulling the image takes 29 seconds (see the Message column of the Pulled line, or take the difference between the Age of Pulling (37s) and Pulled (8s)). This way you can see which events take the most time before the Created event is recorded.
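To focus on the pull timing only, the describe output can be filtered down to the two relevant rows. This is a minimal sketch, assuming the Events table has the column layout shown above; the helper name pull_events is made up for illustration.

```shell
# Read `kubectl describe pod` output on stdin and keep only the rows
# whose Reason column (second field) is Pulling or Pulled.
pull_events() {
    awk '$2 == "Pulling" || $2 == "Pulled"'
}

# Usage (pod name taken from the report above):
#   kubectl -n awx describe pod automation-job-3583-shdjp | pull_events
```

The Pulled row carries the measured pull duration in its Message column, so a long value there confirms that the image pull is the slow step.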

If pulling the image is what takes a long time, the options are limited:

- Reduce how often the container image is pulled by changing the pull policy to Missing for the EE used by your Job Template
- Speed up the connection to the container registry
- Increase the storage capacity for K3s under /var/lib/rancher so that kubelet garbage collection removes cached images less often (see the official docs)
- Store your EE images in a private container registry on the same network, or on the same K3s host, and make AWX use the EE image from that registry
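For the storage option, it helps to check how full the filesystem behind /var/lib/rancher is. The sketch below assumes the kubelet's default image garbage-collection high threshold of 85% disk usage, which your cluster may override; the function names disk_used_pct and image_gc_risk are made up for illustration.

```shell
# Print the used-space percentage (without the % sign) of the
# filesystem that holds the given path, using POSIX df output.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print "at-risk" when usage is at or above the threshold, else "ok".
# The threshold defaults to 85, the assumed kubelet default; pass a
# second argument if your kubelet is configured differently.
image_gc_risk() {
    pct=$(disk_used_pct "$1")
    if [ "$pct" -ge "${2:-85}" ]; then
        echo "at-risk"
    else
        echo "ok"
    fi
}

# Usage on the K3s host:
#   image_gc_risk /var/lib/rancher
```

To see whether the EE image is currently cached on the node, k3s crictl images lists the images held by the bundled containerd.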

09cicada commented 2 months ago

Hello Mr. Kurokobo, spot on, it was indeed the pulling of awx-ee:latest:

  Normal  Pulling    2m7s  kubelet            Pulling image "quay.io/ansible/awx-ee:latest"

I am going to change the pull policy to Missing. I will also add some space to /var/lib/rancher. I really appreciate the advice and help. I will close this, and thank you once again.
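For reference, the pull policy can also be changed through the AWX REST API instead of the web UI. This is a sketch only, assuming the execution environment resource exposes a pull field that accepts the value missing, and that authentication uses an OAuth2 token; AWX_URL, AWX_TOKEN, and the EE id are placeholders you must supply, and the function name is made up for illustration.

```shell
# Set the pull policy of one execution environment to "missing".
# Assumptions: AWX_URL is the base URL of your AWX instance, AWX_TOKEN
# is a valid OAuth2 token, and $1 is the numeric id of the EE.
set_ee_pull_missing() {
    ee_id="$1"
    curl -sS -X PATCH \
        -H "Authorization: Bearer ${AWX_TOKEN}" \
        -H "Content-Type: application/json" \
        -d '{"pull": "missing"}' \
        "${AWX_URL}/api/v2/execution_environments/${ee_id}/"
}

# Usage (all values are placeholders):
#   AWX_URL=https://awx.example.com AWX_TOKEN=... set_ee_pull_missing 1
```

With this policy the image is pulled only when it is not already cached on the node, so a tag such as latest is no longer refreshed on every job run.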

kurokobo / awx-on-k3s

AWX process slow to start when running a template #358
