In my case, the nodes in the k8s cluster autoscale on demand from the jobs, and an instance takes about 5 minutes to become ready.
When a job starts a new machine, Rundeck does not follow the log: the first loop iteration sets the variable show_log = False at line 119, so the following iterations ignore the log and just keep counting retries while the job is still running inside k8s. When the counting finishes, the execution is marked failed, but the job is still running in the cluster (a sketch of this pattern follows the log below):
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-qxtkt
INFO: kubernetes-wait-job: ========================== job log start ==========================
INFO: kubernetes-wait-job: =========================== job log end ===========================
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
ERROR: kubernetes-wait-job: Number of retries exceeded
INFO: kubernetes-wait-job: Job failed
Failed: NonZeroResultCode: Script result code was: 1
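The log above matches the pattern sketched below. This is only an illustration of the behavior described, not the plugin's actual code, and the retry handling is simplified: once the first log fetch fails because the pod is not ready, show_log stays False, so every later iteration just prints "Waiting for job completion" and burns another retry.

import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core_v1 = client.CoreV1Api()
batch_v1 = client.BatchV1Api()
name, namespace = "test-env", "rundeck"

pods = core_v1.list_namespaced_pod(namespace, label_selector="job-name==" + name)
pod_name = pods.items[0].metadata.name  # resolved once and never refreshed

show_log = True
retries = 0
while retries < 8:
    if show_log:
        try:
            print(core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace))
        except ApiException:
            show_log = False  # log following is switched off and never re-enabled
    print("Waiting for job completion")
    job = batch_v1.read_namespaced_job_status(name, namespace)
    if job.status.succeeded or job.status.failed:
        break
    retries += 1  # keeps counting while the job is still running in k8s
    time.sleep(10)  # poll interval (illustrative)
else:
    print("Number of retries exceeded")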
The same thing happens when a pod fails and k8s brings up another pod to replace it, or when a node fails and the autoscaler brings up another machine: the loop is interrupted and keeps counting retries until the execution is marked failed, while the job is still running in k8s.
I tested it with the following command in the job: while true; do date; sleep 1; done
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-lvd65
INFO: kubernetes-wait-job: ========================== job log start ==========================
WARNING: kubernetes-wait-job: Pod is not ready, status: 400
INFO: kubernetes-wait-job: waiting for log
b'Tue Mar 22 16:18:05 UTC 2022'
b'Tue Mar 22 16:18:06 UTC 2022'
b'Tue Mar 22 16:18:07 UTC 2022'
b'Tue Mar 22 16:18:08 UTC 2022'
b'Tue Mar 22 16:18:09 UTC 2022'
b'Tue Mar 22 16:18:10 UTC 2022'
b'Tue Mar 22 16:18:11 UTC 2022'
INFO: kubernetes-wait-job: =========================== job log end ===========================
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
INFO: kubernetes-wait-job: Waiting for job completion
ERROR: kubernetes-wait-job: Number of retries exceeded
INFO: kubernetes-wait-job: Job failed
Failed: NonZeroResultCode: Script result code was: 1
That happens because k8s replaces the pod, and the variable pod_name still holds the old name; when the plugin tries to fetch the logs at line 94, it gets a 404 because that pod no longer exists, since the name has changed.
I reproduced this locally in a Python console:
import os, sys
from kubernetes import client, config
from kubernetes.client.rest import ApiException
from kubernetes import watch

name = "test-env"
namespace = "rundeck"
config.load_kube_config()
core_v1 = client.CoreV1Api()

# resolve the pod created by the job
pod_list = core_v1.list_namespaced_pod(namespace, label_selector="job-name==" + name)
first_item = pod_list.items[0]
pod_name = first_item.metadata.name
print(pod_name)

# get the log
core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)

# manually interrupt the machine/pod, then fetch the log again:
# this call returns a 404 because the pod no longer exists under that
# name; pod_name still holds the old, stale value
core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)
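Re-running the pod lookup in the same console picks up the replacement pod, and reading the log with the refreshed name works again; this is exactly what the reorganized code relies on (continuing the session above, assuming the Job has already created the replacement pod):

# list the pods again: the Job's label selector now matches the new pod
pod_list = core_v1.list_namespaced_pod(namespace, label_selector="job-name==" + name)
pod_name = pod_list.items[0].metadata.name
print(pod_name)  # prints the name of the replacement pod
# the log fetch succeeds again with the refreshed name
core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)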
I reorganized the code to separate the tasks: re-resolve the pod name on each retry, then fetch the logs and the job status to finish the job. The result was as follows (a rough sketch of the reorganized loop appears after the log).
First, start the machine.
Get the log once the pod is ready, from pod test-env-xb9zh.
Kill the pod test-env-xb9zh.
Get the log once the replacement is ready, from pod test-env-gpmhp.
Kill the node and let autoscaling bring up another one.
Get the log once the new pod is ready, from pod test-env-kjr6k.
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: ========================== job log start ==========================
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 1
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 2
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 3
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 4
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 5
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 6
WARNING: kubernetes-wait-job: Pod is not ready, status: 400
INFO: kubernetes-wait-job: waiting for log
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-xb9zh
b'Tue Mar 22 19:27:47 UTC 2022'
b'Tue Mar 22 19:27:48 UTC 2022'
b'Tue Mar 22 19:27:49 UTC 2022'
b'Tue Mar 22 19:27:50 UTC 2022'
b'Tue Mar 22 19:27:51 UTC 2022'
b'Tue Mar 22 19:27:52 UTC 2022'
b'Tue Mar 22 19:27:53 UTC 2022'
b'Tue Mar 22 19:27:54 UTC 2022'
b'Tue Mar 22 19:27:55 UTC 2022'
b'Tue Mar 22 19:27:56 UTC 2022'
b'Tue Mar 22 19:27:57 UTC 2022'
b'Tue Mar 22 19:27:58 UTC 2022'
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 7
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-gpmhp
b'Tue Mar 22 19:27:59 UTC 2022'
b'Tue Mar 22 19:28:00 UTC 2022'
b'Tue Mar 22 19:28:01 UTC 2022'
b'Tue Mar 22 19:28:02 UTC 2022'
b'Tue Mar 22 19:28:03 UTC 2022'
b'Tue Mar 22 19:28:04 UTC 2022'
b'Tue Mar 22 19:28:05 UTC 2022'
b'Tue Mar 22 19:28:06 UTC 2022'
b'Tue Mar 22 19:28:07 UTC 2022'
b'Tue Mar 22 19:28:08 UTC 2022'
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 8
WARNING: kubernetes-wait-job: Pod is not ready, status: 500
INFO: kubernetes-wait-job: waiting for log
WARNING: kubernetes-wait-job: Pod is not ready, status: 500
INFO: kubernetes-wait-job: waiting for log
WARNING: kubernetes-wait-job: Pod is not ready, status: 404
INFO: kubernetes-wait-job: waiting for log
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-kjr6k
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 9
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-kjr6k
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 10
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-kjr6k
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 11
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-kjr6k
INFO: kubernetes-wait-job: Waiting for job completion
WARNING: kubernetes-wait-job: An error occurred - retries: 12
WARNING: kubernetes-wait-job: Pod is not ready, status: 400
INFO: kubernetes-wait-job: waiting for log
INFO: kubernetes-wait-job: Fetching logs from pod: test-env-kjr6k
b'Tue Mar 22 19:32:00 UTC 2022'
b'Tue Mar 22 19:32:01 UTC 2022'
b'Tue Mar 22 19:32:02 UTC 2022'
b'Tue Mar 22 19:32:03 UTC 2022'
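For reference, the reorganized loop roughly follows the structure below. This is only a sketch of the approach, not the actual patch; the retry limit, the poll interval, and the current_pod_name helper are illustrative. The key change is that the pod name is resolved again on every attempt, so a 400 (pod not ready) or a 404 (pod replaced) only delays log fetching instead of disabling it for the rest of the execution.

import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

name = "test-env"
namespace = "rundeck"
config.load_kube_config()
core_v1 = client.CoreV1Api()
batch_v1 = client.BatchV1Api()

def current_pod_name():
    # re-resolve the pod on every attempt, because k8s may have replaced it
    pods = core_v1.list_namespaced_pod(namespace, label_selector="job-name==" + name)
    return pods.items[0].metadata.name if pods.items else None

retries = 0
max_retries = 60
while retries < max_retries:
    pod_name = current_pod_name()
    if pod_name:
        try:
            print("Fetching logs from pod:", pod_name)
            print(core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace))
        except ApiException as e:
            # 400: pod not ready yet, 404: pod already replaced; keep waiting
            print("Pod is not ready, status:", e.status)
    job = batch_v1.read_namespaced_job_status(name, namespace)
    if job.status.succeeded:
        print("Job completed")
        break
    if job.status.failed:
        print("Job failed")
        break
    print("Waiting for job completion")
    retries += 1
    time.sleep(10)
else:
    print("Number of retries exceeded")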
In the normal case, when the node is already running, Rundeck watches the job and runs it as before; the change only affects the cases described above.