kubernetes-client / python

Official Python client library for kubernetes
http://kubernetes.io/
Apache License 2.0

Watch stream missing job completion events #2238

Open headyj opened 1 month ago

headyj commented 1 month ago

What happened (please include outputs or screenshots):

Sometimes the watch stream seems to miss job completion events. This is not easy to reproduce, as two consecutive runs of the same code can give different results.

Here is the code, which is watching a job status and printing the logs:

w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
  if event['object'].metadata.name == jobName:
    logging.info(event)
    if event['type'] == "ADDED":
      logging.info("job %s created, waiting for pod to be running...", jobName)
    if event["object"].status.ready:
      pods = coreV1.list_namespaced_pod(namespace=namespace,label_selector="job-name={}".format(jobName))
      logging.info("pod %s is ready", pods.items[0].metadata.name)
      for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
        print(line.decode(),end = '')
    if event["object"].status.succeeded:
      logging.info("Finished pod stream.")
      w.stop()
    if not event["object"].status.active and event["object"].status.failed:
      w.stop()
      logging.error("Job Failed")
      sys.exit(1)
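
The status checks in the loop above could be pulled into a small helper so that they can be exercised without a cluster. This is only a sketch: `job_outcome` is a hypothetical name, and it assumes the job object carries `status.conditions` entries with `type` and `status` fields (as `V1JobCondition` does), falling back to the `succeeded`, `failed` and `active` counters already used above.

```python
from types import SimpleNamespace


def job_outcome(job):
    """Return "succeeded", "failed" or "running" for a Job-like object.

    Hypothetical helper: prefers the terminal conditions ("Complete" /
    "Failed") and falls back to the counters used in the watch loop.
    """
    status = job.status
    # conditions may be None (or absent) on a freshly created job
    for cond in getattr(status, "conditions", None) or []:
        if cond.status == "True":
            if cond.type == "Complete":
                return "succeeded"
            if cond.type == "Failed":
                return "failed"
    if getattr(status, "succeeded", None):
        return "succeeded"
    if getattr(status, "failed", None) and not getattr(status, "active", None):
        return "failed"
    return "running"


# Stand-in objects for illustration only; real events carry V1Job objects.
fresh = SimpleNamespace(status=SimpleNamespace(
    active=None, succeeded=None, failed=None, conditions=None))
done = SimpleNamespace(status=SimpleNamespace(
    active=None, succeeded=1, failed=None,
    conditions=[SimpleNamespace(type="Complete", status="True")]))

print(job_outcome(fresh))
print(job_outcome(done))
```

Inside the watch loop, `job_outcome(event["object"])` would replace the separate `succeeded` and `failed` checks. It does not change which events the stream delivers.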

Sometimes the script never ends, even though the watched job has completed. The script itself runs in the same Kubernetes cluster, but in a different namespace. I tried several values for timeout_seconds, but it doesn't help. The last event received is the one where the job becomes active:

[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]

The completion is correctly recorded on the Kubernetes side, as seen in k9s:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  35m   job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m   job-controller  Job completed
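
Since the Job does reach completion on the server side, one possible workaround (not a fix for the watch itself) would be to fall back to reading the job directly whenever the watch goes quiet. The following is a minimal sketch under that assumption. `wait_for_job`, `read_job` and `is_finished` are hypothetical names, and the reader is passed in as a callable so the loop can be tried without a cluster.

```python
import time


def wait_for_job(read_job, is_finished, timeout=600.0, interval=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll read_job() until is_finished(job) is true.

    Returns the last job object read, or raises TimeoutError once
    `timeout` seconds have elapsed without the job finishing.
    """
    deadline = clock() + timeout
    while True:
        job = read_job()
        if is_finished(job):
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval)


# Assumed usage with the client objects from the script above:
#   job = wait_for_job(
#       lambda: batchV1.read_namespaced_job_status(name=jobName, namespace=namespace),
#       lambda j: bool(j.status.succeeded or j.status.failed),
#   )
```

Polling is less efficient than a watch, so this would only make sense as a safety net next to the watch loop, for example after a finite timeout_seconds has expired.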

What you expected to happen:

The job completion event should be caught and delivered by the watch stream.

How to reproduce it (as minimally and precisely as possible):

Run the above code in the python:3.12-slim Docker image. As noted above, the problem seems to be sporadic. I haven't been able to reproduce it any other way yet, but I will update this ticket if I do.

Anything else we need to know?:

Environment:

yliaog commented 1 month ago

Thanks for reporting the issue. Please update the ticket when you can reproduce it reliably.