insitro / redun

Yet another redundant workflow engine
https://insitro.github.io/redun/
Apache License 2.0

Job array causes error message on older versions of k8s #65

Open dakoner opened 1 year ago

dakoner commented 1 year ago

The k8s executor depends on a feature added in k8s v1.24: https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/

When I run a job with the default settings (where max_array_size > 1) on my EKS cluster, which is running v1.21, I see these errors (or possibly warnings):

[redun] Executor[k8s]: Pod redun-job-d64219e107664faab6f1223c52909c0a-array-888pz is missing job-completion-index: {'kubernetes.io/psp': 'rafay-privileged-psp'}

The k8s jobs all end up in the Error state, and the workflow never finishes because it keeps hitting that error.

We already have code that should detect versions older than v1.21: https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L418. However, I think these code paths still execute: https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L478 and https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L771
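
For illustration, here is a minimal sketch of a version gate that the array code paths could consult before submitting an indexed job. It is not redun's actual code: `parse_k8s_version` and `supports_indexed_jobs` are hypothetical helper names, the version string format ("v1.21.14-eks-...") is assumed to match what the cluster reports, and the default threshold of (1, 24) is taken from the version quoted at the top of this issue and left as a parameter.

```python
import re
from typing import Tuple


def parse_k8s_version(git_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as 'v1.21.14-eks-0389ca3'.

    Hypothetical helper; the input format is an assumption.
    """
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_indexed_jobs(
    git_version: str, minimum: Tuple[int, int] = (1, 24)
) -> bool:
    """Return True if the cluster version is at least `minimum`.

    The default threshold is an assumption based on the version named in
    this issue; adjust it to whatever the executor actually requires.
    """
    return parse_k8s_version(git_version) >= minimum
```

If every array-related code path checked a gate like this, an older cluster would fall back to single jobs instead of submitting an array job it cannot track.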

To repro, I think you could use minikube to install v1.23 or earlier and then run redun in it. To fix, I think you could remove the warning at https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L771 and properly handle tasks that are missing that field.
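
A rough sketch of what "properly handle" could look like: a lookup that returns None when the pod has no completion index, so the caller can decide what to do instead of only logging a warning. This is not the executor's actual code. `get_completion_index` is a hypothetical helper, and the annotation key `batch.kubernetes.io/job-completion-index` is assumed to be the one that message refers to.

```python
from typing import Mapping, Optional

# Assumed annotation key for the completion index on pods of an indexed Job.
COMPLETION_INDEX_ANNOTATION = "batch.kubernetes.io/job-completion-index"


def get_completion_index(
    annotations: Optional[Mapping[str, str]],
) -> Optional[int]:
    """Return the pod's completion index, or None if it is absent or invalid.

    Hypothetical helper: callers must handle None explicitly, for example by
    failing the array job with a clear message or resubmitting as single jobs.
    """
    if not annotations:
        return None
    raw = annotations.get(COMPLETION_INDEX_ANNOTATION)
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


# Annotations copied from the log message in this issue: no index is present.
pod_annotations = {"kubernetes.io/psp": "rafay-privileged-psp"}
index = get_completion_index(pod_annotations)
```

With the annotations from the log above, `index` is None, which gives the executor a definite signal to act on rather than leaving the workflow waiting.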