remove `Completed` pods

tiborsimko commented 5 years ago

Seeing `Completed` pods again, as in https://github.com/reanahub/reana-job-controller/issues/101. This is good for debugging, but we should remember to clean these up later.

$ kubectl get pods | grep ^batch
batch-serial-30b5b40a-f0de-45b6-9735-5d4418d87693-c4rzj   0/1       Completed   0          14m
batch-serial-3a770cfd-5fde-4e28-87d6-b98d10fdff55-8fh7v   0/1       Error       0          44m
batch-serial-61eab2c5-e4bf-4d56-8da6-95e306ba8f48-njjjf   0/1       Completed   0          29m
batch-serial-8ec42d82-f96e-4bf9-814f-eb8f0f259f79-54zpk   0/1       Completed   0          46m
batch-serial-d400e8ea-28f0-4bf0-926a-c4a5c2a337f6-wf8ms   0/1       Completed   0          15m
batch-serial-dacdf016-237d-405c-8fc0-622e7f076473-tw4qt   0/1       Completed   0          16m
batch-serial-e33b0a32-654a-42e7-ae58-59f6ca130718-2bvvp   0/1       Error       0          37m
batch-serial-ea616367-b0a2-492e-a9a8-7ae29a8af213-hspt8   0/1       Error       0          41m
batch-serial-efe722bf-8f9e-4f9a-8b45-2856a3d08af3-tr7r2   0/1       Completed   0          17m
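A minimal sketch of how the `Completed` entries in a listing like the one above could be picked out for later cleanup. The helper name `completed_batch_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout shown here (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of batch pods whose STATUS column is "Completed".
completed_batch_pods() {
    awk '$1 ~ /^batch/ && $3 == "Completed" { print $1 }'
}

# Assumed usage against a live cluster (needs kubectl and cluster access):
# kubectl get pods | completed_batch_pods
```

The printed names could then be passed to `kubectl delete pod`, ideally only once the pod logs have been saved somewhere else.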

(stemmed from https://github.com/reanahub/reana-workflow-engine-serial/pull/56#issuecomment-449398979)

Note that killing these pods also removes their logs, so we first need to store and expose the logs from workflow pods (and job pods) in an accessible place...

tiborsimko commented 5 years ago

See also:

Once we have these, we should be able to remove Completed workflow run pods as in https://github.com/reanahub/reana-job-controller/issues/101.

roksys commented 5 years ago

What about using https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#ttl-mechanism-for-finished-jobs?

To use this feature, we would need to upgrade Kubernetes to v1.12.

tiborsimko commented 5 years ago

The CERN cluster has been on Kubernetes v1.12 since December 2018, so upgrading should indeed be possible...

tiborsimko commented 5 years ago

While we are at it, it would also be good to check whether restartPolicy is Never for workflow run jobs. A few days ago on my box, when I wanted to fully stop the cluster components and kill any running workflows, I saw some workflow runtime pods surviving and restarting themselves...
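One way to run such a check, as a rough sketch: list each job with the restartPolicy of its pod template and flag anything that is not Never. The helper name `jobs_not_never` is hypothetical; for Jobs, Kubernetes only accepts Never or OnFailure, so in practice this would surface OnFailure jobs.

```shell
# Hypothetical helper: read "<job-name> <restartPolicy>" pairs on stdin and
# print the jobs whose pod template does not use restartPolicy Never.
jobs_not_never() {
    awk 'NF >= 2 && $2 != "Never" { print $1 }'
}

# Assumed usage against a live cluster (needs kubectl and cluster access):
# kubectl get jobs -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.restartPolicy}{"\n"}{end}' | jobs_not_never
```

Empty output would mean every listed job uses Never.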

roksys commented 5 years ago

ttl_seconds_after_finished=200 works well, but since this feature is still in alpha, it has to be enabled manually by adding the --feature-gates flag:

minikube start --kubernetes-version="v1.12.0" --vm-driver=hyperkit --feature-gates="TTLAfterFinished=true"

I can confirm restartPolicy is set to Never and seems to work fine on my local cluster.
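For reference, a minimal sketch of a Job manifest that combines both settings discussed here. It assumes the `TTLAfterFinished` feature gate is enabled on the cluster. The helper name `render_ttl_job`, the job name, the image and the command are placeholders for illustration, not what REANA actually submits. In the manifest the field is spelled `ttlSecondsAfterFinished`; `ttl_seconds_after_finished` is the Python client's name for it.

```shell
# Hypothetical helper: print a Job manifest asking Kubernetes to delete the
# Job (and its pods) a given number of seconds after it finishes.
# Job name, image and command are placeholders, not REANA's actual values.
render_ttl_job() (
    name="${1:-batch-serial-example}"
    ttl="${2:-200}"
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${name}
spec:
  ttlSecondsAfterFinished: ${ttl}
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: workflow
        image: busybox
        command: ["sh", "-c", "echo done"]
EOF
)

# Assumed usage against a live cluster (needs kubectl and cluster access):
# render_ttl_job my-test-job 200 | kubectl apply -f -
```

With the feature gate enabled, the finished job and its pod should disappear roughly 200 seconds after completion, so any logs need to be collected before then.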

tiborsimko commented 5 years ago

Adding a single required flag is OK. If we need more flags and the command line gets too long, we can provide a helper wrapper like:

$ reana-dev minikube-start
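A rough sketch of what such a wrapper might run under the hood. This is not the actual `reana-dev` implementation; the function name `minikube_start_command` and its defaults are assumptions based on the command quoted earlier in this thread. The platform-specific `--vm-driver` option is left out on purpose.

```shell
# Hypothetical sketch: build the minikube command line with the flags needed
# for the TTL feature. It only prints the command so it can be inspected.
minikube_start_command() (
    version="${1:-v1.12.0}"
    feature_gates="${2:-TTLAfterFinished=true}"
    printf 'minikube start --kubernetes-version="%s" --feature-gates="%s"\n' \
        "$version" "$feature_gates"
)

# Assumed usage (needs minikube installed):
# eval "$(minikube_start_command)"
```

Printing the command rather than running it directly keeps the helper easy to check before it touches a local cluster.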
