lwolf / kube-cleanup-operator

Kubernetes Operator to automatically delete completed Jobs and their Pods
MIT License

K8s alpha feature TTLAfterFinished is a good alternative #39

Closed · silashansen closed this issue 4 years ago

silashansen commented 4 years ago

This is not really an issue, more of a helpful comment.

Thank you for a great project - it's been very helpful!

Issues: We've been using this controller for a while, and it's been great for the most part. We had issues with pods that would terminate for reasons other than normal completion, such as being OOMKilled. When that happens, the job schedules a replacement pod immediately, which is what we want. But because the cleanup-operator reacts to the pod going into the OOMKilled status, it would delete the pod and its parent job, which would leave the replacement pod without a reference to a job and therefore never a candidate for cleanup again. Agreed, pods shouldn't die from OOM issues often, but it happened to us a lot, so orphaned pods kept piling up.

The new solution: However, we recently tried the TTLAfterFinished alpha feature, and it works amazingly well; it has replaced our previous dependency on this controller.
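For anyone curious, the TTL is set per Job via `spec.ttlSecondsAfterFinished`. A minimal sketch (the job name, image, and 100-second TTL are just example values):

```sh
# Minimal sketch: a Job the TTL controller would delete ~100s after it finishes.
# Name, image, and TTL value are examples, not from our setup.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-ttl-job
spec:
  ttlSecondsAfterFinished: 100   # Job (and its Pods) removed 100s after completion
  template:
    spec:
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
EOF
```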

Who is it useful for? It's not usable for those running managed Kubernetes, since it requires enabling a feature gate on the cluster, but for those running their own clusters it's a really good solution.
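For self-managed clusters, enabling it looks roughly like this; the exact place to set the flags depends on how your control plane is deployed (e.g. kubeadm static pod manifests):

```sh
# Assumption: a self-managed control plane where you can edit component flags.
# The TTLAfterFinished gate is needed on both the API server and the controller manager.
kube-apiserver          ... --feature-gates=TTLAfterFinished=true
kube-controller-manager ... --feature-gates=TTLAfterFinished=true
```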

lwolf commented 4 years ago

Thanks for your comment.

> But because the cleanup-operator reacts to the pod going into the OOMKilled status, it would delete the pod and its parent job, which would leave the replacement pod without a reference to a job and therefore never a candidate for cleanup again.

I see two problems here.

  1. A bug in the operator: we shouldn't delete a job if it still has pods in a healthy state.
  2. Deleting the job should delete all the pods owned by it; that's the default k8s behaviour. It sounds like a problem on your side. Could you try deleting the job manually and see whether it deletes the pods? (There's a sketch of the command below the help excerpt.)
    kubectl delete --help
    ...
    --cascade=true: If true, cascade the deletion of the resources managed by this resource (e.g. Pods created by a
    ReplicationController).  Default true.
    ...
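Something along these lines should do it (the job name is just a placeholder):

```sh
# Cascading deletion is the default, so the job's pods should be removed with it.
# "example-ttl-job" is a placeholder name.
kubectl delete job example-ttl-job --cascade=true
```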

I hope to see TTLAfterFinished graduate to stable, but it has been in alpha for 5 versions now, and most companies prohibit the use of alpha features.