Retain pods for failed jobs a longer period

gocd / kubernetes-elastic-agents

Kubernetes Elastic agent plugin for GoCD

https://www.gocd.org

Apache License 2.0


Retain pods for failed jobs a longer period #102

Open Evesy opened 5 years ago

Evesy commented 5 years ago

2.0.0 introduced terminating a pod immediately upon job completion, which has proved useful when running many different pipelines with elastic agents: nodes' resources are freed up more quickly, which reduces the chance of the cluster having to autoscale.

This has introduced some extra difficulty when troubleshooting failed jobs, though: since the pods are cleaned up immediately, there is nothing left to debug. It could be useful to set an alternate grace period for pods whose assigned jobs have failed, to give people the option to look around the pod.

arvindsv commented 5 years ago

That would still mean changing the elastic agent profile. I wonder if it can be made more dynamic. I'm imagining some kind of metadata which would tell it to not terminate immediately.

Related code is here. Of course, with this, the plugin would need to store the fact that the pod is to be retained for now and cleaned up later.

ckaushik commented 5 years ago

This can be achieved with terminationGracePeriodSeconds or a preStop hook in Kubernetes. We can probably sleep for a variable amount of time in the preStop hook. People can then choose to keep agents for a shorter or longer time based on the config. @arvindsv wdyt?

arvindsv commented 5 years ago

@ckaushik My concern is: the GoCD server will send only one event about job completion. If we miss that and don't terminate the pod, without keeping track of the fact that the event was sent, then it'll leave pods behind. So, either we keep track of the event and delay the time at which the plugin terminates the pod, or (maybe this is what you're suggesting) we somehow use the terminationGracePeriodSeconds option and let k8s delay the termination.

Anyway, however it is implemented, my concern is that we should terminate them eventually.

varshavaradarajan commented 5 years ago

Can't we pass along the job status to https://github.com/gocd/kubernetes-elastic-agents/blob/master/src/main/java/cd/go/contrib/elasticagent/requests/JobCompletionRequest.java#L29 at the time of termination, so that it does not immediately terminate the pod that completed the job? The elastic profile can have a property (pod_retention_time?) that specifies how long the pod remains after job completion.
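As a rough illustration of the profile property idea, here is a minimal sketch of reading such a setting. Everything in it is an assumption: the `RetentionSetting` class is hypothetical, the key `pod_retention_time` is only the name floated in this comment, and the unit (whole minutes) is a guess. It does not reflect how the plugin actually parses elastic profiles.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical helper, not part of the plugin. */
final class RetentionSetting {
    // Property name suggested in the thread; assumed here to hold whole minutes.
    static final String KEY = "pod_retention_time";

    /** Returns the configured retention, or Duration.ZERO (terminate immediately) if unset or invalid. */
    static Duration fromProfile(Map<String, String> profileProperties) {
        String raw = profileProperties.get(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return Duration.ZERO;
        }
        try {
            long minutes = Long.parseLong(raw.trim());
            return minutes > 0 ? Duration.ofMinutes(minutes) : Duration.ZERO;
        } catch (NumberFormatException e) {
            // Fall back to the current behaviour rather than keeping pods around by accident.
            return Duration.ZERO;
        }
    }
}
```

Defaulting to zero on missing or malformed input keeps the existing terminate-on-completion behaviour unless someone opts in explicitly.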

arvindsv commented 5 years ago

Yes. We would need to store that information, as you said, and make sure that the pod is terminated later.
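One way to picture "store that information and terminate later" is sketched below. This is only an illustration under stated assumptions: `PodRetentionTracker`, `onJobCompleted` and `drainExpired` are hypothetical names, the state is kept in memory, and the caller is assumed to invoke `drainExpired` from some periodic task and delete the returned pods itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not part of the plugin. */
final class PodRetentionTracker {
    private final Duration failedJobRetention;
    // Pod name -> instant after which the pod must be terminated.
    private final Map<String, Instant> terminateAfter = new ConcurrentHashMap<>();

    PodRetentionTracker(Duration failedJobRetention) {
        this.failedJobRetention = failedJobRetention;
    }

    /**
     * Records the single job-completion event.
     * Returns true if the pod should be terminated right away, false if it is retained.
     */
    boolean onJobCompleted(String podName, boolean jobFailed, Instant now) {
        if (!jobFailed || failedJobRetention.isZero() || failedJobRetention.isNegative()) {
            return true;
        }
        terminateAfter.put(podName, now.plus(failedJobRetention));
        return false;
    }

    /** Returns, and forgets, every retained pod whose deadline has passed. */
    List<String> drainExpired(Instant now) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Instant>> it = terminateAfter.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> entry = it.next();
            if (!entry.getValue().isAfter(now)) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

An in-memory map like this would be lost if the plugin or server restarts, which is exactly the "leave behind pods" risk raised earlier, so a real implementation would presumably need to persist the deadline somewhere that survives a restart.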