senthilrch / kube-fledged

A kubernetes operator for creating and managing a cache of container images directly on the cluster worker nodes, so application pods start almost instantly
Apache License 2.0

Error on job with multiple pods #193

Open ChevronTango opened 1 year ago

ChevronTango commented 1 year ago

Whilst experimenting with kube-fledged, we added and destroyed a number of nodes in our test cluster, and a handful of pods got stuck. kube-fledged logged an error reporting that a job had multiple pods in it, and then never ran again. Whether this is a fluke of our setup or not, I couldn't say.

I1125 08:32:14.739894 1 image_manager.go:472] Job my-cache-tld8k created (pull:- my-registry.com/my-image:latest --> ip-10-1-7-224, runtime: containerd://1.5.8)
E1125 08:37:14.777438 1 image_manager.go:241] More than one pod matched job my-cache-tld8k
E1125 08:37:14.778075  1 image_manager.go:324] Error from updatePendingImageWorkResults(): more than one pod matched job my-cache-tld8k

Those are the last logs the controller ever prints out.

Does the controller have a liveness check that could detect this kind of crash and restart? Can the controller also handle jobs that have multiple pods, some stuck in terminating and others stuck waiting for a node that has been destroyed? The latter is quite likely to occur in environments with frequent scale-up and scale-down. Would the controller be able to clean up all the jobs on a restart, or would there be jobs left in the cluster forever?

Very much liking the app, so keen to help improve it for some of the above scenarios.

senthilrch commented 1 year ago

@ChevronTango : Thank you for reporting this issue.

The controller works in this fashion: there's a master routine and an image manager routine. The two communicate through work queues: the master places image pull/delete requests in one queue, and the image manager places image pull/delete responses in another.
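For anyone following along, here is a minimal, self-contained sketch of that request/response pattern. It uses plain Go channels in place of the client-go work queues the controller actually uses, and the type and field names (`workRequest`, `workResponse`, etc.) are illustrative only:

```go
// Illustrative sketch of the master <-> image manager request/response flow.
// This is NOT kube-fledged code; channels stand in for its work queues.
package main

import "fmt"

type workType string

const (
	imagePull   workType = "pull"
	imageDelete workType = "delete"
)

// workRequest is what the master routine enqueues for the image manager.
type workRequest struct {
	Type  workType
	Image string
	Node  string
}

// workResponse is what the image manager reports back to the master.
type workResponse struct {
	Request workRequest
	Err     error
}

func main() {
	requests := make(chan workRequest)   // master -> image manager
	responses := make(chan workResponse) // image manager -> master

	// Image manager routine: processes each request and always reports back.
	go func() {
		for req := range requests {
			// ... create a Job for the pull/delete, watch its pod, etc. ...
			responses <- workResponse{Request: req}
		}
		close(responses)
	}()

	// Master routine: enqueues work, then waits for the matching responses.
	go func() {
		requests <- workRequest{Type: imagePull, Image: "my-registry.com/my-image:latest", Node: "ip-10-1-7-224"}
		close(requests)
	}()

	for resp := range responses {
		fmt.Printf("result for %s on %s: err=%v\n", resp.Request.Image, resp.Request.Node, resp.Err)
	}
}
```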

When the image manager encounters a situation where a Job happens to have multiple pods, it treats it as an error and stops further processing without sending back a response; hence the controller gets stuck waiting for the response from the image manager. I'll modify the logic in the image manager to log this error, continue processing, and finally send a response back.
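Roughly, the change would look something like the sketch below. The names (`resultForJob`, `pullResult`) are hypothetical, not the real kube-fledged code; the point is that a Job matching multiple pods is logged and reported back as a failed result instead of aborting processing, so the master routine is never left waiting:

```go
// Hedged sketch of the proposed fix; names are illustrative stand-ins.
package main

import (
	"errors"
	"fmt"
	"log"
)

// pullResult is the response the image manager sends back for one Job.
type pullResult struct {
	Job string
	Err error
}

func resultForJob(jobName string, podNames []string) pullResult {
	if len(podNames) > 1 {
		// Old behaviour: return early with an error and never send a result,
		// leaving the master routine blocked. New behaviour: log, mark the
		// work item as failed, and keep going.
		log.Printf("More than one pod matched job %s; marking as failed and continuing", jobName)
		return pullResult{Job: jobName, Err: errors.New("more than one pod matched job " + jobName)}
	}
	// ... inspect the single pod's status and derive success/failure ...
	return pullResult{Job: jobName}
}

func main() {
	fmt.Println(resultForJob("my-cache-tld8k", []string{"pod-a", "pod-b"}))
}
```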

Yes, when the controller restarts, any dangling image pull/delete jobs are deleted. Also, when it sees an ImageCache in processing status, that status is reset as well. So overall the controller is fairly resilient in this aspect.
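For reference, a hedged sketch of that startup cleanup using client-go is below; the `app=kubefledged` label selector, the namespace handling, and the function name are assumptions for illustration, not the controller's actual code:

```go
// Hedged sketch: delete Jobs left behind by a previous controller instance.
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// deleteDanglingJobs removes any image pull/delete Jobs left behind by a
// previous controller instance, so a restart begins from a clean slate.
func deleteDanglingJobs(ctx context.Context, client kubernetes.Interface, namespace string) error {
	jobs, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=kubefledged", // assumed label; adjust to the real one
	})
	if err != nil {
		return err
	}
	propagation := metav1.DeletePropagationBackground // also remove the Jobs' pods
	for _, job := range jobs.Items {
		if err := client.BatchV1().Jobs(namespace).Delete(ctx, job.Name, metav1.DeleteOptions{
			PropagationPolicy: &propagation,
		}); err != nil {
			klog.Errorf("failed to delete dangling job %s: %v", job.Name, err)
			continue
		}
		klog.Infof("deleted dangling job %s", job.Name)
	}
	return nil
}
```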

You made a good point about the liveness/readiness probe. Adding one would improve the robustness and observability of kube-fledged.
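As a starting point, the probe could be backed by a simple HTTP health endpoint in the controller; the path, port, and wiring below are assumptions for illustration, not current kube-fledged behaviour. A Deployment could then point a livenessProbe at this endpoint so the kubelet restarts a wedged controller:

```go
// Minimal sketch of a liveness endpoint; path and port are assumptions.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// A real check might verify that the master and image manager
		// routines are still draining their work queues.
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})
	// Serve on a dedicated port so the probe does not interfere with metrics.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```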