senthilrch / kube-fledged

A Kubernetes operator for creating and managing a cache of container images directly on the cluster worker nodes, so application pods start almost instantly
Apache License 2.0

Reconciliation process with many images is slow #177

Open mbaynton opened 2 years ago

mbaynton commented 2 years ago

Our use case involves having nearly two thousand distinct images on each kubelet, and different images on different kubelets. We are evaluating kube-fledged as a component of how we can manage our image collections at this scale.

One finding that is not already covered by other issues: when we edit an existing ImageCache CRD of this size, reconciling the desired images with the images actually present takes a few minutes, even if the change only added or removed a single image.

This appears to be attributable to this block, which adds every image in the modified CRD to a rate-limited work queue. Whether an image is already present is only determined later, inside the queue consumer. Computing the diff between the image list in the updated CRD and the image list in the node status once, up front, before pushing to the work queue, might improve responsiveness.
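To make the proposal concrete, here is a minimal sketch of the upfront diff in Go. It is not taken from the kube-fledged code base: the function name `diffImages` and the idea that the status holds a flat list of images previously cached by this ImageCache are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// diffImages is a hypothetical helper (not part of kube-fledged).
// desired: image list from the updated ImageCache spec.
// cached:  images the status records as already cached on the node (assumed shape).
// It returns only the images that need a pull or a delete.
func diffImages(desired, cached []string) (toPull, toDelete []string) {
	desiredSet := make(map[string]struct{}, len(desired))
	for _, img := range desired {
		desiredSet[img] = struct{}{}
	}
	cachedSet := make(map[string]struct{}, len(cached))
	for _, img := range cached {
		cachedSet[img] = struct{}{}
	}
	// Desired but not yet cached: needs a pull.
	for img := range desiredSet {
		if _, ok := cachedSet[img]; !ok {
			toPull = append(toPull, img)
		}
	}
	// Cached but no longer desired: needs a delete.
	for img := range cachedSet {
		if _, ok := desiredSet[img]; !ok {
			toDelete = append(toDelete, img)
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(toPull)
	sort.Strings(toDelete)
	return toPull, toDelete
}

func main() {
	desired := []string{"registry.example/app:v2", "registry.example/base:v1"}
	cached := []string{"registry.example/app:v1", "registry.example/base:v1"}
	toPull, toDelete := diffImages(desired, cached)
	fmt.Println("pull:", toPull)
	fmt.Println("delete:", toDelete)
}
```

Only the entries in `toPull` and `toDelete` would then be pushed to the work queue. The set-based comparison is linear in the number of images, so it should stay cheap even with a couple of thousand images per node.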

We could be open to working on this issue so that kube-fledged better meets our particular use case, but we wanted to file this issue as a first step to see if there is interest in supporting ImageCaches of this size in principle, and if you foresee any difficulties with the proposal to reconcile the CRD with the node status data upfront before pushing to the work queue.

omar-rs commented 2 years ago

Here are some additional notes related to the issue above.

Setup:

Test 1: Remove images from an imagecache

Test 2: Append images to the end of the imagecache

Test 3: Add images at the top of the imagecache list

senthilrch commented 2 years ago

@mbaynton @omar-rs: Thanks for reporting this issue and for the in-depth analyses you performed with kube-fledged.

I am keen on improving the performance of kube-fledged to meet your particular use case. Modifying an existing imagecache is not fully optimised for performance: it is treated like reconciling a new imagecache, so you see ALL the image pulls (and deletes) getting queued to the image manager routine.

It makes perfect sense to queue only the image pulls (and deletes) that are required. I'll come up with a proposal for this.
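As a rough illustration of the effect on queue size, the Go sketch below plans work items per node and emits one only where a pull or delete is actually required. The `workItem` type, the `planWork` function, and the per-node maps are hypothetical names chosen for this example, not kube-fledged's actual types.

```go
package main

import (
	"fmt"
	"sort"
)

// workItem is a hypothetical queue entry: one action for one image on one node.
type workItem struct {
	Node   string
	Image  string
	Action string // "pull" or "delete"
}

func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// planWork compares, per node, the desired images with the images already
// cached (both maps keyed by node name, an assumed shape) and returns only
// the items that change something.
func planWork(desired, cached map[string][]string) []workItem {
	nodeSet := make(map[string]struct{})
	for n := range desired {
		nodeSet[n] = struct{}{}
	}
	for n := range cached {
		nodeSet[n] = struct{}{}
	}
	nodes := make([]string, 0, len(nodeSet))
	for n := range nodeSet {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // deterministic node order

	var items []workItem
	for _, n := range nodes {
		want, have := toSet(desired[n]), toSet(cached[n])
		for _, img := range desired[n] {
			if _, ok := have[img]; !ok {
				items = append(items, workItem{Node: n, Image: img, Action: "pull"})
			}
		}
		for _, img := range cached[n] {
			if _, ok := want[img]; !ok {
				items = append(items, workItem{Node: n, Image: img, Action: "delete"})
			}
		}
	}
	return items
}

func main() {
	// One node with 2000 cached images; the edit adds a single new image.
	var cachedImgs []string
	for i := 0; i < 2000; i++ {
		cachedImgs = append(cachedImgs, fmt.Sprintf("registry.example/img-%d:v1", i))
	}
	desiredImgs := append(append([]string{}, cachedImgs...), "registry.example/new:v1")

	items := planWork(
		map[string][]string{"node-1": desiredImgs},
		map[string][]string{"node-1": cachedImgs},
	)
	fmt.Printf("queued %d item(s) for %d desired images\n", len(items), len(desiredImgs))
}
```

In this scenario a single pull is queued for the new image, where queuing every image in the list would have put all 2001 entries through the rate-limited queue.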