kubernetes-csi / external-provisioner

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint
Apache License 2.0

Goroutine Leak in Node Topology Informer due to missing informer queue cancellation #1151

Closed jakobmoellerdev closed 5 months ago

jakobmoellerdev commented 5 months ago

What happened:

When the Capacity Controller is created, its work queue gets shut down on context cancellation, so its processNextWorkItem loop exits correctly.

However, the Capacity Controller also expects a Node Topology Informer. That informer is started in the same way and uses the same processNextWorkItem queue logic, but its queue is never shut down on context cancellation.

As a result, this controller never shuts down correctly and never finishes processing its remaining work items, which can lead to hangups for any caller that actually waits for it.
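
For context, the worker pattern involved looks roughly like this (a simplified sketch of the usual k8s.io/client-go/util/workqueue worker loop, not the exact external-provisioner code). The worker blocks in queue.Get() until an item arrives or the queue is shut down, so a queue that is never shut down pins its worker goroutine forever:

    // Simplified sketch of the client-go workqueue worker pattern.
    func runWorker(ctx context.Context, queue workqueue.Interface) {
        for processNextWorkItem(ctx, queue) {
        }
    }

    func processNextWorkItem(ctx context.Context, queue workqueue.Interface) bool {
        // Get blocks until an item is available or ShutDown() has been called.
        // If ShutDown() is never tied to ctx, this call never returns after
        // context cancellation and the worker goroutine is leaked.
        item, shutdown := queue.Get()
        if shutdown {
            return false
        }
        defer queue.Done(item)
        // ... reconcile item ...
        return true
    }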

What you expected to happen:

The Node Topology informer queue should be shut down similarly to the Capacity Informer queue.
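
One possible shape for that fix, as a hypothetical sketch (the receiver, field, and method names here are assumptions; the real topology informer code may be structured differently): shut the queue down as soon as the context is cancelled, which unblocks queue.Get() and lets the worker loop return:

    // Hypothetical sketch: tie the informer's queue lifetime to the context.
    func (t *nodeTopology) RunWorker(ctx context.Context) {
        go func() {
            // Once the context is cancelled, shut the queue down so that any
            // goroutine blocked in queue.Get() is released and can exit.
            <-ctx.Done()
            t.queue.ShutDown()
        }()
        for t.processNextWorkItem(ctx) {
        }
    }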

How to reproduce it:

Instead of the current code:


    if *enableCapacity {
        go topologyInformer.RunWorker(ctx)
    }
    // ...
    if capacityController != nil {
        go capacityController.Run(ctx, int(*capacityThreads))
    }
    if csiClaimController != nil {
        go csiClaimController.Run(ctx, int(*finalizerThreads))
    }
    provisionController.Run(ctx)

Run the controllers under a sync.WaitGroup so that their shutdown can actually be awaited:

    var wg sync.WaitGroup

    if *enableCapacity {
        wg.Add(1)
        go func() {
            defer wg.Done()
            topologyInformer.RunWorker(ctx)
            logger.Info("topology informer finished shutdown")
        }()
    }
    // ...

    if capacityController != nil {
        wg.Add(1)
        go func() {
            defer wg.Done()
            capacityController.Run(ctx, int(*capacityThreads))
            logger.Info("capacity controller finished shutdown")
        }()
    }
    if csiClaimController != nil {
        wg.Add(1)
        go func() {
            defer wg.Done()
            csiClaimController.Run(ctx, int(*finalizerThreads))
            logger.Info("csi claim controller finished shutdown")
        }()
    }

    wg.Add(1)
    go func() {
        defer wg.Done()
        provisionController.Run(ctx)
        logger.Info("provisioner controller finished shutdown")
    }()

    wg.Wait()

    logger.Info("all controllers finished shutdown")

instead of running the controllers as in the current code. With the new code it is verified that the controllers actually exit and that there is no goroutine leak once the queue is stopped. However, stopping the queue has to be driven from within the given capacity controller, and that is currently not done.

Anything else we need to know?:

Given this finding, the entire provisioner should be checked for goroutine leaks, since this kind of issue is usually easy to spot. I only noticed it because I started the controllers manually in a different, embedded application and saw that the controller did not shut down properly.
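
One way to check for this in unit tests (just a suggestion, using go.uber.org/goleak; the package name below is a placeholder and this is not necessarily how this repo would do it):

    package capacity_test

    import (
        "testing"

        "go.uber.org/goleak"
    )

    // TestMain fails the package's tests if any goroutine started during the
    // tests is still running when they finish.
    func TestMain(m *testing.M) {
        goleak.VerifyTestMain(m)
    }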

Environment:

pohly commented 5 months ago

Thanks for the detailed analysis. I haven't double-checked the reasoning, but it seems plausible.

However, is this a problem in practice? In other words, how important is fixing this?

Once started, the external-provisioner only shuts down by exiting. Leaking goroutines only affects unit testing, doesn't it?

jakobmoellerdev commented 5 months ago

AFAIK all reconciliation queues that are currently started via goroutine are simply dropped in flight, which can lead to broken reconciles if I understand correctly. That is because while processNextWorkItem is running in the Node Topology, CSI Claim or Capacity Controller, the process might terminate in the middle of a reconciliation because of the goroutine leak.

So even though my case with the Node Topology informer is the least affected, we should at the very least introduce the wait group to make sure the goroutines of the other reconciliations are not interrupted.
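
To make that concrete, here is a minimal, self-contained sketch of how the wait group would combine with signal handling so that an in-flight reconciliation can finish before the process exits (runController is a stand-in for the real controllers, not actual external-provisioner code):

    package main

    import (
        "context"
        "log"
        "os/signal"
        "sync"
        "syscall"
    )

    func main() {
        // Cancel the context on SIGTERM/SIGINT instead of letting the process
        // be torn down while reconciliations are still running.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
        defer stop()

        var wg sync.WaitGroup

        // Each controller would be started like this, as in the proposal above.
        wg.Add(1)
        go func() {
            defer wg.Done()
            runController(ctx) // stand-in for e.g. capacityController.Run(ctx, ...)
        }()

        wg.Wait() // the controllers return once ctx is cancelled by a signal
        log.Println("all controllers finished shutdown")
    }

    // runController stands in for a real controller loop; it blocks until the
    // context is cancelled.
    func runController(ctx context.Context) {
        <-ctx.Done()
    }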

jakobmoellerdev commented 5 months ago

One more affected issue may be that the controller exposed via pkg is not really usable on its own, because it is not terminated correctly and relies on process exit to stop. That is my (embedded and admittedly very strange) use case and how I spotted this. We want to reduce memory consumption on the edge and embed the provisioner directly.

jakobmoellerdev commented 5 months ago

@pohly I can prepare a PR for the fix, would that be desired?

pohly commented 5 months ago

That is because while processNextWorkItem is running in the Node Topology, CSI Claim or Capacity Controller, the process might terminate in the middle of a reconciliation because of the goroutine leak.

But it's not really leaking, is it? A real leak is when a process constantly starts new goroutines, thus increasing the total number over time until resources are exhausted. That's not the case here?

One more affected issue may be that the controller exposed via pkg is not really usable on its own, because it is not terminated correctly and relies on process exit to stop.

That's a different story. When used like that, proper shutdown may be desirable, depending on the caller.

I can prepare a PR for the fix, would that be desired?

Embedding the code wasn't intended, but a PR would still be useful I suppose, so yes.

jakobmoellerdev commented 5 months ago

But it's not really leaking, is it? A real leak is when a process constantly starts new goroutines, thus increasing the total number over time until resources are exhausted. That's not the case here?

A goroutine leak in Go refers to a situation in which a program starts goroutines without proper oversight or orderly termination. That is happening here, even though the leak does not exhaust CPU or memory in this case.
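
For illustration, a minimal self-contained example of such a leak (nothing to do with this repo, just the general pattern): the goroutine below can never terminate because nothing ever unblocks it, which is exactly the situation of a worker stuck in queue.Get() on a queue that is never shut down:

    package main

    import "time"

    func main() {
        ch := make(chan struct{})

        // This goroutine blocks on ch forever. Nothing ever sends on or closes
        // ch, so the goroutine can never return in an orderly way: it is leaked.
        go func() {
            <-ch
        }()

        // The goroutine only disappears because main returns and the whole
        // process is torn down, not because it terminated cleanly.
        time.Sleep(100 * time.Millisecond)
    }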

That's a different story. When used like that, proper shutdown may be desirable, depending on the caller.

Agreed, I will prepare the PR for that.

Embedding the code wasn't intended, but a PR would still be useful I suppose, so yes.

Just my 2 cents: if this was never intended, then we should consider moving the controllers to internal instead of pkg. Additionally, we should think about whether this use case makes sense overall, because the memory overhead of the CSI sidecars is significant enough for us to want to embed them.

pohly commented 5 months ago

A goroutine leak in Go refers to a situation in which a program starts goroutines without proper oversight or orderly termination.

A lot of Kubernetes APIs have this problem. I don't like it either, but in practice it doesn't cause problems and thus doesn't get fixed.

If this was never intended, then we should consider moving the controllers to internal instead of pkg.

It's neither encouraged nor discouraged...

Additionally, we should think about whether this use case makes sense overall, because the memory overhead of the CSI sidecars is significant enough for us to want to embed them.

This has come up before and there have been discussions around releasing a single sidecar, it just hasn't been done because it would be quite a bit of work to create a unified sidecar, then change the release and deployment process.

jakobmoellerdev commented 5 months ago

Ack to all of the above.

This has come up before and there have been discussions around releasing a single sidecar, it just hasn't been done because it would be quite a bit of work to create a unified sidecar, then change the release and deployment process.

Would this discussion be handled in SIG Storage? I would like to at least try initiating something in this direction again.

pohly commented 5 months ago

Yes, ask on #sig-storage Slack. There was a proposal in 2023, but I can't find it right now.

jakobmoellerdev commented 5 months ago

Closing as per discussion