kedacore / keda

KEDA is a Kubernetes-based Event-Driven Autoscaling component. It provides event-driven scaling for any container running in Kubernetes.
https://keda.sh
Apache License 2.0

keda-operator v2.13.0 leaks goroutines #5448

Closed FrancoisPoinsot closed 9 months ago

FrancoisPoinsot commented 9 months ago

Report

After upgrading KEDA to 2.13.0 there seems to be a memory leak. Looking at the go_goroutines metric, I see the number growing indefinitely; confirmed at least above 30k. Here is a graph for go_goroutines: [image: go_goroutines over time]

I have deployed KEDA with different cloud vendors, but I only see this issue in GCP. It might be related to the pubsub scalers, which I use only in GCP clusters.

Expected Behavior

Memory/goroutine count remains roughly constant. You can see very clearly on the graph above when the upgrade to v2.13.0 happened.

Actual Behavior

Memory and goroutine count increase indefinitely.

Steps to Reproduce the Problem

No response

Logs from KEDA operator

No response

KEDA Version

2.13.0

Kubernetes Version

1.26

Platform

Google Cloud

Scaler Details

prometheus, gcp-pubsub

Anything else?

No response

zroubalik commented 9 months ago

Thanks for reporting. Could you please check whether you also see the leak when you use only the Prometheus scaler on GCP? That will help us narrow down the possible causes. Thanks!

FrancoisPoinsot commented 9 months ago

With only Prometheus scalers, the goroutine count is stable at 178.

FrancoisPoinsot commented 9 months ago

Here are the goroutines from pprof.

goroutines.txt
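
For anyone reproducing this: a dump like the attached goroutines.txt can be pulled from Go's net/http/pprof endpoint on the operator, assuming profiling is enabled and the port is forwarded locally (the port 8082 below is an assumption; check your deployment). A minimal sketch:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full stack trace of every live goroutine as text,
	// which is what makes leaked goroutines easy to group and count.
	resp, err := http.Get("http://localhost:8082/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	io.Copy(out, resp.Body) // save the dump for offline inspection
}
```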

zroubalik commented 9 months ago

Cool, thanks for the confirmation. And to clarify, this doesn't happen with versions < 2.13.0? If it is a regression, then we should be able to track down the changes in the GCP Pub/Sub scaler.

FrancoisPoinsot commented 9 months ago

I confirm this does not happen in v2.12.1.

JorTurFer commented 9 months ago

Maybe it's something related to the changes in the GCP client?

JorTurFer commented 9 months ago

Do you see errors in the KEDA operator logs? Maybe we are not closing the client properly on failures? Could this be related to https://github.com/kedacore/keda/issues/5429? (The scaler cache is being refreshed on each error.)

JorTurFer commented 9 months ago

Yeah, the new queryClient isn't closed, so if the scaler is being refreshed due to #5429, the connections aren't properly closed. I guess that could be the root cause (I'll update my PR).
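
For readers following the thread, here is a minimal sketch of the leak pattern being described; it is not KEDA's actual code (the pubsubScaler and newPubsubScaler names are invented; only queryClient and the monitoring client correspond to the discussion). Each scaler-cache refresh builds a new scaler and client, so if Close() never closes the gRPC-backed client, its background goroutines accumulate:

```go
package main

import (
	"context"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
)

// pubsubScaler is a stand-in for the GCP Pub/Sub scaler discussed above.
type pubsubScaler struct {
	// queryClient holds a gRPC connection that runs background goroutines
	// until Close() is called on it.
	queryClient *monitoring.QueryClient
}

func newPubsubScaler(ctx context.Context) (*pubsubScaler, error) {
	c, err := monitoring.NewQueryClient(ctx)
	if err != nil {
		return nil, err
	}
	return &pubsubScaler{queryClient: c}, nil
}

// Close releases the client. If this is a no-op (as in the bug described
// above), every scaler-cache refresh leaks the client's goroutines.
func (s *pubsubScaler) Close(ctx context.Context) error {
	if s.queryClient != nil {
		return s.queryClient.Close()
	}
	return nil
}
```

Combined with #5429 refreshing the scaler cache on every error, each refresh would then add a new set of never-closed goroutines, which matches the unbounded growth in the reporter's graph.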