kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Increase operator resiliency to unexpected scaler failures #5622

Open cyrilico opened 7 months ago

cyrilico commented 7 months ago

Proposal

As a(n independent) follow-up to #5619, I think it would be interesting to start a discussion on potential improvements to the KEDA operator's resiliency, specifically in the case of unexpected or catastrophic scaler failures. In the linked issue, a problematic query caused an outage in our KEDA operators, which prevented all ScaledObjects in the cluster from operating correctly until my team pinpointed the issue and removed that particular scaler configuration. While a quick fix for that specific scaler has been proposed, we worry that similar issues may arise in the future.

While we are not familiar enough with the codebase to immediately suggest potential paths, if you are able to provide some pointers and initial thoughts, we'd love to keep engaging and, if the opportunity arises, provide a contribution down the line 🙏

Use-Case

A higher operator resiliency to scaler failures

Is this a feature you are interested in implementing yourself?

Maybe

Anything else?

No response

JorTurFer commented 7 months ago

This is an interesting point. We aim to prevent all panics in code rather than just recovering from them, but maybe we could recover from panics on scaler metric requests: https://github.com/kedacore/keda/blob/1e1cfb11d6ca826d7c083e9aba730e08f3bd24f4/pkg/scaling/cache/scalers_cache.go#L125-L142

As scalers are where most contributions are made, they're also where most unexpected problems appear, and although I think we should avoid panics, recovering may make sense in this case.

On the other hand, we have very few panics because we try to cover all the cases, and we almost achieve it. WDYT @zroubalik @dttung2905?

zroubalik commented 7 months ago

Yeah, we should avoid panics, I agree.

JorTurFer commented 7 months ago

Are you willing to open a PR with this recover, @cyrilico?

cyrilico commented 7 months ago

I'll gladly take a shot at it whenever I get some time 🙏

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

cyrilico commented 3 months ago

not stale, just haven't had the time yet, go away bot