Open cyrilico opened 7 months ago
This is an interesting point. We aim to prevent panics in the code itself rather than just recovering from them, but maybe we could recover from panics on scaler metric requests: https://github.com/kedacore/keda/blob/1e1cfb11d6ca826d7c083e9aba730e08f3bd24f4/pkg/scaling/cache/scalers_cache.go#L125-L142
As scalers are the place where most contributions are made, they're also the place with the most unexpected problems, and although I think that we should avoid panics, maybe in this case recovering can make sense.
On the other hand, we have very few panics because we try to cover all the cases, and we almost manage to. WDYT @zroubalik @dttung2905?
Yeah, we should avoid panics, I agree.
Are you willing to open a PR with this recover @cyrilico ?
I'll gladly take a shot at it whenever I get some time
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
not stale, just haven't had the time yet, go away bot
Proposal
As a(n independent) follow-up to #5619, I think it would be interesting to start a discussion on potential improvements to the KEDA operator's resiliency, more specifically in the case of unexpected/catastrophic scaler failures. In the linked issue, the problematic query caused an outage in our KEDA operators which prevented all ScaledObjects in the cluster from operating correctly until my team was able to pinpoint the issue and remove that particular scaler configuration. While a quick fix for that specific scaler has been proposed, we worry that similar issues may arise in the future.
While we are not familiar enough with the codebase to immediately suggest potential paths, if you are able to provide some pointers and initial thoughts, we'd love to keep engaging and, if the opportunity arises, provide a contribution down the line
Use-Case
Higher operator resiliency to scaler failures
Is this a feature you are interested in implementing yourself?
Maybe
Anything else?
No response