signalfx / signalfx-k8s-metrics-adapter

Horizontal Pod Autoscaler custom/external metrics provider for Kubernetes that uses SignalFx as the backend
Apache License 2.0

Fix possible signalflow job leaks and introduce new option #29

Closed dloucasfx closed 1 year ago

dloucasfx commented 1 year ago

The adapter is leaking signalflow jobs. On active clusters, where HPA definitions are constantly added, removed, or updated, the backend will eventually throttle job creation.

Two reasons lead to this behavior:

1- The stop backend call that the adapter currently uses has been broken for a while. When a job stop is issued, it fails silently: the job is removed from the adapter but keeps running in the backend. With the current adapter code, the only way to stop running jobs is to disconnect the client, i.e. restart the adapter. I reported this to our backend team and they are fixing it; however, the recommended way to end a job is the detach call, which takes the websocket channel name as its argument.

2- The job handle (aka jobID) is received asynchronously from the API server after the job has been submitted and started. The server can be delayed, hence the 10-second metadata timeout defined here. The delay can exceed 10 seconds, especially when the token is being throttled on jobs; this leads to a job leak, because we can't issue a stop without the handle.

The solution to both problems is to issue a detach call instead of stop. I pushed 2 PRs to https://github.com/signalfx/signalfx-go to add support for it: https://github.com/signalfx/signalfx-go/pull/186 and https://github.com/signalfx/signalfx-go/pull/187
detach uses the channel name, which is generated on the client side and sent to the server. With this logic we can also solve item 2, i.e. stop jobs without a handle.

Finally, I introduced an option that makes the metadata timeout value configurable; it is exposed through the helm chart.