The adapter is leaking SignalFlow jobs. For active clusters, where HPA definitions are constantly added, removed, or updated, the backend will eventually throttle job creation.
There are two reasons leading to this behavior:
1- The `stop` backend call that the adapter currently uses has been broken for a while. When a job stop is issued, it fails silently: the job gets removed from the adapter but keeps running in the backend. With the current adapter code, the only way to stop running jobs is to disconnect the client, i.e., restart the adapter.
I reported this to our backend team and they are fixing it; however, the recommended way to end a job is the `detach` call, which takes the websocket channel name as an argument.
2- The job handle (a.k.a. job ID) is received asynchronously from the API server after the job has been submitted and started.
The server may be delayed in sending it, hence the 10-second metadata timeout defined here . The server can be delayed longer than 10 seconds, especially when the token is being throttled on jobs; this leads to a job leak because we can't issue a `stop` without the handle.
The solution to the above is to issue a `detach` call instead of `stop`. I pushed two PRs on https://github.com/signalfx/signalfx-go to add such support: https://github.com/signalfx/signalfx-go/pull/186 and https://github.com/signalfx/signalfx-go/pull/187. `detach` uses the channel name, which is generated on the client side and sent to the server; with this logic we can also solve item 2, i.e., stopping jobs without a handle.
Finally, I introduced an option to make the metadata timeout value configurable; it's exposed through the helm chart.
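The key property can be shown in a short sketch. All names below are illustrative, not signalfx-go's actual API: because the channel name is minted on the client before the job is submitted, it is always available for `detach`, even when the server never returns a job handle.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var channelSeq uint64

// newChannelName generates a client-side channel name, modeling how a
// signalflow client names the websocket channel it opens for a job.
func newChannelName() string {
	return fmt.Sprintf("channel-%d", atomic.AddUint64(&channelSeq, 1))
}

type job struct {
	channel string // known immediately: generated on the client
	handle  string // arrives late from the server, or never (timeout)
}

// detach ends the computation by channel name; unlike stop-by-handle,
// it never needs the server-assigned job handle.
func detach(j *job) string {
	return fmt.Sprintf("DETACH %s", j.channel)
}

func main() {
	// Handle is still empty (server delayed), yet detach works.
	j := &job{channel: newChannelName()}
	fmt.Println(detach(j))
}
```

This is why detach-by-channel closes the item 2 leak: the one piece of state it needs exists from the moment the job is submitted.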