Closed · nixwiz closed this issue 4 years ago
We should definitely be using exponential backoff for asset downloads.
While I think that having exponential backoff is something that is needed, I think it's needed for a very different reason. Exponential backoff makes sense for errors such as network, auth, etc., but I think a checksum mismatch is a special case needing its own logic, even when coupled with an exponential backoff.
A checksum mismatch essentially falls into several scenarios:
The only scenario from the above I can think of where fully automating retries would effect a change without a direct action to sensu is if you had a compromised asset which prevents you from downloading it, and you have pushed a sanitized package to replace it. Personally I would recommend "yanking" the compromised package/asset rather than replacing it, because package managers generally won't be able to detect that there was a change since the version constraints match what is already installed. I am not sure if sensu currently verifies the checksum of the installed contents and decides whether it needs to unpack it again, though my gut tells me that is unlikely.
I think it might make sense to store, and later clear, some state when an asset is unable to download due to a checksum mismatch. We would need to be careful not to accidentally DoS ourselves with a bug, but I think that might be a better path forward specifically for this type of issue. A rough illustration is sketched below.
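As a sketch of what I mean (nothing here is existing sensu-go code; the type and method names are invented), the downloader could remember which name/checksum pair already failed verification and skip further attempts until the asset definition changes or the entry is cleared:

```go
package asset

import "sync"

// checksumFailures remembers which asset definitions have already failed
// checksum verification, keyed by asset name.
type checksumFailures struct {
	mu     sync.Mutex
	failed map[string]string // asset name -> checksum that failed to verify
}

// ShouldAttempt reports whether a download should be tried: yes if this
// asset has never failed, or if its configured checksum has changed since
// the recorded failure (i.e. the operator updated the asset definition).
func (c *checksumFailures) ShouldAttempt(name, checksum string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	prev, ok := c.failed[name]
	return !ok || prev != checksum
}

// RecordFailure marks a name/checksum pair as having failed verification.
func (c *checksumFailures) RecordFailure(name, checksum string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.failed == nil {
		c.failed = make(map[string]string)
	}
	c.failed[name] = checksum
}

// Clear removes the record, e.g. when an operator deletes the asset cache.
func (c *checksumFailures) Clear(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.failed, name)
}
```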
@majormoses I like this idea. If people did want to force the replacement, they could always delete their asset cache (albeit at the cost of bandwidth).
Let's implement exponential backoff :) The backoff value might differ depending on the resource type.
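To make the idea concrete, here is a minimal sketch of capped exponential backoff with jitter wrapped around an asset fetch. The `fetchAsset` helper and the delay values are made up for illustration; they are not sensu-go's real API or settings.

```go
package asset

import (
	"context"
	"math/rand"
	"time"
)

// fetchAsset stands in for the real download-and-verify step.
func fetchAsset(ctx context.Context, url string) error {
	// ...download, verify the sha512, unpack...
	return nil
}

// fetchWithBackoff retries fetchAsset with capped exponential backoff
// plus jitter, so repeated failures space themselves out instead of
// hammering the asset host.
func fetchWithBackoff(ctx context.Context, url string) error {
	const (
		baseDelay = 2 * time.Second
		maxDelay  = 5 * time.Minute
	)
	delay := baseDelay
	for {
		if err := fetchAsset(ctx, url); err == nil {
			return nil
		}
		// Sleep for the current delay plus some jitter, unless cancelled.
		jitter := time.Duration(rand.Int63n(int64(delay / 2)))
		select {
		case <-time.After(delay + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}
		// Double the delay, up to the cap.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

The cap keeps a persistently broken asset from drifting into hours-long waits, and the jitter prevents many agents from retrying in lockstep.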
Update: After working out the kinks of implementing exponential backoff for asset downloads, we discovered that it would have large implications on the fundamental concepts of the backend pipeline, specifically with regard to how handlers are queued for execution.
In the case of abusive asset downloads, exponential backoff would certainly limit the number of times that an asset would be retrieved; however, in doing so it would create a backlog of handler executions piling up in the meantime. As an operator, if I noticed that this was due to a misconfiguration and then made the appropriate change, that change would not take effect in the pipeline until the execution queue was cleared, either after an unknown amount of time or a backend restart.
This issue is only present for backend assets, since agent assets (checks and hooks) have a concept of being in progress, so subsequent check requests are dropped if the original check is still in progress. This behavior poses a conflict for backend resources such as handler assets, because there are cases where dropping events would be beneficial (e.g. stale Slack alerts) and cases where dropping events would be disadvantageous (e.g. time series metrics).
It would not be wise to alter the current flow of the pipeline, so for that reason we will not implement exponential backoff to resolve this issue (see WIP branch). Rather, we will attempt a rate limiter to satisfy the request.
cc/ @echlebek @palourde
tl;dr Implement rate limiting rather than exponential backoff and table a discussion for backend pipeline queueing.
@nikkixdev I think that makes sense, but I only see this as a partial mitigation. From what you describe, you might have multiple fetchers (one per agent and backend) and you are rate limiting each one, but not at a central choke point, so a large number of clients or misconfigured assets still creates a stampede. Over time it does not improve the situation; it just means a steady stream of requests from each "client". An exponential backoff, on the other hand, sends fewer requests over time, which helps protect against a DoS. I have seen many outages prolonged by applications/services not having a way to back off properly.
Do we have the concept of multiple queues? If so, maybe splitting the events up based on certain event types and/or other data points could help with this. That way we could choose to have different retry logic, buffer sizes, data retention, etc., and it would be easier to make such a refactoring decision for a specific scenario or component.
I have not thought this out all the way, but maybe we could create some sort of an internal (aggregate-like?) check that would surface all of this to an operator and allow them to quickly identify the problem. I think we need a way to quickly surface this as well as provide some mitigations.
I agree the mitigation is partial. Here's why we have chosen this path for the time being:
While exponential back-off is a no-brainer for agents, it's a little more complicated for handler assets, which are downloaded by the backend, and right now we only have a single kind of asset downloader.
Implementing exponential back-off for handlers is a little more delicate given the more parallel nature of handlers. Due to the pipeline design, a small number of mis-configured handlers could create havoc in an otherwise working sensu installation, choking the pipeline with a cycle of exponential back-off, failure, and retry. Handlers can't be allowed to block for long, or they will create a backlog of work that may never be finished, assuming events are arriving regularly. On the backend, it is far easier to control the rate of requests for assets with a simple rate limiter that's applied at the application level.
On the agent, this is not an issue. It's fine to exponentially back off, and try forever. If check requests arrive while this retry process is occurring, they will not be scheduled, as there will already be a pending request.
It's because of the complexities of handlers on the backend that we chose to implement an overly simple solution in the asset downloader for now. I don't think this precludes any kind of exponential backoff being added to the agent, and it also doesn't preclude improvements to how handlers are executed in parallel either. It's just the simplest approach we could think of right now, that we know won't have any undesirable effects.
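For anyone wondering what a "simple rate limiter applied at the application level" could look like, here is a rough sketch using golang.org/x/time/rate. The limiter values and the `fetchAsset` helper are placeholders for illustration, not what sensu-go actually ships.

```go
package asset

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// fetchAsset stands in for the real download-and-verify step.
func fetchAsset(ctx context.Context, url string) error { return nil }

// Allow one download every 10 seconds on average, with a burst of 3.
// These numbers are placeholders, not sensu-go's real defaults.
var downloadLimiter = rate.NewLimiter(rate.Every(10*time.Second), 3)

// rateLimitedFetch gates every asset download behind a process-wide
// token bucket, so a misconfigured asset can only generate a bounded
// number of fetch attempts per unit time.
func rateLimitedFetch(ctx context.Context, url string) error {
	// Wait blocks until a token is available or the context is cancelled.
	if err := downloadLimiter.Wait(ctx); err != nil {
		return err
	}
	return fetchAsset(ctx, url)
}
```

Because the limiter blocks rather than retries, it bounds outbound traffic without changing the order or semantics of the pipeline work behind it.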
Do we have the concept of multiple queues?
Right now, there are very few explicit queues in sensu-go, and no full-featured ones. In particular, our current pipeline execution system is simple but naive, and does not have explicit cancellation semantics. Users have no way to inspect or administer the implicit queue of handlers, mutators, or filters that are trying to execute.
maybe we could create some sort of an internal (aggregate like?) check that we could surface all of this to an operator
Regardless of how it's implemented, we need a way to provide observability into pipeline actions, and possibly give people a way to delete pipeline execution jobs, if we decide that they should work in a blocking way.
The tl;dr:
I copy/pasted the wrong sha512sum for an asset (a handler asset, to be specific). When the backend attempted to download the asset it failed with the appropriate error, but it kept retrying the download until it triggered GitHub's abuse detection mechanism. It was just under an hour before I realized my mistake, but one backend server had attempted this download over 1300 times in that timeframe.
Expected Behavior
This mismatch should not generate this amount of traffic.
Also, it would be really, really nice if items like this created events (similar to what is requested in #2894) to surface themselves outside of the logs.
Current Behavior
Too many retries in a short amount of time if an asset is misconfigured.
Possible Solution
Exponential backoff for these failures?
Steps to Reproduce (for bugs)
Context
Even once I had fixed the checksum, I could not download the proper asset until the GitHub abuse detection cleared.
Your Environment