taimos / cfn-tail

Tail for AWS CloudFormation stack events
MIT License
24 stars 6 forks source link

Rate limit exceeded #2

Closed atkinsonm closed 6 years ago

atkinsonm commented 6 years ago

Occasionally when performing an update, cfn-tail will run into API throttling with its CloudFormation DescribeStackEvents calls. This happens most often when deploying multiple projects at the same time. The current behavior is that cfn-tail terminates with a failure message. We want to be able to recover from this issue so we can continue to receive output of the stack events until the stack create/update is complete and also to reduce noise from failed CI jobs.

image

atkinsonm commented 6 years ago

AWS Support confirmed our evidence that DescribeStackEvents calls are being throttled. This project does not specify retry options, so it is using the defaults. The maxRetries is set by CloudFormation and the retryDelayOptions are using the default. Some options:

  1. Set custom retryDelayOptions with a base higher than the default 100ms. CloudFormation resources can take time to spin up, so waiting longer between DescribeStackEvents calls is acceptable.
  2. Increase the maxRetries from the default (which appears to be 4 for CloudFormation).

With the current settings, the retries will be delayed to 100ms, 200ms, 400ms, and 800ms, after which point the command will fail. For CloudFormation stack events, I believe waiting up to 5-10 seconds may be acceptable since many resources will take longer than that time to complete, during which time the stack events log will remain the same. Therefore I believe the simplest change would be to configure AWS with a new base exponential backoff value.

hoegertn commented 6 years ago

This sounds great. Would you be able to submit a PullRequest?

atkinsonm commented 6 years ago

@hoegertn PR submitted 👍

hoegertn commented 6 years ago

released as 1.4.0

atkinsonm commented 5 years ago

@hoegertn, could you please reopen this issue? As my CI pipelines do more and more simultaneously, I am finding that this has crept up again. Some ideas on resolution:

  1. Increase the retryDelayOptions base backoff milliseconds with the logic that CloudFormation stack events will generally be updated infrequently.
  2. Add --retry-ms as an optional command line argument to allow a developer to supply their own backoff time delay in case they have an application known to trip the API throttling triggers. I like this option better since it is unopinionated and would still allow us to set a default value.