serverless / serverless

⚡ Serverless Framework – Effortlessly build apps that auto-scale, incur zero costs when idle, and require minimal maintenance using AWS Lambda and other managed cloud services.
https://serverless.com
MIT License

Serverless Rate exceeded for Cloudformation API #8040

Open Fiiii opened 4 years ago

Fiiii commented 4 years ago

Hi there! I have an issue with throttling on the CloudFormation API (screenshot of the error attached).

The issue occurs during serverless deployment:

  ServerlessError: An error occurred: BroadcastToClientsEventSourceMappingDynamodbChannelsMessagesTable - Rate exceeded for operation 'Rate exceeded (Service: Lambda, Status Code: 429, Request ID: b8b7796b-e9d7-4c84-9615-7d2a98c1c85b, Extended Request ID: null)'.
      node_modules/serverless/lib/plugins/aws/lib/monitorStack.js:125:33
  From previous event:
      at AwsDeploy.monitorStack (node_modules/serverless/lib/plugins/aws/lib/monitorStack.js:28:12)
      at node_modules/serverless/lib/plugins/aws/lib/updateStack.js:115:28

I'm pasting the CloudFormation error. It's always the same, on SQS and DynamoDB stream event source mappings.

This is my investigation:

  1. CloudFormation resources limit (~160 resources) - checked
  2. Resource name limit (64 characters) - I renamed the resources to short names and the error still occurs randomly - checked. The core issue always comes from EventSourceMapping resource creation, e.g. BroadcastToClientsEventSourceMappingDynamodbChannelsMessagesTable - Rate exceeded for operation 'Rate exceeded (Service: Lambda, Status Code: 429)
  3. The error occurs only on new stack creation - all resources are triggered to deploy, and then CloudFormation returns a rate exceeded response on an EventSourceMapping resource - checked
  4. Deploying a new stack only works after commenting out ~50% of the functions in the resources files - checked
  5. The stack-splitting plugin doesn't help in type/group mode
serverless.yml

```yaml
broadcastToClients:
  handler: bin/broadcastToClients
  package:
    include:
      - bin/broadcastToClients
  events:
    - stream:
        type: dynamodb
        batchSize: 1
        enabled: true
        startingPosition: LATEST
        arn:
          Fn::GetAtt:
            - channelsMessagesTable
            - StreamArn

channelBackup:
  handler: bin/channelBackup
  package:
    include:
      - bin/channelBackup
  events:
    - sqs:
        exist: true
        arn:
          Fn::GetAtt:
            - backupQueue
            - Arn
```
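
For context: each stream / sqs event above compiles into an AWS::Lambda::EventSourceMapping resource, which is the resource CloudFormation reports as throttled. A rough sketch of what such a generated resource looks like (values are illustrative, not copied from my actual compiled template):

```yaml
# Sketch of the kind of resource the Framework generates for the dynamodb stream event
# above (logical IDs and referenced names are illustrative, not taken from my template)
BroadcastToClientsEventSourceMappingDynamodbChannelsMessagesTable:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    BatchSize: 1
    Enabled: true
    StartingPosition: LATEST
    EventSourceArn:
      Fn::GetAtt: [channelsMessagesTable, StreamArn]
    FunctionName:
      Fn::GetAtt: [BroadcastToClientsLambdaFunction, Arn]
```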
deploy command

```
sls deploy --verbose --stage xyz
```
 Node Version:              12.18.1
 Framework Version:         1.73.1
 Plugin Version:            3.6.14
 SDK Version:               2.3.1
 Components Version:        2.31.6

Installed version

1.73.1
medikoo commented 4 years ago

@Fiiii "rate exceeded" happens when you issue too many requests in a given time frame, and the Framework by default retries on such errors (4 retries by default; on the fifth failure it crashes).

Can you confirm you see retries in the logs? You may also raise the number of retries by tweaking the SLS_AWS_REQUEST_MAX_RETRIES env var.

Fiiii commented 4 years ago

@medikoo I set it to 10 and it's still the same :) + I can confirm that I see the retries in the debug logs

medikoo commented 4 years ago

@Fiiii I don't think it's an issue on the Framework side. If you believe your requests are rejected without a valid reason, you should contact AWS support.

Still, if you feel the number of requests can be optimized on the Framework side, we're open to suggestions.

Fiiii commented 4 years ago

tbh I changed it to 20 now, and it looks like the deploy succeeded - I will keep testing :) and close the issue if it works :) thanks @medikoo!

*edit: unfortunately the errors still occur :/

adrianbanasiak commented 4 years ago

Hi guys, I have exactly the same issue, which didn't happen before. Creation of a new stack (new development environment) consistently breaks at the same point with the following error:

An error occurred: EventsDashconsumerEventSourceMappingSQSEventsQueue - Rate exceeded for operation 'Rate exceeded (Service: Lambda, Status Code: 429, Request ID: cd6e85e8-5115-4bb3-8c9a-7746c112ca86, Extended Request ID: null)'..

I am deploying to a new region; I've checked the AWS quotas but didn't find anything unusual.

It also happens on latest Serverless: 1.78

herebebogans commented 4 years ago

These are not Serverless API-call throttling errors you are seeing - they happen on the CloudFormation/AWS side. You will see similar errors if you try to create too many concurrent Dynamo tables or Kinesis streams in a CF template.

adrianbanasiak commented 4 years ago

These are not Serverless API-call throttling errors you are seeing - they happen on the CloudFormation/AWS side. You will see similar errors if you try to create too many concurrent Dynamo tables or Kinesis streams in a CF template.

Of course they are not; the error message states that they are coming from AWS.

I am reporting it because, in my opinion, Serverless should be able to work around such cases, or tell the user what to check or modify in order to deploy their stack.

medikoo commented 4 years ago

I am reporting it because, in my opinion, Serverless should be able to work around such cases

Serverless does that with its retry logic, which you can fine-tune as I explained above. That's all we can do; the rest is in the hands of AWS.

claydanford commented 4 years ago

I ran into this very issue today.

adrianbanasiak commented 4 years ago

@medikoo I can't agree, because Serverless is a framework, and a framework abstracts provider-specific details. I've tried with SLS_AWS_REQUEST_MAX_RETRIES set even to 40 and it doesn't change anything. I'm still unable to create a stack that was previously created multiple times. Maybe there is a way to specify the delay between retries?

herebebogans commented 4 years ago

Tuning any retry logic in Serverless is not going to fix these errors. It is not the AWS API calls that Serverless is issuing that are being throttled; CloudFormation is surfacing these throttling errors from the AWS services themselves.

Some of these limits are documented - some are not. An example of one I've run into in the past with Kinesis Streams:

https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

You receive a LimitExceededException when making a CreateStream request when you try to do one of the following: Have more than five streams in the CREATING state at any point in time.

It's possible to do some DependsOn ordering in CloudFormation to work around some of these - see the sketch at the end of this comment.

The original post seems to indicate they are hitting some limit on the number of concurrent DynamoDB stream event source mappings. I can't find any documentation on this one; I'd suggest raising an AWS ticket to see if this is the case.
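
For illustration, a rough sketch of that kind of DependsOn ordering expressed in serverless.yml. The logical IDs are hypothetical (take the real ones from the compiled template, e.g. after sls package), and this assumes the Framework merges entries under resources.Resources into the generated resources by logical ID:

```yaml
# serverless.yml (sketch) - chain two generated event source mappings so CloudFormation
# creates them one after another instead of concurrently.
# The logical IDs below are hypothetical; copy the real ones from the compiled template.
resources:
  Resources:
    SecondFnEventSourceMappingSQSSecondQueue:
      DependsOn:
        - FirstFnEventSourceMappingSQSFirstQueue
```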

medikoo commented 4 years ago

@medikoo I can't agree, because Serverless is a framework, and a framework abstracts provider-specific details

We're open to solution proposals. Still, as @herebebogans explained, I hardly see any possible solution to this that can be done on Framework grounds (but I may be missing something, so again, we're open to suggestions).

adrianbanasiak commented 4 years ago

Sure, I am waiting for AWS support response. Will follow up when I get something back.

Richie-Cluno commented 4 years ago

Was working fine until last week and we are facing the exact same issue:

ProcessSfRecordsChangedEventSourceMappingSQSSfRecordsChangedQueue Rate exceeded for operation 'Rate exceeded (Service: Lambda, Status Code: 429, Request ID: 1115cc57-412a-40c6-8207-a49f91761fcb, Extended Request ID: null)'.

swarnalatha935 commented 4 years ago

Sure, I am waiting for AWS support response. Will follow up when I get something back.

@adrianbanasiak Have you heard anything from them?

adrianbanasiak commented 4 years ago

@swarnalatha935 to be honest, no. Only some generic answers not related to the core of the question. They are pointing me to premium support, which I'll explore after my vacation.

herebebogans commented 4 years ago

We've also started getting similar errors intermittently when deploying stacks with multiple Kinesis event source mappings to a handler. I'll raise it with our TAM and see if I can get some answers for everyone.

swarnalatha935 commented 4 years ago

@adrianbanasiak and @herebebogans, I recently contacted the AWS support team regarding this issue. Here is the response I got from them:

"Thank you for contacting AWS Premium Support. My Name is XXXXX , and I will be glad to assist! I understand that a CFN stack is failing due to error message related to exceeding a specific API quota when creating an event source mapping for a Lambda function. Please correct me if I am wrong.

I went ahead and reviewed your last conversation with my colleague and can confirm that the way to remediate this issue is to use the “DependsOn” attribute [1] on the CFN resource. This way the API operations performed by CFN on the backend are serialized so that you do not incur in exceeding any API quotas.

Regarding the exact number for this API quota being breached, this information do not seems to be publicly available [2] at the moment so I have no visibility over that. Note that even by not knowing this exact quota limit, implementing the solution provided will remediate the issue you are facing.

Regarding how to determine how many times this specific API call “CreateEventSourceMapping” is being made by other CFN stacks. I would suggest to look at CloudTrail events and filter for that specific API call. You can set “Event Name” as “CreateEventSourceMapping” and you will see the invocations performed when deploying the CFN stack."

I resolved the issue by adding serverless-dependson-plugin to the serverless.yml file and adding "serverless-dependson-plugin": "^1.1.2" to package.json. I am no longer getting the rate exceeded errors. Here is the reference link: https://www.npmjs.com/package/serverless-dependson-plugin.
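
For reference, the serverless.yml side of that setup is roughly the following (plugin name taken from the npm link above; everything else is illustrative):

```yaml
# serverless.yml (sketch) - register the plugin; as I understand it, it adds DependsOn
# ordering between the generated function resources so they are created sequentially
plugins:
  - serverless-dependson-plugin
```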

Hope this helps......

herebebogans commented 4 years ago

Judging by the number of people raising this recently, I'm assuming there's been some recent change to how concurrent creation of event source mappings is handled. I'll see if I can get some info from our TAM.

herebebogans commented 4 years ago

@swarnalatha935 thanks - I was not aware of that plugin.

claydanford commented 4 years ago

I am grateful for the information, but the DependsOn plugin is a poor solution. We need something better from AWS.

aralmpa commented 4 years ago

I believe that many colleagues having this issue are probably also using serverless-plugin-split-stacks, which offers a configuration option to control the rate at which the stacks are deployed. I guess it's also a rather temporary (or "poor") solution, but I need a fast way to unblock the development process for now.
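
For anyone who wants to try it, the configuration I'm referring to looks roughly like this - treat the concurrency option names as assumptions and double-check them against the plugin's README before relying on them:

```yaml
# serverless.yml (sketch) - serverless-plugin-split-stacks with a throttled deployment.
# The concurrency option names below are assumptions from memory of the plugin docs.
plugins:
  - serverless-plugin-split-stacks
custom:
  splitStacks:
    perType: true            # split resources into nested stacks by type
    stackConcurrency: 5      # limit how many nested stacks deploy in parallel
    resourceConcurrency: 10  # limit concurrent resource operations within a stack
```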

I started using this option and it seemed to work, or at least I was able to bypass the rate exceeded error. BUT after some time (and/or more deployments) another weird error/behavior emerged: CloudFormation raised a 409 error complaining that SQS event source mappings already existed for some functions - this error had also appeared recently without the split-stacks configuration option.

The point of my post is:

  1. maybe the split-stacks plugin configuration option might help somebody
  2. since this started happening quite recently, it probably reflects a change on the AWS side that produces erroneous behavior during event source mapping setup (SQS in particular) in the Lambda service

I also opened tickets with AWS support - if I get more info I'll share it here.

swarnalatha935 commented 4 years ago

@aralmpa I also got the "event source mapping already exists" errors; to resolve them I used the "aws lambda delete-event-source-mapping --uuid 'xxxxxxxxxxxxxxxx' --region xxxxx" CLI command to delete those mappings.

herebebogans commented 4 years ago

@swarnalatha935 It's not uncommon for AWS to have different API limits per region.

herebebogans commented 4 years ago

I'm consistently getting the error now with new stacks that involve SQS event source mappings. Same stacks deployed fine previously.

I've raised a support ticket + asked our TAM to follow it up. Will update!

herebebogans commented 4 years ago

@pyaesone17 I don't think it's related to the Serverless version.

When I diff the cloudformation-template-update-stack.json generated by the latest release against a release I was using before the issue appeared (1.74.1), there are no differences in any of the Lambda or SQS resources & event mappings - differences you would expect to see if the Serverless-generated CF were the cause.

I've also noticed the issue is intermittent - the stack will deploy fine on some occasions and other times fail with the rate limiting error.

Edit - Just tried again with 1.74.1 and same rate limiting error.

medikoo commented 4 years ago

that the way to remediate this issue is to use the “DependsOn” attribute [1] on the CFN resource.

I believe then, that we should add this in Framework. PR with that is welcome!

pyaesone17 commented 4 years ago

@pyaesone17 I don't think it's related to the Serverless version.

When I diff the cloudformation-template-update-stack.json generated by the latest release against a release I was using before the issue appeared (1.74.1), there are no differences in any of the Lambda or SQS resources & event mappings - differences you would expect to see if the Serverless-generated CF were the cause.

I've also noticed the issue is intermittent - the stack will deploy fine on some occasions and other times fail with the rate limiting error.

Edit - Just tried again with 1.74.1 and same rate limiting error.

Yes, agreed - it was just intermittent; the error popped up again.

claydanford commented 4 years ago

that the way to remediate this issue is to use the “DependsOn” attribute [1] on the CFN resource.

I believe then, that we should add this in Framework. PR with that is welcome!

@medikoo, this issue would still come up when deploying a normal CloudFormation stack, so it does not make sense to implement such a thing in Serverless, and it would slow down deployments. The suggestion here is to daisy-chain function deployments instead of running them in parallel. This is something that should be fixed by AWS.

herebebogans commented 4 years ago

I got an update from our TAM that the issue has been reported by a number of customers and is being investigated by the Lambda service team.

medikoo commented 4 years ago

@medikoo, this issue would still come up when deploying a normal CloudFormation stack, so it does not make sense to implement such a thing in Serverless, and it would slow down deployments. The suggestion here is to daisy-chain function deployments instead of running them in parallel. This is something that should be fixed by AWS.

@claydanford that's a very valid point, thanks for sharing. In light of that, let's wait for AWS to fix it on their side.

beanaroo commented 4 years ago

We were consistently getting this issue for a deployment too, on a Kinesis stream handler. Using CloudTrail + Athena, we identified another deployment's role producing a lot of DescribeStream throttles.

SELECT eventname,
       errorcode,
       eventsource,
       awsregion,
       useragent,
       useridentity.sessioncontext.sessionissuer.arn,
       COUNT(*) count
FROM cloudtrail_logs
WHERE errorcode = 'ThrottlingException'
  AND eventtime BETWEEN '2020-08-23T03:00:08Z' AND '2020-08-24T08:15:08Z'
GROUP BY errorcode, awsregion, eventsource, useragent, eventname, useridentity.sessioncontext.sessionissuer.arn
ORDER BY count DESC

Deleting that stack allowed us to continue with deployment before redeploying the other stack.

Not entirely sure if this is related to the above, but we experienced the same error.

avazula commented 4 years ago

We used serverless-dependson-plugin to bypass this issue (deploying within a VPC) and it would work just fine, but since Sunday (2020-08-23) it seems to be working properly without it again? Does anyone have any news on this?

herebebogans commented 4 years ago

I had word from our TAM that a fix for the issue was scheduled for release on Sunday evening (US) in US regions and is being rolled out to the other Lambda regions progressively. I can't pass on the limit for concurrent Lambda event source mapping creations, as they don't wish to publish this info, but the number mentioned sounded very reasonable.

Reporters in this thread, please retry deploying stacks that had the issue!

EDIT - sorry, I got a message this morning that the fix was NOT deployed yet.

marcusirgens commented 4 years ago

We ran into this issue this morning, and it somehow resolved itself some time during the day, @herebebogans.

sergioflores-j commented 4 years ago

We ran into this issue this morning, and it somehow resolved itself some time during the day

Same here - we've re-run all our jobs and "magically" it worked.

claydanford commented 4 years ago

I received this from support. No issues in the last few days.

This is *** from AWS Premium Support again. I hope you are well today! I am reaching out again to provide an update on the current status of investigating the intermittent 429 errors experienced when updating the AWS::Lambda::EventSourceMapping in your CloudFormation template.

After working with some engineers on the Deployment team who cover CloudFormation in greater depth, it appears that this is related to an ongoing issue that has been brought to the attention of the internal Lambda and CloudFormation teams and is centered around how CloudFormation is handling the retry logic for the EventSourceMapping calls. When these exceptions are raised, they remain internal to the CloudFormation service and therefore cannot be seen in the CloudTrail logs for Lambda.

When reviewing the events from the CloudFormation template, I noticed that after the initial update failures, the resource was able to update successfully several times afterward. As the internal teams are working on any necessary corrections, I would like to ask if you have noticed this behavior in the last few days? If there are any new examples of the 429 errors on the AWS::Lambda::EventSourceMapping calls, I would like to kindly ask for the request ID(s) for further troubleshooting and comparison.

While I cannot provide an exact ETA on when the internal teams will complete any necessary corrections, please rest assured that the teams are working to fix this behavior and I will update you promptly when there are new developments.

aralmpa commented 4 years ago

I (and I guess everybody else) also received a similar response from support. I have been deploying my stack for 2 days now without facing the issue anymore so it seems that AWS has mitigated the issue.

Unfortunately, stack resources (SQS event source mappings) that were created while the 429 issue was occurring are stuck in an erroneous/hidden state and need to be removed manually using the AWS CLI, as @swarnalatha935 correctly (and helpfully!) mentioned further above.

herebebogans commented 4 years ago

I had feedback from our TAM that the rollout of the fix was completed to all Lambda regions at the end of September. The stacks where I was having this issue now deploy consistently!

For people that were having this issue, I'd also suggest you check aws lambda list-event-source-mappings for ESMs with "Problem: function not found" and then delete them with aws lambda delete-event-source-mapping.

We had a number of these in our accounts from failed previous deployments and they were preventing deployment of the new stack.