guardian / ophan-housekeeper

Lambda to remove Ophan Email Alerts for bouncing email addresses

Fix bug loading AWS credentials that stopped alert-deletion & SNS #7

Closed rtyley closed 4 years ago

rtyley commented 4 years ago

AWS placed our SES account under review (giving us notice that they could block our ability to send email :email: :skull: ) on Saturday 15th February:

Your current bounce rate is 10.63%. We measured this rate over the last 10,028 eligible emails* you sent. Our analysis covers the last 4.3 days.

image https://logs.gutools.co.uk/s/ophan/goto/74968773d968bb5f2b8f285bd3354002

This was caused by commit 31cec53c65, back in October 2019, which broke AWS credential loading in the Ophan Housekeeper lambda. Without credentials, the lambda could neither delete entries from the ophan-alerts DynamoDB table nor post to the SNS topic.

Perhaps surprisingly, the Ophan Housekeeper lambda only needs those AWS credentials when it's dealing with a permanently bouncing email - so there was no obvious problem until a permanent bounce occurred (starting at 13:26 on February 12th 2020):

Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@1e730495: Unable to load credentials from service endpoint, com.amazonaws.auth.profile.ProfileCredentialsProvider@7d3a22a9: profile file cannot be null]: com.amazonaws.SdkClientException
com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@1e730495: Unable to load credentials from service endpoint, com.amazonaws.auth.profile.ProfileCredentialsProvider@7d3a22a9: profile file cannot be null]
    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
...
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4805)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4772)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeQuery(AmazonDynamoDBClient.java:2641)
    at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.query(AmazonDynamoDBClient.java:2607)
    at org.scanamo.ops.ScanamoSyncInterpreter.apply(ScanamoSyncInterpreter.scala:35)

As the lambda was broken, it wasn't able to decommission the relevant Ophan Alerts - so Trigr kept on sending them, and there were so many of them, bouncing permanently, that AWS placed our SES account under review.
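
To illustrate why the breakage stayed hidden for months, here is a minimal sketch of the failure mode, not the lambda's actual code. It assumes credentials are resolved lazily on first use, which is consistent with the stack trace above, where the error only surfaces inside a DynamoDB query. `AlertStore`, `Notification` and its subtypes, and `handle` are hypothetical names invented for this example.

```scala
import scala.util.Try

object LazyCredentialsSketch {
  // Hypothetical stand-in for a client whose credentials are resolved on first use.
  final class AlertStore(loadCredentials: () => String) {
    // Nothing is loaded (and nothing can fail) until a delete is first attempted.
    private lazy val credentials: String = loadCredentials()

    def deleteAlertsFor(email: String): String =
      s"deleted alerts for $email using $credentials"
  }

  // Hypothetical notification types; only a permanent bounce touches the store.
  sealed trait Notification
  final case class Delivery(email: String) extends Notification
  final case class TransientBounce(email: String) extends Notification
  final case class PermanentBounce(email: String) extends Notification

  def handle(store: AlertStore, notification: Notification): Try[String] = Try {
    notification match {
      case PermanentBounce(email) => store.deleteAlertsFor(email) // needs credentials
      case TransientBounce(email) => s"ignored transient bounce for $email"
      case Delivery(email)        => s"ignored delivery for $email"
    }
  }

  def main(args: Array[String]): Unit = {
    // Simulates the broken credential loading: it always throws when invoked.
    val brokenStore = new AlertStore(() =>
      throw new RuntimeException("Unable to load AWS credentials"))

    println(handle(brokenStore, TransientBounce("a@example.com")))
    println(handle(brokenStore, PermanentBounce("b@example.com")))
  }
}
```

With the broken loader, the transient-bounce call succeeds and only the permanent-bounce call fails, which mirrors how the lambda looked healthy until the first permanent bounce arrived.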

What next?

Once this fix has been merged, it should hopefully resolve the issue - but note that AWS wants us to follow up with them and let them know what we've done before they take us out of review:

Finally, contact us with answers to the following questions:

  • What caused your high bounce rate?
  • What changes have you made in your email-sending systems or processes?
  • How do these changes ensure that the issue won't occur again in the future?

We'll evaluate your responses to these questions. If we agree that your changes address this issue, we'll reset the metrics for your account, and end your review period or restore your account's ability to send email.

rtyley commented 4 years ago

Unfortunately this doesn't appear to have fully resolved the issue, though it may have improved things slightly. The credentials now appear to load, but the DynamoDB query fails:

https://github.com/guardian/ophan-housekeeper/blob/31cec53c65727e5e22e46a377bd156916ae88245/src/main/scala/housekeeper/AlertDeletion.scala#L29

User: arn:aws:sts::021353022223:assumed-role/Ophan-Housekeeper-ExecutionRole-BLAHBLAHBLAAH/Ophan-Housekeeper-Lambda-WOOWOOWOOO
is not authorized to perform: dynamodb:Query on resource: arn:aws:dynamodb:eu-west-1:021353022223:table/ophan-alerts
(Service: AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException; Request ID: GOJ21AFDH19VNCQDNRGNCHE1QNVV4KQNSO5AEMVJF66Q9ASUAAJG):
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException

Our CloudFormation template actually grants the Lambda very permissive permissions, so I'm not sure why this is occurring. I will investigate.

https://github.com/guardian/ophan-housekeeper/blob/31cec53c65727e5e22e46a377bd156916ae88245/cfn.yaml#L38-L43