hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: BucketAlreadyOwnedByYou when creating a bucket with a new unique name, and the bucket is leaked #29028

Open Veetaha opened 1 year ago

Veetaha commented 1 year ago

Terraform Core Version

1.3.7

AWS Provider Version

4.17.1

Affected Resource(s)

aws_s3_bucket

Expected Behavior

The S3 bucket should be created successfully when its name is unique and has never been used before.

Actual Behavior

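Terraform intermittently fails to create the bucket with a BucketAlreadyOwnedByYou error. The bucket is nevertheless created in AWS and left behind (leaked) outside of Terraform's state.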

Relevant Error/Panic Output Snippet

╷
  │ Error: error creating S3 Bucket (elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt): BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
  │     status code: 409, request id: {redacted}, host id: {redacted}
  │ 
  │   with module.jobs_status.aws_s3_bucket.jobs_status_attachments,
  │   on ../../../modules/jobs_status/main.tf line 128, in resource "aws_s3_bucket" "jobs_status_attachments":
  │  128: resource "aws_s3_bucket" "jobs_status_attachments" {
  │ 
  ╵

Terraform Configuration Files

resource "aws_s3_bucket" "jobs_status_attachments" {
  bucket        = "job-attachments"
  force_destroy = true

  tags = { "elastio:resource" = "true", /* and ~10 more wellformed tags with non-empty values here */ }
}

Steps to Reproduce

There isn't a stable reproduction. It happens randomly and rarely during terraform apply. I suppose if you run terraform apply a bazillion times, you may be lucky enough to catch this error. More on "bazillion" below.

We extensively run terraform apply on CI in our tests, and it creates buckets. Each test generates a unique bucket name, so reusing a bucket name should not be a problem. I also checked the regions where we deploy our buckets with the unique names (you can get an idea of our bucket naming pattern from the error message, but that's not very relevant). No region has seen a bucket created with the name from the error message for at least the last 3 months. I'd say it's statistically improbable that we could ever reuse a bucket name during our deployments.

However, I found 4 buckets leaked due to this error on the following dates:

I don't know why this error started appearing more often over the last week, but it has started to hurt us, so we are reporting it. I doubt that our code is doing something wrong, because the CloudTrail logs don't indicate that.

Debug Output

Unfortunately, we don't run terraform with TF_LOG=debug, so we don't have debug logs.

Panic Output

No response

Important Factoids

During my investigation of the problem I took a look at the CloudTrail logs and found that each time we get this BucketAlreadyOwnedByYou error, there are two CreateBucket API calls made by terraform within the same second.

(image: screenshot of the CloudTrail event history for this bucket name)

You can see that the selected time range is "last 3 months", which shows that we never used this bucket name before. I also checked the other regions where we do deployments, and they have zero CloudTrail events with this bucket name.

I am entirely sure that both calls come from the same terraform process on the same machine, so we can exclude concurrent deployments of the same stack from the potential causes.

Unfortunately, CloudTrail records event times with a granularity of 1 second, so I can't tell for sure which of the two API calls was made first, but I think the order is obvious from the CloudTrail events themselves. I am pasting them in the order in which I guess they happened:

{
  "eventVersion": "1.08",
  "userIdentity": {
      "type": "IAMUser",
      "principalId": "{REDACTED}",
      "arn": "arn:aws:iam::{REDACTED}:user/deployer",
      "accountId": "{REDACTED}",
      "accessKeyId": "{REDACTED}",
      "userName": "deployer"
  },
  "eventTime": "2023-01-21T18:57:54Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "CreateBucket",
  "awsRegion": "us-west-2",
  "sourceIPAddress": "{REDACTED}",
  "userAgent": "[APN/1.0 HashiCorp/1.0 Terraform/1.3.7 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.25 (go1.17.6; linux; amd64)]",
  "errorCode": "OperationAborted",
  "errorMessage": "A conflicting conditional operation is currently in progress against this resource. Please try again.",
  "requestParameters": {
      "CreateBucketConfiguration": {
          "LocationConstraint": "us-west-2",
          "xmlns": "http://s3.amazonaws.com/doc/2006-03-01/"
      },
      "bucketName": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt",
      "Host": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt.s3.us-west-2.amazonaws.com",
      "x-amz-acl": "private"
  },
  "responseElements": null,
  "additionalEventData": {
      "SignatureVersion": "SigV4",
      "CipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
      "bytesTransferredIn": 153,
      "AuthenticationMethod": "AuthHeader",
      "x-amz-id-2": "{REDACTED}",
      "bytesTransferredOut": 347
  },
  "requestID": "{REDACTED}",
  "eventID": "{REDACTED}",
  "readOnly": false,
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "recipientAccountId": "{REDACTED}",
  "eventCategory": "Management",
  "tlsDetails": {
      "tlsVersion": "TLSv1.2",
      "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
      "clientProvidedHostHeader": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt.s3.us-west-2.amazonaws.com"
  }
}

For some reason, this first API call got an OperationAborted error complaining about a conflicting concurrent operation on the bucket. However, the call did successfully create the bucket (I will show that a bit later below), and therefore the second API call fails with the BucketAlreadyOwnedByYou error:

{
  "eventVersion": "1.08",
  "userIdentity": {
      "type": "IAMUser",
      "principalId": "{REDACTED}",
      "arn": "arn:aws:iam::{REDACTED}:user/deployer",
      "accountId": "{REDACTED}",
      "accessKeyId": "{REDACTED}",
      "userName": "deployer"
  },
  "eventTime": "2023-01-21T18:57:54Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "CreateBucket",
  "awsRegion": "us-west-2",
  "sourceIPAddress": "{REDACTED}",
  "userAgent": "[APN/1.0 HashiCorp/1.0 Terraform/1.3.7 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.25 (go1.17.6; linux; amd64)]",
  "errorCode": "BucketAlreadyOwnedByYou",
  "errorMessage": "Your previous request to create the named bucket succeeded and you already own it.",
  "requestParameters": {
      "CreateBucketConfiguration": {
          "LocationConstraint": "us-west-2",
          "xmlns": "http://s3.amazonaws.com/doc/2006-03-01/"
      },
      "bucketName": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt",
      "Host": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt.s3.us-west-2.amazonaws.com",
      "x-amz-acl": "private"
  },
  "responseElements": null,
  "additionalEventData": {
      "SignatureVersion": "SigV4",
      "CipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
      "bytesTransferredIn": 153,
      "AuthenticationMethod": "AuthHeader",
      "x-amz-id-2": "{REDACTED}",
      "bytesTransferredOut": 415
  },
  "requestID": "{REDACTED}",
  "eventID": "{REDACTED}",
  "readOnly": false,
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "recipientAccountId": "{REDACTED}",
  "eventCategory": "Management",
  "tlsDetails": {
      "tlsVersion": "TLSv1.2",
      "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
      "clientProvidedHostHeader": "elastio-xmkewe-exp-1674349053-job-attachments-fycop1rjt.s3.us-west-2.amazonaws.com"
  }
}

So the first API call did succeed in creating the bucket, even though it returned an OperationAborted error. The fact that the bucket exists (although it has no tags) and that its creation date is one second after the API calls described above supports this conclusion:

(image: screenshot of the leaked bucket and its creation date)

I think this may be AWS's fault: they actually succeed at creating the bucket but return an error to us. However, AWS has a ton of bugs in their APIs, and I think terraform should be able to work around them on behalf of all of its users. I think the workaround on terraform's side would be to detect this particular OperationAborted error and, when it is returned, check whether the bucket was created. If it was, then the bucket should be considered created successfully.

References

No response

Would you like to implement a fix?

No

Veetaha commented 1 year ago

Today (2023-07-04) I hit the same bucket creation failure in us-west-2.

AlexandreGohier commented 7 months ago

If you face this issue, check your CloudTrail logs: you may be creating another resource in your Terraform code that automatically creates the S3 bucket for you.

I ran into this today while creating an aws_athena_database resource alongside an aws_s3_bucket resource to store the Athena results. Based on the CloudTrail logs, Terraform created the Athena database first. Then, I guess, the underlying AWS API detected that the S3 bucket I specified for the Athena results did not exist and created it for me (the first CreateBucket event has this source: "invokedBy": "athena.amazonaws.com").

So a few milliseconds later, when Terraform tried to create the bucket, AWS said I already owned it and the apply failed. I then needed to import the bucket into my state and apply again to finish configuring it...
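For anyone who needs to do that recovery step, here is a rough sketch of adopting the already-created bucket with an import block. It assumes Terraform 1.5 or later (import blocks are not available in 1.3.7, the version in the original report; on older versions the `terraform import` CLI command plays the same role). The bucket name is a hypothetical placeholder, not the one from my setup.

```hcl
# Sketch only: adopt a bucket that already exists in AWS into Terraform state.
# Requires Terraform >= 1.5. "example-athena-results-bucket" is a placeholder.
import {
  to = aws_s3_bucket.athena-bucket
  id = "example-athena-results-bucket" # for aws_s3_bucket, the import ID is the bucket name
}

resource "aws_s3_bucket" "athena-bucket" {
  bucket = "example-athena-results-bucket"
}
```

After a successful apply the import block can be removed, since the bucket is then tracked in state.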

In my case, I fixed it by adding depends_on = [aws_s3_bucket.athena-bucket] to my aws_athena_database resource.
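A minimal sketch of what that looks like, assuming the bucket name is passed to Athena as a plain string (which is why Terraform sees no implicit dependency between the two resources). The database name and bucket name below are hypothetical placeholders.

```hcl
# Sketch only: names are placeholders, not the real configuration.
resource "aws_s3_bucket" "athena-bucket" {
  bucket = "example-athena-results-bucket"
}

resource "aws_athena_database" "example" {
  name   = "example_database"
  bucket = "example-athena-results-bucket" # plain string, so no implicit dependency

  # Force Terraform to create the bucket before the database, so the Athena
  # API never finds the bucket missing and creates it on its own.
  depends_on = [aws_s3_bucket.athena-bucket]
}
```

Setting bucket = aws_s3_bucket.athena-bucket.id instead of a string literal should give the same ordering through an implicit dependency, without the explicit depends_on.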