radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

Flaky test: Test_AWS_S3Bucket_Existing #7996

Open kachawla opened 1 month ago

kachawla commented 1 month ago

Steps to reproduce

The test Test_AWS_S3Bucket_Existing failed during a scheduled run: https://github.com/radius-project/radius/actions/runs/11289113473/job/31398477292. Not sure how to reproduce it, but we should look into the code path that creates AWS resources via UCP and re-evaluate whether we are performing a read-after-write operation that needs to be made more resilient through retries.

Observed behavior

Test_AWS_S3Bucket_Existing failed while executing its first step, which deploys a Bicep file (aws-s3-bucket.bicep) to create a new S3 bucket. The error returned was bucket NotFound, which doesn't make sense since the operation is creating the resource.

Error logs:

"target": "/planes/aws/aws/accounts/***/regions/us-west-2/providers/AWS.S3/Bucket/radiusfunctionaltestbucket-add9d5d6-80c6-4683-98c9-d1f914a9b272"
    cli.go:341: [rad]     }
    cli.go:341: [rad]   ]
    cli.go:341: [rad] }
    cli.go:341: [rad] 
    cli.go:341: [rad] TraceId:  dc92927587eb7f4b6c4adaadc0b85914
    cli.go:341: [rad] 
    cli.go:341: [rad] 
    deployexecutor.go:83: 
            Error Trace:    /home/runner/work/radius/radius/test/step/deployexecutor.go:83
                                        /home/runner/work/radius/radius/test/rp/rptest.go:392
            Error:          Received unexpected error:
                            code DeploymentFailed: err At least one resource deployment operation failed. Please see the details for the specific operation that failed.
            Test:           Test_AWS_S3Bucket_Existing/deploy_testdata/aws-s3-bucket.bicep
            Messages:       failed to deploy deploy testdata/aws-s3-bucket.bicep
    --- FAIL: Test_AWS_S3Bucket_Existing/deploy_testdata/aws-s3-bucket.bicep (45.83s)

Test run and its artifacts: https://github.com/radius-project/radius/actions/runs/11289113473

Desired behavior

The test should always pass.

Workaround

It passed on subsequent runs.

rad Version

N/A

Operating system

No response

Additional context

No response

AB#13463

radius-triage-bot[bot] commented 1 month ago

:wave: @kachawla Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process please visit our triage overview

kachawla commented 1 month ago

The same issue occurred in another test failure: https://github.com/radius-project/radius/actions/runs/11303270081/job/31440186010. This one is for Test_AWSRedeployWithUpdatedResourceUpdatesResource, which also creates an S3 bucket.

radius-triage-bot[bot] commented 1 month ago

:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

lakshmimsft commented 3 weeks ago

Tracking: Additional incident with the same or a similar error: https://github.com/radius-project/radius/actions/runs/11470933017/job/31921112585. Frequency: occurred once during 1 week.

willdavsmith commented 2 weeks ago

I looked into the AWS CloudTrail logs for a bucket that failed to delete and it looks like AWS is not receiving a DeleteBucket request at all. I suspect we might be getting a 404 from AWS when we call DeleteResource, but we could add some logs to confirm: https://github.com/radius-project/radius/blob/3492b53d0270fe946cded23e8805d915bcb1c29a/test/validation/aws.go#L99

kachawla commented 2 hours ago

@willdavsmith The test failed while creating the bucket, so we aren't expecting a DeleteBucket request. Were there any logs for CreateBucket? Also, can you please add a TSG (troubleshooting guide) for investigating AWS issues through CloudTrail logs?