radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

[Functional Test Flakiness] Improve Reliability of AWS Resource Validation #7994

Open kachawla opened 3 weeks ago

kachawla commented 3 weeks ago

Area for Improvement

AWS functional tests

Observed behavior

We have a functional test that creates an AWS S3 bucket and validates its existence with a get request. The test returns an error if the get call fails (including with a not-found response), and it always runs a cleanup step to delete the resources, even if a previous step returned an error. In a recent run, the test failed while validating the S3 bucket creation because the bucket could not be found. Logs:

cli.go:341: [rad]     radiusfunctionaltestbucket-aca3abad-f25b-49d3-be9f-bd1f7f543171 AWS.S3/Bucket       
    cli.go:341: [rad] 
    deployexecutor.go:84: finished deploying deploy testdata/aws-s3-bucket.bicep from file testdata/aws-s3-bucket.bicep
    rptest.go:393: finished running step 0 of 1: deploy testdata/aws-s3-bucket.bicep
    rptest.go:396: skipping validation of resources...
    rptest.go:411: validating output resources for deploy testdata/aws-s3-bucket.bicep
    aws.go:72: 
            Error Trace:    /home/runner/work/radius/radius/test/validation/aws.go:72
                                        /home/runner/work/radius/radius/test/rp/rptest.go:413
make: *** [build/test.mk:89: test-functional-corerp-cloud] Error 1
            Error:          Received unexpected error:
                            operation error CloudControl: GetResource, https response error StatusCode: 400, RequestID: b546bee7-401d-4efe-9bc8-7a9ecd041e27, ResourceNotFoundException: AWS::S3::Bucket Handler returned status FAILED: Bucket not found (HandlerErrorCode: NotFound, RequestToken: a64102cb-8ab2-407d-8330-e85842af474e)
            Test:           Test_AWS_S3Bucket/deploy_testdata/aws-s3-bucket.bicep
    --- FAIL: Test_AWS_S3Bucket/deploy_testdata/aws-s3-bucket.bicep (40.26s)

The test then failed again during the cleanup stage because it could not validate that the bucket had been deleted:

rptest.go:458: validating deletion of AWS resource for radiusfunctionaltestbucket-aca3abad-f25b-49d3-be9f-bd1f7f543171 (attempt 5/5)
    rptest.go:475: 
            Error Trace:    /home/runner/work/radius/radius/test/rp/rptest.go:475
            Error:          Should be true
            Test:           Test_AWS_S3Bucket
            Messages:       AWS resource radiusfunctionaltestbucket-aca3abad-f25b-49d3-be9f-bd1f7f543171 was present, should be not found

DONE 20 tests, 2 skipped, 2 failures in 151.163s

Link to functional test run: https://github.com/radius-project/radius/actions/runs/11243305588/job/31259088130

Desired behavior

The test should pass reliably, even when AWS resource creation or deletion is briefly delayed.

Proposed Fix

  1. The cleanup logs above show that the bucket was still present after the creation check had failed, so it was eventually created. Adding retries with backoff on the validation path here should help mitigate intermittent failures.

  2. The second failure happened because the bucket wasn't deleted within the time allocated for the test validation (5 attempts, per the logs above). We should consider increasing the max retry limit here.

  3. The backoff logic could be improved as well. Currently there is a fixed 10-second wait between retries. We could start with a shorter wait and increase it exponentially for cases where operations are delayed due to external issues. Since the buckets created for this test don't contain any objects, creation and deletion should be fairly quick in most cases.

rad Version

N/A

Operating system

N/A

Additional context

No response

AB#13454

radius-triage-bot[bot] commented 3 weeks ago

:wave: @kachawla Thanks for filing this issue.

A project maintainer will review this issue and get back to you soon.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

kachawla commented 3 weeks ago

Another test run where this error occurred on the bucket deletion validation path: https://github.com/radius-project/radius/actions/runs/11269904213/job/31339494783

radius-triage-bot[bot] commented 2 weeks ago

:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications; we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

sk593 commented 1 day ago

Seen again in a scheduled functional test run: https://github.com/radius-project/radius/actions/runs/11622220142/job/32367444436