Closed youngbupark closed 1 month ago
:wave: @youngbupark Thanks for filing this bug report.
A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.
For more information on our triage process please visit our triage overview
:wave: @youngbupark Thanks for filing this issue.
A project maintainer will review this issue and get back to you soon.
We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.
We've prioritized work on this issue. Please subscribe to this issue for notifications, we'll provide updates as we make progress.
Due to the private cluster restriction, we cannot use the GitHub Actions hosted pool for long-running tests that access the K8s API server. So we need to install a self-hosted runner inside the cluster, and the long-running test workflow should use this self-hosted runner.
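As a sketch, the long-running workflow can be pinned to the in-cluster self-hosted runner via runner labels. The labels, schedule, and make target below are assumptions for illustration, not the actual workflow definition:

```yaml
# Hypothetical long-running test job pinned to the in-cluster runner pool.
name: long-running-tests
on:
  schedule:
    - cron: "0 */4 * * *"   # example cadence: every 4 hours
jobs:
  functional-tests:
    # "self-hosted" plus an assumed pool label keeps this job off the
    # GitHub-hosted runners, which cannot reach the private API server.
    runs-on: [self-hosted, arc]
    steps:
      - uses: actions/checkout@v4
      - name: Run long-running functional tests
        run: make test-functional-all   # assumed make target
```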
We need to restrict the agent install to only the arc pool.
:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.
@ytimocin @shalabhms what's the progress of this work?
I think this is fully done @sylvainsf @nicolejms
I will also mark this as complete.
Problem
The functional tests are designed to comprehensively validate major Radius features end to end. The Radius testing infrastructure executes the functional tests with the following strategies:
ok-to-test
So far, we have observed the following problems in the failed tests:
We have resolved most of these issues by implementing randomized resource names, adding retry logic, and enabling a test resource purge workflow. However, we still see test failures caused by service-unavailable responses and timeouts from the applications-rp and UCP pods, even though the tests use only the localhost network.
Proposal
We propose to leverage the long-running tests and investigate their service metrics to resolve the flakiness of the functional tests, which is related to reliability rather than functional correctness.
Based on prior investigation, when the test driver code gets a service-unavailable response or times out, the failure typically stems from two scenarios:
While adding retry logic in the test driver code might provide a short-term mitigation, it fails to address the underlying long-term issues. To thoroughly investigate these issues, this proposal recommends re-enabling the long-running tests and monitoring system metrics from the Radius services before making any changes in our code.
Although we previously resolved all known memory-leak issues in the Radius services, we have since introduced several new features, including Terraform recipes, which could potentially lead to unknown resource leaks. For instance, deploying multiple Terraform recipes in a test spawns multiple new Terraform CLI processes inside the container, which can lead to unexpected Pod restarts.
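Assuming the cluster scrapes kube-state-metrics and cAdvisor, queries along these lines could surface the suspected restarts and leaks. The metric names are the standard ones from those exporters, but the namespace and pod labels are assumptions based on a default Radius install:

```promql
# Container restarts for Radius pods (kube-state-metrics)
kube_pod_container_status_restarts_total{namespace="radius-system"}

# Working-set memory of applications-rp over time (cAdvisor)
container_memory_working_set_bytes{namespace="radius-system", pod=~"applications-rp.*"}
```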
Instead of making ad-hoc changes, this proposal recommends finding the actual root cause from the metrics and logs first, via the following action items:
Scale up the applications-rp resource or increase replicas of applications-rp from 1 to 2+. AB#10296
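If the metrics point to resource exhaustion, the replica bump above can be tried with a plain kubectl command. The namespace and deployment name are assumed from a default Radius install; an environment-specific command, not part of the original proposal:

```shell
# Scale applications-rp from 1 to 2 replicas (assumes the default
# radius-system namespace and deployment name).
kubectl scale deployment/applications-rp --namespace radius-system --replicas=2
```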