radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

[Proposal] Improve functional test reliability #6726

Closed · youngbupark closed this issue 1 month ago

youngbupark commented 9 months ago

Problem

The functional tests are designed to validate major Radius features comprehensively, from end to end. The Radius testing infrastructure executes the functional tests with the following strategies:

| Type | Purpose | Test Sets | Cluster Type | Target Branch | Trigger |
| --- | --- | --- | --- | --- | --- |
| PR Gating | Functional correctness of new code changes | Functional Tests | Kind on GitHub Actions hosted VM | Contributor's PR branch | `ok-to-test` |
| Scheduled Test | Functional correctness | Functional Tests | Kind on GitHub Actions hosted VM | Main branch | Every 4 hours on weekdays, every 12 hours on weekends |
| Long Running Tests | Service reliability | Functional Tests | AKS cluster | Main branch | Every 2 hours |

So far, we have observed the following problems in the failed tests:

| Issue | Description | Resolution |
| --- | --- | --- |
| Build infrastructure issues | Network and GitHub Actions hosted VM failures | Beyond our control |
| External dependency issues | ARM/AWS endpoint and blob/container registry failures | Relieved by adding retries, but beyond our control |
| Conflicting cloud resource names | Test code tried to create a test resource with the same name while multiple functional tests were running | Resolved by randomizing resource names and leveraging shared resources |
| Race condition | Test code defect | Resolved |
| Resource creation limits | Test code created too many test resources | Resolved by a periodic clean-up workflow |
| Service unavailable or unexpected timeouts | A service pod was restarting when its resource usage hit the resource limit, and test code failed | Resolved CPU/memory leaks |

We have resolved most of these issues by randomizing resource names, adding retry logic, and enabling a test-resource purge workflow. However, we still see test failures caused by service-unavailable responses and timeouts from the applications-rp and UCP pods, even though the tests use only the localhost network.
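
For illustration, here is a minimal sketch of the first two mitigations; the helper names and the resource-name prefix are hypothetical and are not the actual Radius functional test framework API.

```go
// Minimal sketch of randomized resource names plus a retry wrapper.
// These helpers are illustrative only.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// randomResourceName appends a random suffix so that concurrent functional
// test runs never try to create a cloud resource with the same name.
func randomResourceName(prefix string) string {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	suffix := make([]byte, 8)
	for i := range suffix {
		suffix[i] = charset[rand.Intn(len(charset))]
	}
	return fmt.Sprintf("%s-%s", prefix, suffix)
}

// retry re-runs fn up to attempts times with a fixed backoff, masking
// transient failures from external dependencies (ARM/AWS endpoints,
// container registries).
func retry(attempts int, backoff time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	name := randomResourceName("corerp-resources-container")
	err := retry(3, 2*time.Second, func() error {
		// Placeholder for a call that hits an external dependency.
		fmt.Println("creating test resource", name)
		return errors.New("transient failure (placeholder)")
	})
	fmt.Println(err)
}
```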

Proposal

We propose to leverage the long-running tests and investigate their service metrics to resolve the flakiness of the functional tests that relates to service reliability rather than functional correctness.

Based on prior investigation, when the test driver code gets a service-unavailable response or a timeout, it typically stems from one of two scenarios:

  1. The API server of the Kind cluster becomes unresponsive due to the limited resources of the build VM.
  2. A pod restart occurs when the resource usage of applications-rp exceeds the pod's resource limits.

While adding retry logic to the test driver code might provide a short-term mitigation, it fails to address the underlying long-term issues. To thoroughly investigate them, this proposal recommends re-enabling the long-running tests and monitoring system metrics from the Radius services before making any changes to our code.
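
Before changing any test or service code, a small diagnostic along the lines of the sketch below could confirm whether pod restarts and OOM kills line up with the observed failures. It assumes standard client-go, a local kubeconfig, and the default radius-system namespace; it is not part of the existing test framework.

```go
// Diagnostic sketch (illustrative): report restart counts and last
// termination reasons for the Radius control-plane pods.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig in the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "radius-system" is the default namespace for the Radius control plane.
	pods, err := clientset.CoreV1().Pods("radius-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			reason := ""
			if cs.LastTerminationState.Terminated != nil {
				// e.g. "OOMKilled" when the container exceeded its memory limit.
				reason = cs.LastTerminationState.Terminated.Reason
			}
			fmt.Printf("%s/%s restarts=%d lastTermination=%q\n",
				pod.Name, cs.Name, cs.RestartCount, reason)
		}
	}
}
```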

Although we previously resolved all known memory leak issues in the Radius services, we have since introduced several new features, including Terraform recipes, which could potentially introduce unknown resource leaks. For instance, deploying multiple Terraform recipes in a test spawns multiple new Terraform CLI processes inside the container, which can lead to unexpected pod restarts.
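
To make that concern concrete, the sketch below (illustrative only, not the actual Radius Terraform recipe driver) shows how each recipe deployment effectively runs its own terraform CLI process in a scratch directory; skipping the wait or the cleanup would let processes and disk usage accumulate inside the container across deployments.

```go
// Illustrative sketch of a per-deployment terraform invocation. This is not
// Radius code; it only demonstrates where process and disk leaks could occur.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func applyRecipe(workdir string) error {
	// Remove the per-deployment working directory when the apply finishes;
	// forgetting this leaks disk space on every recipe deployment.
	defer os.RemoveAll(workdir)

	cmd := exec.Command("terraform", "apply", "-auto-approve")
	cmd.Dir = workdir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Run starts the terraform process and waits for it to exit; starting it
	// without waiting would leave orphaned terraform processes behind.
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("terraform apply failed: %w", err)
	}
	return nil
}

func main() {
	workdir, err := os.MkdirTemp("", "recipe-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if err := applyRecipe(workdir); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```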

Instead of making ad-hoc changes, this proposal recommends finding the actual root cause from the metrics and logs first, via the following action items:

AB#10296

radius-triage-bot[bot] commented 9 months ago

:wave: @youngbupark Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process please visit our triage overview

radius-triage-bot[bot] commented 9 months ago

:wave: @youngbupark Thanks for filing this issue.

A project maintainer will review this issue and get back to you soon.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

radius-triage-bot[bot] commented 9 months ago

We've prioritized work on this issue. Please subscribe to this issue for notifications, we'll provide updates as we make progress.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

youngbupark commented 9 months ago

Due to the private cluster restriction, we cannot use the GitHub Actions hosted pool for the long-running tests, which access the Kubernetes API server. So we need to install a self-hosted runner inside the cluster, and the long-running test workflow should use this self-hosted runner.

Action items:

  1. Create an AKS cluster in the portal with an arc pool for the self-hosted runner agent and a private cluster configuration
  2. Export the ARM template in the portal before deploying the AKS cluster
  3. Deploy the AKS cluster
  4. Install the Actions Runner Controller (ARC) on the arc pool - https://github.com/actions/actions-runner-controller/tree/master

    We need to restrict the runner agent installation to the arc pool only.

  5. Configure a forked radius repo to use the self-hosted runner in the AKS cluster (for testing purposes)
  6. Validate that the self-hosted runner works as expected by updating the long-running test workflow - https://github.com/youngbupark/radius/blob/main/.github/workflows/long-running-azure.yaml
  7. Update the Bicep template with the latest changes by referring to the ARM template from step 2
  8. Validate the Bicep template

Production

  1. Delete the long-running test resource group
  2. Redeploy using the Bicep template
  3. Configure the self-hosted runner in radius-project/radius
  4. Update the workflow
  5. Load the existing Grafana dashboard into the new Grafana instance
  6. Set up Log Analytics

radius-triage-bot[bot] commented 9 months ago

:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

youngbupark commented 8 months ago

@ytimocin @shalabhms what's the progress of this work?

ytimocin commented 3 months ago

I think this is fully done @sylvainsf @nicolejms

ytimocin commented 1 month ago

I will also mark this as complete.