radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

[Proposal] Improve functional test reliability #6726

Closed · youngbupark closed this issue 1 month ago

youngbupark commented 9 months ago

Problem

The functional tests are designed to validate major Radius features comprehensively, from end to end. The Radius testing infrastructure executes the functional tests with the following strategies:

| Type | Purpose | Test Sets | Cluster Type | Target Branch | Trigger |
| --- | --- | --- | --- | --- | --- |
| PR Gating | Functional correctness of new code changes | Functional Tests | Kind on GitHub Actions hosted VM | Contributor's PR branch | `ok-to-test` |
| Scheduled Test | Functional correctness | Functional Tests | Kind on GitHub Actions hosted VM | Main branch | Every 4 hours on weekdays, every 12 hours on weekends |
| Long Running Tests | Service reliability | Functional Tests | AKS cluster | Main branch | Every 2 hours |

So far, we have observed the following problems in the failed tests:

| Issue | Description | Resolution |
| --- | --- | --- |
| Build infrastructure issues | Network and GitHub Actions hosted VM failures | Beyond our control |
| External dependency issues | ARM/AWS endpoint and blob/container registry failures | Relieved by adding retries, but beyond our control |
| Conflicting cloud resource names | Test code tried to create a test resource with the same name while multiple functional tests were running | Resolved by randomizing resource names and leveraging shared resources |
| Race condition | Test code defect | Resolved |
| Resource creation limits | Test code created too many test resources | Resolved by a periodic clean-up workflow |
| Service unavailable or unexpected timeouts | A service pod was restarting when its resource usage hit the resource limit, and test code failed | Resolved CPU/memory leaks |

We have resolved most of these issues by randomizing resource names, adding retry logic, and enabling a test-resource purge workflow. However, we still see test failures caused by service-unavailable responses and timeouts from the applications-rp and UCP pods, even though the tests use only the localhost network.
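
For illustration, here is a minimal sketch of the first two mitigations; the helper names and the resource-name prefix are hypothetical and are not the actual Radius functional test framework API.

```go
// Minimal sketch of randomized resource names plus a retry wrapper.
// These helpers are illustrative only.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// randomResourceName appends a random suffix so that concurrent functional
// test runs never try to create a cloud resource with the same name.
func randomResourceName(prefix string) string {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	suffix := make([]byte, 8)
	for i := range suffix {
		suffix[i] = charset[rand.Intn(len(charset))]
	}
	return fmt.Sprintf("%s-%s", prefix, suffix)
}

// retry re-runs fn up to attempts times with a fixed backoff, masking
// transient failures from external dependencies (ARM/AWS endpoints,
// container registries).
func retry(attempts int, backoff time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	name := randomResourceName("corerp-resources-container")
	err := retry(3, 2*time.Second, func() error {
		// Placeholder for a call that hits an external dependency.
		fmt.Println("creating test resource", name)
		return errors.New("transient failure (placeholder)")
	})
	fmt.Println(err)
}
```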

Proposal

We propose to leverage the long-running tests and investigate their service metrics to resolve the flakiness of the functional tests that relates to service reliability rather than functional correctness.

Based on prior investigation, when the test driver code gets a service-unavailable response or a timeout, it typically stems from one of two scenarios:

  1. The API server of the Kind cluster becomes unresponsive due to the limited resources of the build VM.
  2. A pod restart occurs when the resource usage of applications-rp exceeds the pod's resource limits.

While adding retry logic to the test driver code might provide a short-term mitigation, it fails to address the underlying long-term issues. To thoroughly investigate them, this proposal recommends re-enabling the long-running tests and monitoring system metrics from the Radius services before making any changes to our code.
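
Before changing any test or service code, a small diagnostic along the lines of the sketch below could confirm whether pod restarts and OOM kills line up with the observed failures. It assumes standard client-go, a local kubeconfig, and the default radius-system namespace; it is not part of the existing test framework.

```go
// Diagnostic sketch (illustrative): report restart counts and last
// termination reasons for the Radius control-plane pods.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig in the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "radius-system" is the default namespace for the Radius control plane.
	pods, err := clientset.CoreV1().Pods("radius-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			reason := ""
			if cs.LastTerminationState.Terminated != nil {
				// e.g. "OOMKilled" when the container exceeded its memory limit.
				reason = cs.LastTerminationState.Terminated.Reason
			}
			fmt.Printf("%s/%s restarts=%d lastTermination=%q\n",
				pod.Name, cs.Name, cs.RestartCount, reason)
		}
	}
}
```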

Although we previously resolved all known memory leak issues in the Radius services, we have since introduced several new features, including Terraform recipes, which could potentially introduce unknown resource leaks. For instance, deploying multiple Terraform recipes in a test spawns multiple new Terraform CLI processes inside the container, which can lead to unexpected pod restarts.
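
To make that concern concrete, the sketch below (illustrative only, not the actual Radius Terraform recipe driver) shows how each recipe deployment effectively runs its own terraform CLI process in a scratch directory; skipping the wait or the cleanup would let processes and disk usage accumulate inside the container across deployments.

```go
// Illustrative sketch of a per-deployment terraform invocation. This is not
// Radius code; it only demonstrates where process and disk leaks could occur.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func applyRecipe(workdir string) error {
	// Remove the per-deployment working directory when the apply finishes;
	// forgetting this leaks disk space on every recipe deployment.
	defer os.RemoveAll(workdir)

	cmd := exec.Command("terraform", "apply", "-auto-approve")
	cmd.Dir = workdir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Run starts the terraform process and waits for it to exit; starting it
	// without waiting would leave orphaned terraform processes behind.
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("terraform apply failed: %w", err)
	}
	return nil
}

func main() {
	workdir, err := os.MkdirTemp("", "recipe-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if err := applyRecipe(workdir); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```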

Instead of making ad-hoc changes, this proposal recommends finding the actual root cause from the metrics and logs first, via the following action items:

AB#10296

radius-triage-bot[bot] commented 9 months ago

:wave: @youngbupark Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process please visit our triage overview

radius-triage-bot[bot] commented 9 months ago

:wave: @youngbupark Thanks for filing this issue.

A project maintainer will review this issue and get back to you soon.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

radius-triage-bot[bot] commented 9 months ago

We've prioritized work on this issue. Please subscribe to this issue for notifications, we'll provide updates as we make progress.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

youngbupark commented 9 months ago

Due to the private cluster restriction, we cannot use the GitHub Actions hosted pool for the long-running tests, which access the Kubernetes API server. So we need to install a self-hosted runner inside the cluster, and the long-running test workflow should use this self-hosted runner.

Action items:

  1. Create an AKS cluster in the portal with an arc pool for the self-hosted runner agent and a private cluster configuration
  2. Export the ARM template in the portal before deploying the AKS cluster
  3. Deploy the AKS cluster
  4. Install the Actions Runner Controller (ARC) on the arc pool - https://github.com/actions/actions-runner-controller/tree/master

    We need to restrict the runner agent installation to the arc pool only.

  5. Configure a forked radius repo to use the self-hosted runner in the AKS cluster (for testing purposes)
  6. Validate that the self-hosted runner works as expected by updating the long-running test workflow - https://github.com/youngbupark/radius/blob/main/.github/workflows/long-running-azure.yaml
  7. Update the Bicep template with the latest changes by referring to the ARM template from step 2
  8. Validate the Bicep template

Production

  1. Delete the long-running test resource group
  2. Redeploy using the Bicep template
  3. Configure the self-hosted runner in radius-project/radius
  4. Update the workflow
  5. Load the existing Grafana dashboard into the new Grafana instance
  6. Set up Log Analytics

radius-triage-bot[bot] commented 9 months ago

:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

youngbupark commented 8 months ago

@ytimocin @shalabhms what's the progress of this work?

ytimocin commented 3 months ago

I think this is fully done @sylvainsf @nicolejms

ytimocin commented 1 month ago

I will also mark this as complete.