o3de / sig-build

Open 3D Foundation Build Special Interest Group (SIG) documents

RFC - AR Redesign Proposal #77

Closed by brianherrera 1 year ago

brianherrera commented 1 year ago

Overview

SIG-Build is investigating changes to the AR (Automated Review) that prioritize lowering infrastructure costs while continuing to improve the developer experience and reduce build times.

The purpose of this document is to initiate a discussion with the TSC (Technical Steering Committee) and all SIGs regarding the current state of the AR and how it can be updated to lower our infrastructure costs. Most of the discussion around the proposed options will remain at a high level with the implementation details to be outlined later. 

What is the AR?

The AR (Automated Review) refers to a set of jobs that perform automated testing for pull requests submitted to O3DE. These tests are required to pass before PRs are merged. A SIG maintainer can run the AR on a pull request after they've performed their code review. 

Check as shown in PRs (under CI/jenkins/pr-merge):

ar_checks

Jenkins pipeline view (shown after clicking on the Details link):

jenkins-pipeline

AR Infrastructure

The AR jobs are orchestrated by Jenkins Pipeline and are defined using a Jenkinsfile stored in the O3DE repo: https://github.com/o3de/o3de/blob/development/scripts/build/Jenkins/Jenkinsfile.

The Jenkins server and its build nodes are hosted in AWS. The build nodes utilize EC2 instances for the hosts and EBS volumes for the build workspace. When a pull request is submitted, a pipeline is generated; however, build nodes are not created until an AR run is triggered. After an AR run starts, the required platform instances are spun up, builds and tests are executed, and the instances are terminated at the end of the run. Jenkins then reports the results back to the pull request. The EBS volume is retained to cache artifacts for the next run. A daily automated task deletes the EBS volumes for merged and inactive PRs.

Using this setup allows us to prioritize faster build times. However, to reduce our costs, onboard new platforms, and maintain our developer velocity, we are investigating the following options.

Options

SIG-Build is currently investigating the following options to reduce our infrastructure costs while still improving the PR experience for developers. 

1 - Migrate Jobs to GitHub Actions

The main reason to migrate to GitHub Actions is to take advantage of the lower-cost infrastructure and to allow contributors to easily run tests in their own forks.

We have already migrated quicker-running tests (Validation) to GitHub Actions. The next section reviews the challenges of migrating our builds and tests as well.

GitHub Standard Hosted Runners

GitHub Actions provides free usage for public repos using their standard hosted runners. The main trade-off for the lower cost is slower build times on GitHub's runners compared to our self-hosted 16-core machines. Changes to the core engine code will result in 4-5 hour build times.

Hardware specs for GitHub's standard hosted runners:

| Operating System | CPU | Memory | Storage |
| --- | --- | --- | --- |
| Windows/Linux | 2-core CPU (x86_64) | 7 GB | 14 GB SSD |
| MacOS | 3-core CPU (x86_64) | 14 GB | 14 GB SSD |

More detailed info on the hosted runners can be found here: GitHub-hosted Runners

Hard Drive Space Limitations

GitHub-hosted runners only provide 14GB of storage (mounted on /mnt) for all supported platforms. There is also a root volume with about 29GB of free space remaining out of 84GB total; the rest is used by the pre-installed environment.

Proposed workaround:

In the default configuration, the provided storage is not enough to support the O3DE build workspace. To increase the available space beyond 14GB, we need to uninstall unneeded software and utilize the free space on the root volume.

Note: This is not an officially supported workflow, but is a currently available workaround. 

This GitHub action is currently being tested: https://github.com/marketplace/actions/maximize-build-disk-space 

In addition to removing the pre-installed software, this GitHub action frees up space by performing the following steps (note: this is only required for Linux nodes; for other platforms we can simply move the workspace to the root volume):

  1. Combines the free space of the root volume and /mnt to an LVM volume group
  2. Creates a swap partition and a build volume on that volume group
  3. Mounts the volume back to ${GITHUB_WORKSPACE}

This results in a workspace with about 50GB of hard drive space. 
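For reference, a minimal sketch of how this action might be wired into a Linux workflow. The version pin and space values are illustrative, and the input names are taken from the marketplace action's README (easimon/maximize-build-space) and may change between versions; this is not a tested O3DE configuration.

```yaml
jobs:
  linux-build:
    runs-on: ubuntu-latest
    steps:
      # Reclaim space from pre-installed software and merge the root
      # volume's free space with /mnt before checking out the repo.
      - name: Maximize build space
        uses: easimon/maximize-build-space@v8   # version pin is illustrative
        with:
          root-reserve-mb: 2048      # keep some headroom on the root volume
          swap-size-mb: 4096         # swap partition created on the LVM group
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'

      # The merged volume is mounted at ${GITHUB_WORKSPACE}, so the
      # checkout and build both land on the larger volume.
      - name: Checkout
        uses: actions/checkout@v3
```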

Recommended Use-Cases

Below are some of the use-cases that will be compatible with GitHub's standard hosted runners. 

Code Checks

These are checks that run on the code base and don't require high CPU usage or large build artifacts, such as code style checks and static code analysis. A simple clone of the O3DE repo doesn't run into any of the space limitations.
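As an illustration, a lightweight check of this kind could run on the standard runners with a workflow along these lines. The job name and script path are placeholders, not existing O3DE checks.

```yaml
name: code-checks

on:
  pull_request:
    branches: [development]

jobs:
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Placeholder for a real check, e.g. a style or static analysis script.
      - name: Run code style check
        run: python scripts/check_code_style.py   # hypothetical script path
```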

Periodic Builds

Daily/Weekly builds that are not as time sensitive as AR builds would be a good use case for the standard runners.
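For example, a nightly build on the standard runners could be triggered on a schedule instead of on every PR. The cron expression and job contents below are illustrative.

```yaml
name: nightly-build

on:
  schedule:
    - cron: '0 8 * * *'    # once a day at 08:00 UTC (illustrative)
  workflow_dispatch:        # allow manual runs as well

jobs:
  nightly-linux:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Placeholder for the actual configure/build/test commands.
      - name: Build
        run: echo "configure and build O3DE here"
```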

GitHub Larger Runners

GitHub also provides larger-spec runners (up to 64 cores) that can be added to GitHub Actions workflows. The specs and billing rates are listed in GitHub's documentation for larger runners.

Comparison between GitHub Larger Runners and AWS EC2 Instances (using 16-core machines):

| Platform | AWS EC2 (Hourly) | GitHub Hosted Runner (Hourly) |
| --- | --- | --- |
| Windows (16-core) | $1.35 | $7.68 |
| Linux (16-core) | $0.62 | $3.84 |
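Selecting a larger runner in a workflow is only a matter of targeting the label configured for it in the organization's runner settings; the label below is hypothetical.

```yaml
jobs:
  windows-build:
    # 'windows-16-core' is a hypothetical label for a larger runner
    # configured under the organization's runner groups.
    runs-on: windows-16-core
    steps:
      - uses: actions/checkout@v3
```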

Linux Foundation Hosted Runners

This option would be similar to the GitHub runners listed above, but would take advantage of any available build infrastructure hosted by the Linux Foundation. SIG-Build will need to investigate whether such resources are available to O3DE.

2 - Reduce the Frequency of AR Runs

Another method to lower our costs is to reduce the number of AR runs required per PR. This option can lower costs with our current Jenkins infrastructure and can also be used to optimize the GitHub Actions workflow discussed earlier. The focus here would be to batch PRs using merge queues, separate tests into smaller jobs, and make certain tests optional.

Merge Queues

Using a merge queue reduces the total number of AR runs by batching approved PRs and testing the combined changes with a single AR run. If the AR passes, those PRs are merged together. GitHub recently released its merge queue feature in limited public beta, and there are also third-party solutions that are free for open source projects. The merge queue would reduce costs by reducing the total number of AR runs over time. However, with this strategy we need to implement mechanisms to address AR failures.
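With GitHub's native merge queue, the batched AR would be a workflow that also listens for the merge_group event so it runs against the queued, combined commit. This is a sketch; the job body is a placeholder.

```yaml
name: ar-merge-queue

on:
  pull_request:
    branches: [development]
  merge_group:              # runs against the batched merge queue commit

jobs:
  ar:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Placeholder for the actual AR build and test steps.
      - name: Run AR build and tests
        run: echo "build and test the merged batch here"
```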

Build Failures

One item SIG-Build is currently investigating is a mechanism to identify the owners of each build failure: https://github.com/o3de/sig-build/blob/main/rfcs/rfc-bld-20220623-1-build-failure-rca.md

This can also be used to identify the PR associated with the build failure, remove it from the queue, and restart the AR run. There are other strategies like bisecting the queue and rerunning the AR separately to identify the problematic PR, but this does come at an increased cost.

Another item we will need to address is unreliable tests that typically require re-running the AR in order to pass. In a merge queue setup, failures like these will cause unnecessary merge queue runs and frustrate developers.

Limit AR Runs (Opt-in)

This option is more of a workflow change and relies on maintainers to run jobs based on the type of changes (e.g. small bug fix, core engine changes, etc.). Contributors can also confirm they've run the provided tests in their fork.
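One way to express this opt-in behavior in GitHub Actions is to gate the heavier jobs on a PR label that a maintainer applies after review. The label name below is a placeholder.

```yaml
name: opt-in-ar

on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  full-ar:
    # Only run the expensive suite when a maintainer adds the
    # (hypothetical) 'run-full-ar' label to the pull request.
    if: contains(github.event.pull_request.labels.*.name, 'run-full-ar')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Full build and test
        run: echo "full AR build and test here"
```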

Split-up Testing Tasks

At the moment our test suites are executed in a single Jenkins job that typically takes 40-50 min to run for an incremental build and up to 2 hours for a clean build on a 16-core machine. Splitting up the tests will allow us to selectively run tests, either manually or automatically. This will also make it more feasible to run tests within the GitHub Actions workflow detailed above.
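Once the suites are split, a matrix job can fan them out as separate, individually re-runnable jobs. The suite names here are illustrative and do not reflect the actual O3DE suite layout.

```yaml
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false                    # let other suites finish even if one fails
      matrix:
        suite: [smoke, main, periodic]    # illustrative suite names
    steps:
      - uses: actions/checkout@v3
      # Each matrix entry becomes its own job running only one suite.
      - name: Run ${{ matrix.suite }} suite
        run: echo "run only the ${{ matrix.suite }} tests here"
```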

lmbr-pip commented 1 year ago

Thanks for raising this.

Has this been raised at the TSC? As this impacts all SIGs it needs to be widely advertised. All SIGs should be notified and we should be proactively reaching out to all SIGs IMHO.

brianherrera commented 1 year ago

Yes, this was brought up during the last TSC meeting. The initial draft is currently being reviewed within SIG-Build and other groups. We will also request feedback from all SIGs after this initial revision.

Kadino commented 1 year ago

When you are ready to solicit feedback, I recommend doing so with a pull request as the conversational features and versioning are superior to commenting on issues. We recently moved to this workflow over in SIG-Testing, and find it flows a bit better.

allisaurus commented 1 year ago

> There are on-going investigations to limit testing scope based on the changes submitted.

Does this just refer to Test Impact Analysis, or are there other lines of inquiry open? (also: would any of the suggestions in this RFC be mutually exclusive with TIAF?)

> (GH runner workaround) This results in a workspace with about 50GB of hard drive space.

Is that sufficient to build for all platforms?

brianherrera commented 1 year ago

> There are on-going investigations to limit testing scope based on the changes submitted.
>
> Does this just refer to Test Impact Analysis, or are there other lines of inquiry open? (also: would any of the suggestions in this RFC be mutually exclusive with TIAF?)
>
> (GH runner workaround) This results in a workspace with about 50GB of hard drive space.
>
> Is that sufficient to build for all platforms?

Yes, thanks for the link! I was planning to update this section to callout the TIAF efforts. The suggestions here should not conflict with TIAF, even running test jobs in parallel would still benefit from it. If there are further redesigns proposed to the testing setup this would require collaboration with sig-testing.

Kadino commented 1 year ago

> Split-up Testing Tasks: At the moment our test suites are executed in a single Jenkins job that typically takes 40-50 min to run for an incremental build and up to 2 hours for a clean build on a 16-core machine

This appears to be referencing build + assets + test, not the tests themselves. Tests should not require additional execution time between clean and incremental builds. If so, that sounds like a bug I'd like to investigate.

Can you provide the data that proves it takes over 2x longer for clean tests? (or does the data perhaps describe build + test?)

Edit: investigated locally and found that there was around an 80-second (9%) increase from the first time tests execute after cleaning caches. With the tests taking between 780-900 seconds to execute. This is at most 15 minutes, and appears consistent with Jenkins windows builds. Cut a bug to investigate what appears to be asset related issues causing the increase: https://github.com/o3de/o3de/issues/14689

brianherrera commented 1 year ago

Closing this issue. It will be replaced by this one https://github.com/o3de/sig-build/issues/81