
Migrate jobs off current GCP GHA runner cluster #18238

Open · ScottTodd opened 4 weeks ago

ScottTodd commented 4 weeks ago

Following the work at https://github.com/iree-org/iree/issues/17957 and https://github.com/iree-org/iree/issues/16203, it is just about time to migrate away from the GitHub Actions runners hosted on Google Cloud Platform.

Workflow refactoring tasks

Refactor workflows such that they don't depend on GCP:

Runner setup tasks

Transition tasks

Switch all jobs that need a self-hosted runner to the new runners

Other

ScottTodd commented 4 weeks ago

Experiments are showing that a local ccache backed by the GitHub Actions cache is going to be nowhere near functional for some of the current CI builds. Maybe I have something misconfigured, but I'm seeing cache sizes of up to 2GB still not being enough for Debug or ASan jobs. I can try running with no cache limit to see what that produces, but GitHub's soft limit of 10GB across all cache entries (before it starts evicting entries) will trigger very frequently if we have too many jobs using unique cache keys.
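
For context, a minimal sketch of the kind of setup being experimented with, assuming actions/cache plus a ccache size cap; the job name, cache key, and 2G limit below are illustrative placeholders, not the actual configuration:

```yaml
# Hypothetical workflow excerpt: local ccache backed by the GitHub Actions cache.
jobs:
  linux_x64_clang_debug:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Restore ccache
        uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}/.ccache
          key: ccache-linux-x64-clang-debug-${{ github.sha }}
          restore-keys: ccache-linux-x64-clang-debug-
      - name: Configure ccache
        run: |
          echo "CCACHE_DIR=${{ github.workspace }}/.ccache" >> "$GITHUB_ENV"
          echo "CCACHE_MAXSIZE=2G" >> "$GITHUB_ENV"
      # Build steps would pass -DCMAKE_C_COMPILER_LAUNCHER=ccache and
      # -DCMAKE_CXX_COMPILER_LAUNCHER=ccache to CMake.
```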

amd-chrissosa commented 3 weeks ago

Experiments so far:

I have gone through https://github.com/actions/actions-runner-controller and given it a try through a basic POC, but many things still aren't working yet.

To replicate what I've done so far:

These all work fairly well out of the box. A few suggestions:

Currently blocked on getting images working. I'm going to keep working on this but may pull someone in to help at this point, since the k8s part is at least figured out.
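
For anyone replicating the POC, here is a minimal sketch of the legacy-CRD flavor of an ARC runner pool; the names, label, and image below are hypothetical, and the newer gha-runner-scale-set mode is configured through Helm chart values instead of hand-written CRDs:

```yaml
# Hypothetical actions-runner-controller RunnerDeployment (legacy summerwind CRDs).
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iree-cpubuilder-runners   # placeholder name
spec:
  replicas: 2
  template:
    spec:
      organization: iree-org
      labels:
        - new-cluster-cpubuilder  # placeholder label to target via runs-on:
      # A custom runner image would slot in here once images are sorted out:
      # image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64
```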

ScottTodd commented 3 weeks ago

I created https://github.com/iree-org/base-docker-images and am working to migrate what's left in https://github.com/iree-org/iree/tree/main/build_tools/docker to that repo. Starting with a few workflows that don't have special GCP requirements right now like https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml.

Local testing of https://github.com/iree-org/base-docker-images/pull/4 looks promising to replace gcr.io/iree-oss/base with a new ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64 (or we can just put ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64 on the cluster for those builds, instead of using Docker inside Docker).
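
For illustration only, a hedged sketch of pointing a job at the new image via the `container:` key (the job name, tag, and runner choice are placeholders):

```yaml
# Hypothetical job excerpt: run the existing build steps inside the new image.
jobs:
  linux_x64_clang_debug:
    runs-on: ubuntu-22.04   # or a label from the new runner cluster
    container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
    steps:
      - uses: actions/checkout@v4
      # ... existing CMake/ctest steps unchanged ...
```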

We could also try using the manylinux image, but I'm not sure we should expect it to work well enough with the base C++ toolchains outside of Python packaging. I gave that a try locally too but got errors like:

```
# python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt
WARNING: Running pip install with root privileges is generally not a good idea. Try `__main__.py install --user` instead.
Collecting pip>=21.3 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 1.6MB/s
Collecting setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
  Could not find a version that satisfies the requirement setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)) (from versions: 0.6b1, 0.6b2, 0.6b3, 0.6b4, 0.6rc1, ...
... 59.3.0, 59.4.0, 59.5.0, 59.6.0)
No matching distribution found for setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
```

ScottTodd commented 3 weeks ago

If we're not sure how we want to set up a remote cache by the time we want to transition, I could at least prep a PR that switches relevant workflows to stop using a remote cache.

ScottTodd commented 2 weeks ago

Shared branch tracking the migration: https://github.com/iree-org/iree/tree/shared/runner-cluster-migration

That currently switches the `runs-on:` for multiple jobs to the new cluster and changes some workflows from using the GCP cache to using no cache at all. We'll try setting up a new cache and continue testing there before merging to main.
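
The shape of the change on that branch is roughly the following; the label name here is a placeholder, not the real one:

```yaml
# Hypothetical excerpt from a workflow on the migration branch.
jobs:
  linux_x64_clang:
    # Previously: a set of self-hosted labels targeting the GCP runner group.
    # Now: a single label exposed by the new cluster (placeholder name below),
    # with the remote cache disabled for the moment.
    runs-on: new-cluster-cpubuilder
```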

ScottTodd commented 4 days ago

We're still figuring out how to get build times back to something reasonable on the new cluster by configuring some sort of cache. The linux_x64_clang build is taking around 30 minutes for the entire job on the new runner cluster with no cache, compared to 9 minutes for the entire job on the old runners with a cache.

ccache (https://ccache.dev/) does not have first-class support for Azure Blob Storage, so we are trying a few things:

sccache (https://github.com/mozilla/sccache) is promising since it does have first-class support for Azure Blob Storage: https://github.com/mozilla/sccache/blob/main/docs/Azure.md

Either way, we still need to figure out the security/access model. Ideally we'd have public read access to the cache, but we might need to limit even that if the APIs aren't available. We might have to make some (temporary?) tradeoffs where only PRs sent from the main repo get access to the cache via GitHub Secrets (which aren't shared with PRs from forks) :slightly_frowning_face:
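
As a rough illustration of the secrets-based variant, a hedged sketch of wiring sccache to Azure Blob Storage in a job; the secret, container, prefix, and runner label names are all hypothetical, and fork PRs would simply fall back to building without a cache:

```yaml
# Hypothetical job excerpt: sccache pointed at Azure Blob Storage via secrets.
jobs:
  linux_x64_clang:
    runs-on: new-cluster-cpubuilder   # placeholder runner label
    env:
      SCCACHE_AZURE_BLOB_CONTAINER: ccache-container                                  # hypothetical
      SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}  # hypothetical
      SCCACHE_AZURE_KEY_PREFIX: iree/linux-x64-clang                                  # hypothetical
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: |
          cmake -G Ninja -B build \
            -DCMAKE_C_COMPILER_LAUNCHER=sccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=sccache
          # ... remaining IREE CMake configure options elided ...
          cmake --build build
          sccache --show-stats
```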

benvanik commented 4 days ago

As a data point, I've used sccache locally and it worked as expected for our CMake builds.

ScottTodd commented 4 days ago

Yep, I just had good results with sccache locally on Linux using Azure. I think good next steps are:

  1. Install sccache in the dockerfiles: https://github.com/iree-org/base-docker-images/pull/8
  2. Test sccache inside Docker (or skip this step if confident in the cache hit rates and such)
  3. Switch the test PR (https://github.com/iree-org/iree/pull/18466) to use sccache instead of ccache and confirm that GitHub Actions + Docker + sccache + Azure all play nicely together (see the sketch below)
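
A hedged sketch of what step 3 could look like when the build runs inside the cpubuilder container: the Azure settings get forwarded into `docker run` so sccache inside the container can reach the blob store. The secret names and build script path are assumptions.

```yaml
# Hypothetical step: forward sccache's Azure settings into the build container.
- name: Build inside Docker with sccache
  env:
    SCCACHE_AZURE_BLOB_CONTAINER: ccache-container                                  # hypothetical
    SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}  # hypothetical
    SCCACHE_AZURE_KEY_PREFIX: iree/linux-x64-clang                                  # hypothetical
  run: |
    docker run --rm \
      -v "$PWD:$PWD" -w "$PWD" \
      -e SCCACHE_AZURE_BLOB_CONTAINER \
      -e SCCACHE_AZURE_CONNECTION_STRING \
      -e SCCACHE_AZURE_KEY_PREFIX \
      ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main \
      ./build_tools/cmake/build_all.sh build   # assumed build script path
```
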
ScottTodd commented 3 days ago

Cache scopes / namespaces / keys

sccache supports a `SCCACHE_AZURE_KEY_PREFIX` environment variable:

> You can also define a prefix that will be prepended to the keys of all cache objects created and read within the container, effectively creating a scope. To do that use the SCCACHE_AZURE_KEY_PREFIX environment variable. This can be useful when sharing a bucket with another application.

We can use that to have a single storage account for multiple projects, and it will also let us better manage the storage in the cloud project itself, e.g. checking the size of each folder or deleting an entire folder. Note that sccache's architecture (https://github.com/mozilla/sccache/blob/main/docs/Architecture.md) includes a sophisticated hash function covering environment variables, the compiler binary, compiler arguments, files, etc., so sharing a cache folder between e.g. MSVC on Windows and clang on Linux should be fine. I'd still prefer we separate those caches, though.

Some naming ideas:

Any of the scopes that have frequently changing names should have TTLs on their files or we should audit and clean them up manually from time to time, so they don't live indefinitely.
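
To make the scoping mechanism concrete, a hedged sketch of how prefixes could vary per configuration; the prefix values below are placeholders for illustration, not the naming proposal itself:

```yaml
# Hypothetical: one storage account shared across configurations, scoped by prefix.
# Linux + clang jobs might set:
#   SCCACHE_AZURE_KEY_PREFIX: iree/linux/x64-clang
# Windows + MSVC jobs might set:
#   SCCACHE_AZURE_KEY_PREFIX: iree/windows/x64-msvc
# Short-lived scopes (e.g. per-branch experiments) would get a TTL or periodic cleanup.
env:
  SCCACHE_AZURE_KEY_PREFIX: iree/linux/x64-clang
```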