
Migrate jobs off current GCP GHA runner cluster #18238

Open · ScottTodd opened 4 weeks ago

ScottTodd commented 4 weeks ago

Following the work at https://github.com/iree-org/iree/issues/17957 and https://github.com/iree-org/iree/issues/16203, it is just about time to migrate away from the GitHub Actions runners hosted on Google Cloud Platform.

Workflow refactoring tasks

Refactor workflows such that they don't depend on GCP:

Runner setup tasks

Transition tasks

Switch all jobs that need a self-hosted runner to the new runners

Other

ScottTodd commented 4 weeks ago

Experiments are showing that a local ccache backed by the GitHub Actions cache is going to be nowhere near functional for some of the current CI builds. Maybe I have something misconfigured, but I'm seeing cache sizes of up to 2GB still not being enough for Debug or ASan jobs. I can try running with no cache limit to see what that produces, but GitHub's soft limit of 10GB across all cache entries (before it starts evicting entries) will trigger very frequently if we have too many jobs using unique cache keys.
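
For context, a minimal sketch of the kind of setup being experimented with, assuming actions/cache plus a ccache size cap; the job name, cache key, and 2G limit below are illustrative placeholders, not the actual configuration:

```yaml
# Hypothetical workflow excerpt: local ccache backed by the GitHub Actions cache.
jobs:
  linux_x64_clang_debug:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Restore ccache
        uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}/.ccache
          key: ccache-linux-x64-clang-debug-${{ github.sha }}
          restore-keys: ccache-linux-x64-clang-debug-
      - name: Configure ccache
        run: |
          echo "CCACHE_DIR=${{ github.workspace }}/.ccache" >> "$GITHUB_ENV"
          echo "CCACHE_MAXSIZE=2G" >> "$GITHUB_ENV"
      # Build steps would pass -DCMAKE_C_COMPILER_LAUNCHER=ccache and
      # -DCMAKE_CXX_COMPILER_LAUNCHER=ccache to CMake.
```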

amd-chrissosa commented 3 weeks ago

Experiments so far:

I have gone through https://github.com/actions/actions-runner-controller and given it a try through a basic POC, but many things still aren't working yet.

To replicate what I've done so far:

These all work fairly well out of the box. A few suggestions:

Currently blocked on getting images working. I'm going to keep working on this but may pull someone in to help at this point, since the k8s part is at least figured out.
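
For anyone replicating the POC, here is a minimal sketch of the legacy-CRD flavor of an ARC runner pool; the names, label, and image below are hypothetical, and the newer gha-runner-scale-set mode is configured through Helm chart values instead of hand-written CRDs:

```yaml
# Hypothetical actions-runner-controller RunnerDeployment (legacy summerwind CRDs).
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iree-cpubuilder-runners   # placeholder name
spec:
  replicas: 2
  template:
    spec:
      organization: iree-org
      labels:
        - new-cluster-cpubuilder  # placeholder label to target via runs-on:
      # A custom runner image would slot in here once images are sorted out:
      # image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64
```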

ScottTodd commented 3 weeks ago

I created https://github.com/iree-org/base-docker-images and am working to migrate what's left in https://github.com/iree-org/iree/tree/main/build_tools/docker to that repo. Starting with a few workflows that don't have special GCP requirements right now like https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml.

Local testing of https://github.com/iree-org/base-docker-images/pull/4 looks promising to replace gcr.io/iree-oss/base with a new ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64 (or we can just put ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64 on the cluster for those builds, instead of using Docker inside Docker).
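
For illustration only, a hedged sketch of pointing a job at the new image via the `container:` key (the job name, tag, and runner choice are placeholders):

```yaml
# Hypothetical job excerpt: run the existing build steps inside the new image.
jobs:
  linux_x64_clang_debug:
    runs-on: ubuntu-22.04   # or a label from the new runner cluster
    container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
    steps:
      - uses: actions/checkout@v4
      # ... existing CMake/ctest steps unchanged ...
```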

We could also try using the manylinux image, but I'm not sure we should expect it to work well enough with the base C++ toolchains outside of Python packaging. I gave that a try locally too but got errors like:

```
# python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt
WARNING: Running pip install with root privileges is generally not a good idea. Try `__main__.py install --user` instead.
Collecting pip>=21.3 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 1.6MB/s
Collecting setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
  Could not find a version that satisfies the requirement setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)) (from versions: 0.6b1, 0.6b2, 0.6b3, 0.6b4, 0.6rc1, ...
... 59.3.0, 59.4.0, 59.5.0, 59.6.0)
No matching distribution found for setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
```

ScottTodd commented 3 weeks ago

If we're not sure how we want to set up a remote cache by the time we want to transition, I could at least prep a PR that switches relevant workflows to stop using a remote cache.

ScottTodd commented 2 weeks ago

Shared branch tracking the migration: https://github.com/iree-org/iree/tree/shared/runner-cluster-migration

That currently switches the `runs-on:` for multiple jobs to the new cluster and changes some workflows from using the GCP cache to using no cache at all. We'll try setting up a new cache and continue testing there before merging to main.
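
The shape of the change on that branch is roughly the following; the label name here is a placeholder, not the real one:

```yaml
# Hypothetical excerpt from a workflow on the migration branch.
jobs:
  linux_x64_clang:
    # Previously: a set of self-hosted labels targeting the GCP runner group.
    # Now: a single label exposed by the new cluster (placeholder name below),
    # with the remote cache disabled for the moment.
    runs-on: new-cluster-cpubuilder
```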

ScottTodd commented 4 days ago

We're still figuring out how to get build times back to something reasonable on the new cluster by configuring some sort of cache. The linux_x64_clang build is taking around 30 minutes for the entire job on the new runner cluster with no cache, compared to 9 minutes for the entire job on the old runners with a cache.

ccache (https://ccache.dev/) does not have first-class support for Azure Blob Storage, so we are trying a few things:

sccache (https://github.com/mozilla/sccache) is promising since it does have first-class support for Azure Blob Storage: https://github.com/mozilla/sccache/blob/main/docs/Azure.md

Either way, we still need to figure out the security/access model. Ideally we'd have public read access to the cache, but we might need to limit even that if the APIs aren't available. We might have to make some (temporary?) tradeoffs where only PRs sent from the main repo get access to the cache via GitHub Secrets (which aren't shared with PRs from forks) :slightly_frowning_face:
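
As a rough illustration of the secrets-based variant, a hedged sketch of wiring sccache to Azure Blob Storage in a job; the secret, container, prefix, and runner label names are all hypothetical, and fork PRs would simply fall back to building without a cache:

```yaml
# Hypothetical job excerpt: sccache pointed at Azure Blob Storage via secrets.
jobs:
  linux_x64_clang:
    runs-on: new-cluster-cpubuilder   # placeholder runner label
    env:
      SCCACHE_AZURE_BLOB_CONTAINER: ccache-container                                  # hypothetical
      SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}  # hypothetical
      SCCACHE_AZURE_KEY_PREFIX: iree/linux-x64-clang                                  # hypothetical
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: |
          cmake -G Ninja -B build \
            -DCMAKE_C_COMPILER_LAUNCHER=sccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=sccache
          # ... remaining IREE CMake configure options elided ...
          cmake --build build
          sccache --show-stats
```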

benvanik commented 4 days ago

As a data point, I've used sccache locally and it worked as expected for our CMake builds.

ScottTodd commented 4 days ago

Yep, I just had good results with sccache locally on Linux using Azure. I think good next steps are:

  1. Install sccache in the dockerfiles: https://github.com/iree-org/base-docker-images/pull/8
  2. Test sccache inside Docker (or skip this step if confident in the cache hit rates and such)
  3. Switch the test PR (https://github.com/iree-org/iree/pull/18466) to use sccache instead of ccache and confirm that GitHub Actions + Docker + sccache + Azure all play nicely together (see the sketch below)
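
A hedged sketch of what step 3 could look like when the build runs inside the cpubuilder container: the Azure settings get forwarded into `docker run` so sccache inside the container can reach the blob store. The secret names and build script path are assumptions.

```yaml
# Hypothetical step: forward sccache's Azure settings into the build container.
- name: Build inside Docker with sccache
  env:
    SCCACHE_AZURE_BLOB_CONTAINER: ccache-container                                  # hypothetical
    SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}  # hypothetical
    SCCACHE_AZURE_KEY_PREFIX: iree/linux-x64-clang                                  # hypothetical
  run: |
    docker run --rm \
      -v "$PWD:$PWD" -w "$PWD" \
      -e SCCACHE_AZURE_BLOB_CONTAINER \
      -e SCCACHE_AZURE_CONNECTION_STRING \
      -e SCCACHE_AZURE_KEY_PREFIX \
      ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main \
      ./build_tools/cmake/build_all.sh build   # assumed build script path
```
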
ScottTodd commented 3 days ago

Cache scopes / namespaces / keys

sccache supports a `SCCACHE_AZURE_KEY_PREFIX` environment variable:

> You can also define a prefix that will be prepended to the keys of all cache objects created and read within the container, effectively creating a scope. To do that use the SCCACHE_AZURE_KEY_PREFIX environment variable. This can be useful when sharing a bucket with another application.

We can use that to have a single storage account for multiple projects, and it will also let us better manage the storage in the cloud project itself, e.g. checking the size of each folder or deleting an entire folder. Note that sccache's architecture (https://github.com/mozilla/sccache/blob/main/docs/Architecture.md) includes a sophisticated hash function covering environment variables, the compiler binary, compiler arguments, files, etc., so sharing a cache folder between e.g. MSVC on Windows and clang on Linux should be fine. I'd still prefer we separate those caches, though.

Some naming ideas:

Any of the scopes that have frequently changing names should have TTLs on their files or we should audit and clean them up manually from time to time, so they don't live indefinitely.
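
To make the scoping mechanism concrete, a hedged sketch of how prefixes could vary per configuration; the prefix values below are placeholders for illustration, not the naming proposal itself:

```yaml
# Hypothetical: one storage account shared across configurations, scoped by prefix.
# Linux + clang jobs might set:
#   SCCACHE_AZURE_KEY_PREFIX: iree/linux/x64-clang
# Windows + MSVC jobs might set:
#   SCCACHE_AZURE_KEY_PREFIX: iree/windows/x64-msvc
# Short-lived scopes (e.g. per-branch experiments) would get a TTL or periodic cleanup.
env:
  SCCACHE_AZURE_KEY_PREFIX: iree/linux/x64-clang
```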