ScottTodd opened 4 weeks ago
Experiments are showing that a local ccache stored via the GitHub Actions cache is nowhere near sufficient for some of the current CI builds. Maybe I have something misconfigured, but I'm seeing cache sizes of up to 2GB still not be enough for Debug or ASan jobs. I can try running with no cache limit to see what that produces, but GitHub's soft limit of 10GB across all cache entries before it starts evicting entries will be hit very frequently if we have too many jobs using unique cache keys.
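Before giving up on the Actions cache entirely, it may be worth shrinking what ccache stores; a sketch of the relevant settings (the size and level values here are guesses to experiment with, not tested numbers):

```shell
# Cap the local cache size and enable compression so more objects fit in
# the same footprint before the tarball hits GitHub's cache limits.
export CCACHE_DIR=/tmp/ccache      # directory that gets saved/restored by actions/cache
export CCACHE_MAXSIZE=1.5G         # hard cap on cache size (hypothetical value)
export CCACHE_COMPRESS=true        # compress cached objects
export CCACHE_COMPRESSLEVEL=10     # zstd level; higher = smaller but slower
```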
Experiments so far:
I have gone through https://github.com/actions/actions-runner-controller and given it a try with a basic POC, but many things still aren't working yet.
To replicate what I've done so far:
These all work fairly well out of the box. A few suggestions:
Currently blocked on getting images working. Going to keep working on this, but I may pull someone in to help at this point, since the k8s part is at least figured out.
I created https://github.com/iree-org/base-docker-images and am working to migrate what's left in https://github.com/iree-org/iree/tree/main/build_tools/docker to that repo. Starting with a few workflows that don't have special GCP requirements right now like https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml.
Local testing of https://github.com/iree-org/base-docker-images/pull/4 looks promising to replace `gcr.io/iree-oss/base` with a new `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64` (or we can just put `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64` on the cluster for those builds, instead of using Docker inside Docker).
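For local verification of the new image, something like the following should work (the `:main` tag is an assumption — check the package page on ghcr.io for the tags actually published):

```shell
# Pull the candidate builder image and sanity-check the toolchain inside it.
docker pull ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
docker run --rm \
  -v "$PWD":/work -w /work \
  ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main \
  bash -c "cmake --version && clang --version"
```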
We could also try using the manylinux image but I'm not sure if we should expect that to work well enough with the base C++ toolchains outside of python packaging. I gave that a try locally too but got errors like:
```
# python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt
WARNING: Running pip install with root privileges is generally not a good idea. Try `__main__.py install --user` instead.
Collecting pip>=21.3 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 1.6MB/s
Collecting setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
  Could not find a version that satisfies the requirement setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)) (from versions: 0.6b1, 0.6b2, 0.6b3, 0.6b4, 0.6rc1, ...
  ... 59.3.0, 59.4.0, 59.5.0, 59.6.0)
No matching distribution found for setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)
```
If we're not sure how we want to set up a remote cache by the time we want to transition, I could at least prep a PR that switches relevant workflows to stop using a remote cache.
Shared branch tracking the migration: https://github.com/iree-org/iree/tree/shared/runner-cluster-migration
That currently switches the `runs-on:` for multiple jobs to the new cluster and changes some workflows from using the GCP cache to using no cache. We'll try setting up a new cache and continue testing there before merging to main.
We're still figuring out how to get build times back to reasonable on the new cluster by configuring some sort of cache. The `linux_x64_clang` build is taking around 30 minutes for the entire job on the new runner cluster with no cache, compared to 9 minutes for the entire job on the old runners with a cache.
ccache (https://ccache.dev/) does not have first class support for Azure Blob Storage, so we are trying a few things:
- blobfuse2 (https://github.com/Azure/azure-storage-fuse) to mount the remote directory and treat it as local (`blobfuse2 mount ... /mnt/azureblob` + `CCACHE_DIR=/mnt/azureblob/ccache-container`), but that has some confusing configuration and doesn't appear to support multiple concurrent readers/writers:
  > Blobfuse2 supports both reads and writes however, it does not guarantee continuous sync of data written to storage using other APIs or other mounts of Blobfuse2. For data integrity it is recommended that multiple sources do not modify the same blob/file.
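For reference, the blobfuse2 experiment looked roughly like this (the config file name and mount point are placeholders, not the exact setup we ran):

```shell
# Mount the Azure storage container at a local path. blobfuse2 reads the
# storage account, container, and auth details from a YAML config file.
blobfuse2 mount /mnt/azureblob --config-file=blobfuse2-config.yaml

# Treat the mounted container as a local ccache directory.
export CCACHE_DIR=/mnt/azureblob/ccache-container
```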
- sccache (https://github.com/mozilla/sccache) is promising since it does have first class support for Azure Blob Storage: https://github.com/mozilla/sccache/blob/main/docs/Azure.md
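A minimal sccache + Azure setup, following the variables documented in Azure.md (the container name and the CMake launcher wiring here are illustrative, not our final config):

```shell
# Credentials and target container for the remote cache.
export SCCACHE_AZURE_CONNECTION_STRING="DefaultEndpointsProtocol=...;AccountName=...;AccountKey=..."
export SCCACHE_AZURE_BLOB_CONTAINER="sccache"

# Route compiler invocations through sccache in CMake builds.
cmake -G Ninja -B build \
  -DCMAKE_C_COMPILER_LAUNCHER=sccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=sccache
```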
Either way we still need to figure out the security/access model. Ideally we'd have public read access to the cache, but we might need to limit even that if the APIs aren't available. We might have to make some (temporary?) tradeoffs where only PRs sent from the main repo get access to the cache via GitHub Secrets (which aren't shared with PRs from forks) :slightly_frowning_face:
As a data point I've used sccache locally and it worked as expected for our cmake builds.
Yep I just had good results with sccache locally on Linux and using Azure. I think good next steps are:
sccache supports a `SCCACHE_AZURE_KEY_PREFIX` environment variable:
> You can also define a prefix that will be prepended to the keys of all cache objects created and read within the container, effectively creating a scope. To do that use the `SCCACHE_AZURE_KEY_PREFIX` environment variable. This can be useful when sharing a bucket with another application.
We can use that to have a single storage account for multiple projects, which will also let us better manage the storage in the cloud project itself, e.g. checking the size of each folder or deleting an entire folder. Note that sccache's architecture (https://github.com/mozilla/sccache/blob/main/docs/Architecture.md) includes a sophisticated hash function covering environment variables, the compiler binary, compiler arguments, files, etc., so sharing a cache folder between e.g. MSVC on Windows and clang on Linux should be fine. I'd still prefer we separate those caches though.
Some naming ideas:
- `${PROJECT}-${JOB_NAME}`, e.g. `iree-linux_x64_clang`
- `${DOCKERFILE_URL}` - we currently do this for the GCP ccache namespaces, e.g. `CCACHE_NAMESPACE=gcr.io/iree-oss/base-arm64@sha256:9daa1cdbbf12da8527319ece76a64d06219e04ecb99a4cff6e6364235ddf6c59`
- `${PROJECT}-${JOB_NAME}-${LLVM_COMMIT}`
- `${PROJECT}-${JOB_NAME}-${DATE}`
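A sketch of how the first of these naming schemes could be wired into a workflow step (the variable names are hypothetical, not from an existing workflow):

```shell
# Compose a per-project, per-job cache scope for sccache.
PROJECT="iree"
JOB_NAME="linux_x64_clang"
export SCCACHE_AZURE_KEY_PREFIX="${PROJECT}-${JOB_NAME}"
echo "${SCCACHE_AZURE_KEY_PREFIX}"  # iree-linux_x64_clang
```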
Our GitHub Actions cache keys (https://github.com/iree-org/iree/actions/caches) include timestamps, but those are also pruned frequently, and the cache lookup operates on a prefix (https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows). Any of the scopes that have frequently changing names should have TTLs on their files, or we should audit and clean them up manually from time to time, so they don't live indefinitely.
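One way the TTL side could be handled is a periodic cleanup job against the storage container; a hedged az CLI sketch (the container name and 30-day retention window are made up):

```shell
# Delete cached blobs that haven't been modified in the last 30 days.
# Requires az CLI auth with access to the storage account; GNU date syntax.
CUTOFF="$(date -u -d '30 days ago' +%Y-%m-%dT%H:%MZ)"
az storage blob delete-batch \
  --source sccache \
  --if-unmodified-since "$CUTOFF"
```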
Following the work at https://github.com/iree-org/iree/issues/17957 and https://github.com/iree-org/iree/issues/16203, it is just about time to migrate away from the GitHub Actions runners hosted on Google Cloud Platform.
Workflow refactoring tasks

Refactor workflows such that they don't depend on GCP:
- `gcloud` command
- http://storage.googleapis.com/iree-sccache/ccache (configured using `setup_ccache.sh`)
- `build_tools/github_actions/docker_run.sh` script

Runner setup tasks
Transition tasks

Switch all jobs that need a self hosted runner to the new runners:
- `linux_x86_64_release_packages` in `pkgci_build_packages.yml`
- `linux_x64_clang` in `ci_linux_x64_clang.yml`
- `linux_x64_clang_asan` in `ci_linux_x64_clang_asan.yml`
- `linux_x64_clang_tsan` in `ci_linux_x64_clang_tsan.yml`
- `linux_x64_clang_debug` in `ci_linux_x64_clang_debug.yml`
- `build_test_all_bazel` in `ci.yml`
- `linux_arm64_clang` in `ci_linux_arm64_clang.yml`
- `build_packages` (arm64) in `build_package.yml`
- `test` in `pkgci_test_nvidia_t4.yml`
- `nvidiagpu_cuda` in `pkgci_regression_test.yml`
- `nvidiagpu_vulkan` in `pkgci_regression_test.yml`
Other