rapidsai / build-planning

Tracking for RAPIDS-wide build tasks
https://github.com/rapidsai

Use GHA's caching mechanism to save package manager caches between runs #51

Open vyasr opened 4 months ago

vyasr commented 4 months ago

Currently RAPIDS CI jobs spend a significant amount of time constructing environments, whether pip or conda. A meaningful chunk of this time is spent downloading packages from remote sources. Aside from the inherent waste of time and network bandwidth, these downloads also expose us to more network connectivity issues, which have plagued our CI in general.

We should investigate using GitHub's native dependency caching functionality. GHA recommends caching for specific package managers via the corresponding setup-* actions, but those are more general tools that are also intended to install the package managers themselves. Since those package managers are already installed in our base images, we will have to manage the caching directly. That shouldn't be too difficult, though; we simply need to construct a suitable cache key corresponding to the path of each package manager's local cache (e.g. /opt/conda/pkgs for conda).
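As a sketch of what "construct a suitable cache key" could look like, the key could be derived from a hash of the environment spec, analogous to GHA's `hashFiles()` expression, so the cache is invalidated whenever the spec changes. The function name, prefix, and file layout below are hypothetical, not part of any existing RAPIDS tooling:

```python
import hashlib
from pathlib import Path

def conda_cache_key(env_file: str, prefix: str = "conda-pkgs") -> str:
    """Build a cache key for a local package cache such as /opt/conda/pkgs.

    Mirrors the GHA pattern ``{prefix}-${{ hashFiles('env.yaml') }}``:
    hashing the environment file means the key (and thus the cache entry)
    changes exactly when the declared dependencies change.
    """
    digest = hashlib.sha256(Path(env_file).read_bytes()).hexdigest()
    return f"{prefix}-{digest[:16]}"
```

The truncated digest keeps keys readable while still making collisions between different environment specs vanishingly unlikely.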

We will need to figure out what makes the most sense to put into a cache key. One option would be a single cache for all conda packages across our entire matrix of jobs, but that would mean sharing a cache between different architectures and CUDA versions, which may not be ideal. The opposite extreme would be a separate cache for every matrix entry in a job (e.g. arch/CUDA version/Python version). In general, we'll need to balance cache size (smaller caches upload and download faster), contention (I don't know how well GHA handles every PR in a repo trying to upload or download the exact same cache simultaneously; hopefully that's well optimized, but we'll have to test), and cache hit rate (if different jobs have partial overlap in their dependencies, then using a shared cache will increase the hit rate).
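One way to get some of both extremes is `actions/cache`'s `restore-keys` mechanism: the fully specific key is tried first, then coarser prefixes in order, so an exact miss can still restore a partially overlapping cache. A hypothetical sketch of that key layout (the naming scheme is an assumption, not our actual CI configuration):

```python
def conda_cache_keys(arch: str, cuda: str, python: str):
    """Return (key, restore_keys) for one matrix entry.

    actions/cache looks up ``key`` exactly, then falls back to the first
    cache whose key matches a ``restore_keys`` prefix, most specific first.
    """
    key = f"conda-{arch}-cuda{cuda}-py{python}"
    restore_keys = [
        f"conda-{arch}-cuda{cuda}-",  # same arch and CUDA, any Python
        f"conda-{arch}-",             # same arch, any CUDA version
    ]
    return key, restore_keys
```

With this layout a job still gets a per-matrix-entry cache on the happy path, while a new Python version or CUDA bump starts from the nearest existing cache rather than from nothing.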

jjacobelli commented 3 months ago

Using GH's native caching feature may not work as expected because RAPIDS uses self-hosted runners. When caching is used with self-hosted runners, the cache is stored on GitHub-owned cloud storage, which means the runners will still need to download the cache from that storage on every run (per the GH documentation). We are investigating adding some caching at the runner level for package managers like pip and conda.

ajschmidt8 commented 3 months ago

Thanks, Vyas, for bringing this issue to my attention.

Jordan's comment is correct. Caching dependencies with GitHub's native solution doesn't really work for self-hosted runners. There is a community issue about it below:

We are working on an NGINX caching proxy that can cache pip and conda packages close to our self-hosted runners. We are still in the testing phase, but we will be sure to broadcast the feature when it's ready.

Until then, I would recommend that no one work on this issue.