encukou opened 8 months ago
It's not possible to change the contents of an existing cache, so for this to be useful we need to save a new one occasionally. With https://github.com/python/cpython/pull/113859 it should save only on merged commits, not in-progress PRs. Hopefully that'll be enough.
Another idea is to set a max-size, but the above PR looks like a good thing to try first.
Some napkin math suggests that this should be enough, at least outside sprints. Which is good enough :) I don't think we want to delete the caches manually, so we'll be at GitHub's 10GiB limit regardless of max-size.
I plan to merge tomorrow morning (EU time) so I have the whole day to watch the effect (and revert if needed).
Is there more to do here? (I accidentally clicked "Close with comment" instead of "Comment"; so, please reopen if there is more to do here.)
Yup, there is :)
Please keep the issue open for another possible iteration. This can't really be tested locally :)
It's easy to forget to close it; how about re-opening if another iteration is needed?
But if I forget to close the issue, that'd mean I also forgot to check that the problem is fixed. We'll know in a day or two.
Currently, the oldest caches are from about 2 days ago, and take up about 30 GB.
"New" ccaches (from main) are 4.7 GiB, "old" ccaches are 24.8 GiB, others are 1.5 GiB.
On a few recent PRs, the main build has a 90+ % cache hit rate.
Looks like if I backport the 2 PRs, 10 GB will be enough for about two days of caches. We could go further by writing an Action or bot to remove old ccaches, but IMO GitHub's automatic cleanup will be enough for now.
You mean megabytes, not gigabytes, right?
Gigabytes. These are sums of all such caches, not the individual entries.
I'd like to keep this open for a few more days, to check if everything is in order.
Unfortunately, 10 GB of caches is now only enough for about 12 hours. Much better than before, but still not enough for a task that runs every 24 hours.
The next step is writing code that uses GHA API to clear older ccaches. I'll keep it on my TODO list, but I'm not sure when I'll get to it.
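A minimal sketch of what that cleanup could look like (my own sketch, not an existing script): given the `actions_caches` list that GitHub's `GET /repos/{owner}/{repo}/actions/caches` endpoint returns, pick the ccache entries that haven't been accessed recently; the returned ids could then be fed to `DELETE /repos/{owner}/{repo}/actions/caches/{cache_id}`. The `"ccache"` key substring and the 12-hour cutoff are assumptions, not anything decided in this thread.

```python
from datetime import datetime, timedelta, timezone

def stale_ccache_ids(caches, max_age_hours=12):
    """Return ids of ccache entries not accessed within max_age_hours.

    `caches` is the "actions_caches" list from the GitHub API response;
    each entry has at least "id", "key" and "last_accessed_at".
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    return [
        c["id"]
        for c in caches
        # assumption: our ccache entries have "ccache" in their key
        if "ccache" in c["key"]
        and datetime.fromisoformat(c["last_accessed_at"].replace("Z", "+00:00")) < cutoff
    ]
```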
The cache size limit is now 200 MB.
Do we need it so big? On my fork, they were between 32 and 75 MB.
I guess my questions are:
If we halved it, that might be enough for around 24 hours, but I expect we'd still get some blips that mean the daily cron task loses its cache and has to start afresh.
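To put rough numbers on that (napkin math of my own; the per-entry figures are the size *limit*, not typical sizes):

```python
GIB = 1024**3
MIB = 1024**2

quota = 10 * GIB        # GitHub's per-repo cache quota
per_cache = 200 * MIB   # current per-entry ccache size limit

print(quota // per_cache)         # entries that fit at the 200 MB limit
print(quota // (per_cache // 2))  # entries that fit if we halve the limit
```

So halving the limit roughly doubles how many entries survive before GitHub starts evicting, which matches the "about 24 hours" estimate above.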
Yeah, the current approach can barely get us to where we want to be. It's time to cut our losses and try the other approach. IMO the best direction is to put branch names in the cache names, and write logic to clear all ccaches but the freshest one for each branch.
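The "freshest one per branch" logic could look roughly like this (a sketch under assumptions: a hypothetical `ccache-<branch>-<hash>` key format with no hyphens in branch names, and ISO-8601 `created_at` timestamps as the API returns them):

```python
from collections import defaultdict

def ccaches_to_delete(caches):
    """Return ids of every ccache entry except the newest one per branch.

    Assumes a hypothetical key format "ccache-<branch>-<hash>"; real
    branch names containing "-" would need a smarter parser.
    """
    by_branch = defaultdict(list)
    for c in caches:
        if c["key"].startswith("ccache-"):
            branch = c["key"].split("-")[1]
            by_branch[branch].append(c)
    doomed = []
    for entries in by_branch.values():
        # ISO-8601 timestamps sort correctly as strings; newest first
        entries.sort(key=lambda c: c["created_at"], reverse=True)
        doomed.extend(c["id"] for c in entries[1:])  # keep only the freshest
    return doomed
```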
Sure, sounds good.
From the other side of the fence, we could adjust the stale.yml cron to run more often than daily: https://github.com/python/cpython/blob/main/.github/workflows/stale.yml
Please see https://github.com/python/cpython/pull/116636 to process the stale issues twice per day, which hopefully will keep its cache file fresh and prevent deletion.
Twice per day isn't enough, let's try 4x: https://github.com/python/cpython/pull/116857.
4x doesn't help either. The problem now isn't that the stale cache has been deleted (for example, there's ~30 hours of caches available right now), but https://github.com/actions/stale only checks the first page of results from the cache API!
For example, https://api.github.com/repos/python/cpython/actions/caches only has ~3 hours of caches at the moment.
Upstream bug report is at https://github.com/actions/stale/issues/1136
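The workaround for that upstream bug is simply to page through the results instead of stopping at page one. A sketch of the pagination loop (my own sketch; `fetch_page` stands in for whatever HTTP helper does the authenticated `GET /repos/{owner}/{repo}/actions/caches?per_page=...&page=...` call):

```python
def all_caches(fetch_page, per_page=100):
    """Yield every cache entry, not just the first page.

    fetch_page(page) must return the parsed JSON response for that page,
    i.e. a dict with an "actions_caches" list.
    """
    page = 1
    while True:
        batch = fetch_page(page)["actions_caches"]
        if not batch:
            return
        yield from batch
        if len(batch) < per_page:
            return  # short page means we've reached the end
        page += 1
```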
The ccache action speeds up *nix builds considerably (though not enough that we could do without it). The issue with it is that every build uploads a new cache, filling up our quota very quickly. GitHub removes caches that weren't used recently, but ccache is aggressive enough that it's crowding out smaller, useful entries. Specifically, the action to mark stale PRs (which only runs every once in a while) has its state removed before the next run. (It's hard to test locally; iterating on main seems to be the best way to do it. Sorry for the noise!)