rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo

gc: Determine max-size use cases, and consider design changes #13062

Open ehuss opened 11 months ago

ehuss commented 11 months ago

The current implementation from #12634 has support for manual cleaning of cache data based on size. This is just a preliminary implementation, with the intent to gather better use cases to understand how size-based cleaning should work.

Some considerations for changes:

polarathene commented 8 months ago
  • should it maybe have higher precedence to delete src first (since it can be recreated locally)

Yes.

Advice for CI caching is already to avoid caching registry/src/, and it's also usually where most of the weight is between the two.

What should be included in size calculations (src, cache, git db, git co, indexes)?

It may be better to have fewer options than I'm currently seeing in the nightly CLI support. Rather than explicit settings for each, the user could have an option to customize a list of accepted values to group the size/time calculation on. Priority could also be derived from the order there? 🤷‍♂️

Not sure how many users would need separate granular controls for different time/size limits.
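
To sketch the grouped-list idea (the names here are made up for illustration, not existing cargo config or flags), the order of the list could double as the eviction priority:

```rust
// Hypothetical sketch: derive eviction priority from a user-supplied,
// ordered list of cache components. None of these names are real cargo
// config keys; they only illustrate the idea.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
enum CacheComponent {
    Src,   // registry/src/   (re-creatable from cache/)
    Cache, // registry/cache/ (downloaded .crate files)
    GitCo, // git/checkouts/
    GitDb, // git/db/
    Index, // registry/index/
}

fn parse_components(list: &str) -> Vec<CacheComponent> {
    list.split(',')
        .filter_map(|s| match s.trim() {
            "src" => Some(CacheComponent::Src),
            "cache" => Some(CacheComponent::Cache),
            "git-co" => Some(CacheComponent::GitCo),
            "git-db" => Some(CacheComponent::GitDb),
            "index" => Some(CacheComponent::Index),
            _ => None,
        })
        .collect()
}

fn main() {
    // Earlier in the list = counted in the size budget and deleted first.
    let order = parse_components("src, git-co, cache, git-db, index");
    for (priority, component) in order.iter().enumerate() {
        println!("priority {priority}: {component:?}");
    }
}
```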

What are the use cases for cleaning based on size?

For my personal use, I have been on a system with only about 30GB of disk space available for many months, so it's a constant juggle of what to delete as that gets used up and hunting down what is eating the space. I tend to use Docker as a way to more easily control where that size accumulates so it can be easily discarded.

For CI, 90 days is quite a bit of time for cache to linger. GitHub Actions IIRC has a 10GB size limit and retains an uploaded cache entry for 7 days (possibly longer if it is still being used, I forget). With each change to the cache uploading a new entry, the size accumulates over time. It can be better to keep relevant cache around for as long as possible, but not to wastefully keep extra cache that contributes to that shared 10GB of cache storage and affects upload/download time.

You could bring down the min age, and that'd be fine if it resets the age when the relevant cache is used. I don't know if CI has atime set; on my personal systems I usually have noatime. I'm not quite sure how the time feature works either; I'd have thought it was mtime-based, but that doesn't seem to be the case.
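
A quick probe like the one below shows whether reads actually bump atime on a given filesystem (on noatime/relatime mounts they often won't), which is why I'd be wary of relying on filesystem access times for "last use":

```rust
// Small probe: create a file, read it, and compare accessed vs. modified
// timestamps. On noatime/relatime mounts the accessed time often won't move
// after a plain read.
use std::fs;
use std::io::Read;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("atime-probe");
    fs::write(&path, b"probe")?;
    let before = fs::metadata(&path)?.accessed()?;

    sleep(Duration::from_secs(2));
    let mut buf = String::new();
    fs::File::open(&path)?.read_to_string(&mut buf)?;

    let after = fs::metadata(&path)?.accessed()?;
    println!("atime moved after read: {}", after > before);
    println!("mtime: {:?}", fs::metadata(&path)?.modified()?);
    Ok(())
}
```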


Observations:

Presumably an entire crate in registry/src/ would be removed from the cache, not a portion of its contents? I'm not sure why listing the full contents per crate makes sense as the first --verbose level; it's a bit much when you only want a little more information.

What should be the priorities for what it cleans first?

If you have information about how relevant a crate is to keep, based on metrics you're tracking in the DB, that should work for removing the least useful crates. Perhaps for registry/src/ the ratio of uncompressed size to the corresponding registry/cache/ entry might be a useful metric? Alternatively, when a crate has multiple versions, removing the older ones may make the most sense?

If the user is invoking a max age or max size limit, you could also weight decisions based on that. For CI I'd rather clear out registry/src/ and maintain an age + size limit; if the size threshold is hit, dropping whatever data is stale by time first is probably the way to go.
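
As a toy sketch of that kind of weighting (all the types and weights here are invented for illustration, not anything cargo actually tracks):

```rust
// Toy eviction ordering for clean-by-size: prefer deleting entries that are
// old, large, and cheap to recreate (e.g. registry/src/ extractions that can
// be re-extracted from registry/cache/). All types and weights are invented.
struct CacheEntry {
    name: String,
    size_bytes: u64,
    days_since_last_use: u32,
    recreatable_locally: bool, // e.g. a src/ extraction vs. a .crate download
}

fn eviction_score(e: &CacheEntry) -> f64 {
    let age = e.days_since_last_use as f64;
    let size_mb = e.size_bytes as f64 / (1024.0 * 1024.0);
    let recreate_bonus = if e.recreatable_locally { 4.0 } else { 1.0 };
    // Higher score = delete sooner.
    age * size_mb * recreate_bonus
}

fn main() {
    let mut entries = vec![
        CacheEntry { name: "src/serde-1.0.190".into(), size_bytes: 3_000_000, days_since_last_use: 20, recreatable_locally: true },
        CacheEntry { name: "cache/serde-1.0.190.crate".into(), size_bytes: 500_000, days_since_last_use: 20, recreatable_locally: false },
        CacheEntry { name: "src/old-dep-0.1.0".into(), size_bytes: 1_000_000, days_since_last_use: 80, recreatable_locally: true },
    ];
    entries.sort_by(|a, b| eviction_score(b).total_cmp(&eviction_score(a)));
    for e in &entries {
        println!("{:<30} score {:.1}", e.name, eviction_score(e));
    }
}
```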

Should --max-download-size include git caches?

If you look at the docker init command with its Rust-generated Dockerfile, it presently has two cache mounts, one for the git db and another for the registry. Earthly's lib/rust function is more granular IIRC with its cache mounts, and runs cargo-sweep with a max age of 4 days on target/.

I think it makes sense to include the git cache as well, especially now that the registry index is sparse by default? I'm not too familiar with it beyond using git for some dependencies instead of a published release, but there's enough overlap that it makes sense to me to bundle it into that setting.

Should automatic gc support size-based cleaning?

That'd be nice. For personal systems, I often run low on disk space multiple times through the year especially with some issues when using WSL2.

It's often workload-specific, and I don't always notice when something eats up another 10-20GB in a short time, which can be problematic. That's not specifically due to cargo's cache size, but it can help avoid the consequence (if the Windows host runs out of disk space, WSL2 becomes unresponsive and requires a reboot AFAIK; the process won't terminate or recover once you've freed disk space).

You could also reference the systemd journal, which has a similar size limit for triggering a clean (around 4GB by default IIRC). I don't know if this would be suitable as a default; maybe it should be opt-in config, or be aware of available disk space relative to disk size (Docker Desktop can be misleading here, with WSL2 mounts that report a much larger disk size than is actually available, and that'd carry over into Docker IIRC).
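
A rough sketch of a journald-style cap, where the effective limit is the smaller of a configured maximum and a fraction of the disk (the numbers and the disk-size helper are placeholders, not cargo behavior):

```rust
// Sketch of a journald-style cap: the effective limit is the smaller of an
// explicit maximum and a fraction of the filesystem size. The disk-size
// query is stubbed out; a real implementation would use statvfs (or a crate
// such as fs2/sysinfo) to fill it in.
const GIB: u64 = 1024 * 1024 * 1024;

/// Placeholder: total size of the filesystem holding CARGO_HOME.
/// (Assumed value for illustration only.)
fn filesystem_total_bytes() -> u64 {
    256 * GIB
}

fn effective_limit(configured_max: u64) -> u64 {
    // journald defaults to roughly min(4 GiB, 10% of the filesystem);
    // the same shape could work for an opt-in cache cap.
    let ten_percent_of_disk = filesystem_total_bytes() / 10;
    configured_max.min(ten_percent_of_disk)
}

fn main() {
    let limit = effective_limit(4 * GIB);
    println!("auto-clean would trigger above {} GiB", limit / GIB);
}
```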

This may be tricky to implement, since you don't want to delete anything that is being used by the current build, and could potentially cause cache thrashing.

On a personal system, I'm only really concerned about it when I'm at risk of underestimating how much disk will get used (WSL2 uses up disk space based on memory usage for the disk cache buffer too).

You could perhaps defer the cleanup, or on a Linux system it could be run via a systemd timer/cron task when the system appears idle. That sort of thing doesn't technically need official support either, if an external tool/command can get the cache size information and invoke cargo gc.
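
Something like the watchdog below could run from a systemd timer or cron; it assumes the nightly `cargo clean gc` subcommand and the `--max-download-size` flag mentioned above, and the exact spelling of both may well change:

```rust
// External "watchdog" sketch: sum up the cache size and, if it crosses a
// threshold, shell out to the manual gc command. The invocation below is an
// assumption based on the current nightly work in this thread.
use std::fs;
use std::path::Path;
use std::process::Command;

// Naive apparent-size sum (see the du discussion below for its limits).
fn dir_size(path: &Path) -> u64 {
    let mut total = 0;
    if let Ok(entries) = fs::read_dir(path) {
        for entry in entries.flatten() {
            if let Ok(meta) = entry.metadata() {
                if meta.is_dir() {
                    total += dir_size(&entry.path());
                } else {
                    total += meta.len();
                }
            }
        }
    }
    total
}

fn main() {
    let cargo_home = std::env::var("CARGO_HOME")
        .unwrap_or_else(|_| format!("{}/.cargo", std::env::var("HOME").unwrap_or_default()));
    let size = dir_size(Path::new(&cargo_home));
    const THRESHOLD: u64 = 5 * 1024 * 1024 * 1024; // 5 GiB, arbitrary

    if size > THRESHOLD {
        // -Zgc gates the unstable behavior on current nightlies (assumption).
        let status = Command::new("cargo")
            .args(["clean", "gc", "-Zgc", "--max-download-size=1GB"])
            .status();
        println!("cache at {size} bytes, gc exit: {status:?}");
    } else {
        println!("cache at {size} bytes, under threshold");
    }
}
```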

The current du implementation is primitive, and doesn't know about block sizes, and thus vastly undercounts the disk usage of small files.

You may also have reflinks, which are similar to hardlinks but CoW-capable. While the better accuracy would be nice, I wouldn't consider it a blocker for what is otherwise a valuable gc command.
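
For reference, on Unix the allocated size is already available from file metadata (st_blocks is always in 512-byte units), so a more accurate du is mostly a matter of summing that instead of the apparent length. A minimal sketch, which still doesn't detect hardlink/reflink sharing:

```rust
// Unix-only sketch: compare apparent size with allocated size. blocks() is
// the st_blocks count in 512-byte units regardless of the filesystem block
// size, so blocks() * 512 reflects what a file really occupies, including
// the rounding-up of small files that a naive len() sum misses.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).unwrap_or_else(|| ".".into());
    let mut apparent = 0u64;
    let mut allocated = 0u64;

    for entry in fs::read_dir(&path)? {
        let meta = entry?.metadata()?;
        if meta.is_file() {
            apparent += meta.len();
            allocated += meta.blocks() * 512;
        }
    }
    println!("apparent: {apparent} bytes, allocated: {allocated} bytes");
    Ok(())
}
```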

I read something about access-time monitoring (for one of the third-party tools, I think). I have a concern about relying on that, as noatime isn't that uncommon for Linux filesystems AFAIK?


Reference: Docker Prune

Docker offers something similar with:

- `docker image prune`
- `docker container prune`
- `docker system prune`
- etc.

`docker system` is the broader scope vs only pruning disposable data from a specific source. There's also the `prune --all` option to be more aggressive, along with a flexible `--filter` query arg to target by age or other criteria.
epage commented 8 months ago

For CI, 90 days is quite a bit of time for cache to linger. GitHub Actions IIRC has a 10GB size limit and retains an uploaded cache entry for 7 days (possibly longer if it is still being used, I forget). With each change to the cache uploading a new entry, the size accumulates over time. It can be better to keep relevant cache around for as long as possible, but not to wastefully keep extra cache that contributes to that shared 10GB of cache storage and affects upload/download time.

I feel like the best way to handle CI is being able to somehow specify "clear everything that wasn't used within this CI job". So long as we have a way to formulate that query, size mostly doesn't matter.
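
A sketch of that query, under the assumption that last-use timestamps are tracked somewhere (the tracking structure below is invented for illustration): snapshot a timestamp when the job starts, then evict anything whose recorded last use predates it.

```rust
// "Clear everything not used within this CI job": compare each tracked
// entry's last-use time against the job start time. The Tracked struct is
// a stand-in for whatever cargo's gc tracking actually records.
use std::time::{Duration, SystemTime};

struct Tracked {
    name: &'static str,
    last_use: SystemTime,
}

fn main() {
    let job_start = SystemTime::now() - Duration::from_secs(60 * 60); // job began an hour ago
    let entries = vec![
        // used by this build:
        Tracked { name: "serde-1.0.190.crate", last_use: SystemTime::now() },
        // untouched for a month:
        Tracked { name: "old-dep-0.1.0.crate", last_use: job_start - Duration::from_secs(86_400 * 30) },
    ];

    for e in &entries {
        let keep = e.last_use >= job_start;
        println!("{}: {}", e.name, if keep { "keep" } else { "evict" });
    }
}
```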

polarathene commented 8 months ago

clear everything that wasn't used within this CI job

Potentially, but look at the Earthly blog on this topic and how they describe a remote CI runner that doesn't upload/download remote cache for supporting multiple CI build jobs.

You could still use a remote cache that multiple runners could share if it's not too large but has enough common dependencies too?

Perhaps an example of that would be the docker init Rust template, which provides a Dockerfile that uses three cache mounts. Two are related to the CARGO_HOME cache (git/ and registry/), while the third is for target/ (keyed by target and project). That would share common cache across multiple builds in the same CI job, but via containers.

Since it's during a concurrent docker image build, it's a bit difficult to access the cache mount afterwards. I suppose one could force an image build afterwards for cleanup to access the cache mount, might be a little tricky/awkward though? 🤷‍♂️

If using a matrix build to split a job across separate runners, I guess that while they could all pull the same remote cache item, they can't easily upload one with the different writes unique to each build, like they could in the concurrent build job.


Given the above, maybe a time-based eviction policy is still a valid approach, just one that tracks the time at which stale cache should be evicted. Browser caches have quite a few options to manage this on the client side: a resource gets an etag and a cache policy like stale-while-revalidate. Perhaps cargo could do something similar and not clear a cache item early, but put it in a pending-removal state, so long as nothing else uses the cache item before the lock file is released? That might be relevant to the rustc concern, not sure?

On Linux, for memory compression there is also ZRAM. It has a feature with a backing store on disk that stale pages can be moved to. It just takes a command that marks all the pages it currently holds compressed, and nothing happens until the second run: anything still marked for removal is removed, while anything that has been used since has had that marker discarded by that point. After the marked content is dropped, the unmarked content is all marked again and the process repeats.


Would that work?

That way you can have a low cache expiry while still keeping actively used items?
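
A minimal sketch of that two-pass "mark idle, evict if still idle" flow (types invented for illustration):

```rust
// Two-pass sweep modelled on the zram writeback flow described above:
// pass 1 marks everything currently present as idle, any use in between
// clears the mark, pass 2 evicts whatever is still marked and re-marks
// the survivors for the next round.
use std::collections::HashMap;

#[derive(Default)]
struct IdleTracker {
    marked: HashMap<String, bool>, // entry name -> still marked as idle
}

impl IdleTracker {
    // Pass 1 (and after every sweep): mark everything as idle.
    fn mark_all(&mut self, entries: &[&str]) {
        for e in entries {
            self.marked.insert(e.to_string(), true);
        }
    }

    // Any use between passes clears the idle mark.
    fn touch(&mut self, entry: &str) {
        self.marked.insert(entry.to_string(), false);
    }

    // Pass 2: whatever is still marked gets evicted; the rest is re-marked.
    fn sweep(&mut self) -> Vec<String> {
        let evicted: Vec<String> = self
            .marked
            .iter()
            .filter(|(_, idle)| **idle)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &evicted {
            self.marked.remove(name);
        }
        for idle in self.marked.values_mut() {
            *idle = true;
        }
        evicted
    }
}

fn main() {
    let mut tracker = IdleTracker::default();
    tracker.mark_all(&["serde src", "old-dep src", "rand src"]);
    tracker.touch("serde src"); // used since the last pass, so it survives
    println!("evicting: {:?}", tracker.sweep()); // old-dep src and rand src (order may vary)
}
```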


Docker has automatic GC of its own with configurable policies: https://docs.docker.com/build/cache/garbage-collection/