rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo

gc: Determine max-size use cases, and consider design changes #13062

Open ehuss opened 11 months ago

ehuss commented 11 months ago

The current implementation from #12634 has support for manual cleaning of cache data based on size. This is just a preliminary implementation, with the intent to gather better use cases to understand how size-based cleaning should work.

Some considerations for changes:

polarathene commented 8 months ago
  • should it maybe have higher precedence to delete src first (since it can be recreated locally)

Yes.

Advice for CI caching is already to avoid caching registry/src/, and it's also usually where most of the weight is between the two.

What should be included in size calculations (src, cache, git db, git co, indexes)?

It may be better to have fewer options than I'm currently seeing in the nightly CLI support. Rather than explicit settings for each, the user could have an option to customize a list of accepted values to group the size/time calculation on. Priority could also be derived from the order there? 🤷‍♂️

Not sure how many users would need separate granular controls for different time/size limits.
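
To sketch the grouped-list idea (the names here are made up for illustration, not existing cargo config or flags), the order of the list could double as the eviction priority:

```rust
// Hypothetical sketch: derive eviction priority from a user-supplied,
// ordered list of cache components. None of these names are real cargo
// config keys; they only illustrate the idea.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
enum CacheComponent {
    Src,   // registry/src/   (re-creatable from cache/)
    Cache, // registry/cache/ (downloaded .crate files)
    GitCo, // git/checkouts/
    GitDb, // git/db/
    Index, // registry/index/
}

fn parse_components(list: &str) -> Vec<CacheComponent> {
    list.split(',')
        .filter_map(|s| match s.trim() {
            "src" => Some(CacheComponent::Src),
            "cache" => Some(CacheComponent::Cache),
            "git-co" => Some(CacheComponent::GitCo),
            "git-db" => Some(CacheComponent::GitDb),
            "index" => Some(CacheComponent::Index),
            _ => None,
        })
        .collect()
}

fn main() {
    // Earlier in the list = counted in the size budget and deleted first.
    let order = parse_components("src, git-co, cache, git-db, index");
    for (priority, component) in order.iter().enumerate() {
        println!("priority {priority}: {component:?}");
    }
}
```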

What are the use cases for cleaning based on size?

For my personal use, I have been on a system with only about 30GB of disk space available for many months, so it's a constant juggle of what to delete as that gets used up and hunting down what is eating the space. I tend to use Docker as a way to more easily control where that size accumulates so it can be easily discarded.

For CI, 90 days is quite a bit of time for cache to linger. GitHub Actions IIRC has a 10GB size limit and retains an uploaded cache entry for 7 days (possibly longer if it is still being used, I forget). With each change to the cache uploading a new entry, the size accumulates over time. It can be better to keep relevant cache around for as long as possible, but not to wastefully keep extra cache that contributes to that shared 10GB of cache storage and affects upload/download time.

You could bring down the min age, and that'd be fine if it resets the age when the relevant cache is used. I don't know if CI has atime set; on my personal systems I usually have noatime. I'm not quite sure how the time feature works either; I'd have thought it was mtime-based, but that doesn't seem to be the case.
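
A quick probe like the one below shows whether reads actually bump atime on a given filesystem (on noatime/relatime mounts they often won't), which is why I'd be wary of relying on filesystem access times for "last use":

```rust
// Small probe: create a file, read it, and compare accessed vs. modified
// timestamps. On noatime/relatime mounts the accessed time often won't move
// after a plain read.
use std::fs;
use std::io::Read;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("atime-probe");
    fs::write(&path, b"probe")?;
    let before = fs::metadata(&path)?.accessed()?;

    sleep(Duration::from_secs(2));
    let mut buf = String::new();
    fs::File::open(&path)?.read_to_string(&mut buf)?;

    let after = fs::metadata(&path)?.accessed()?;
    println!("atime moved after read: {}", after > before);
    println!("mtime: {:?}", fs::metadata(&path)?.modified()?);
    Ok(())
}
```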


Observations:

Presumably an entire crate in registry/src/ would be removed from the cache, not a portion of its contents? I'm not sure why listing the full contents per crate makes sense as the first --verbose level; it's a bit much when you only want a little more information.

What should be the priorities for what it cleans first?

If you have information about how relevant a crate is to keep, based on metrics you're tracking in the DB, that should work for removing the least useful crates. Perhaps for registry/src/ the ratio of uncompressed size to the corresponding registry/cache/ entry might be a useful metric? Alternatively, when a crate has multiple versions, removing the older ones may make the most sense?

If the user is invoking a max age or max size limit, you could also weight decisions based on that. For CI I'd rather clear out registry/src/ and maintain an age + size limit; if the size threshold is hit, dropping whatever data is stale by time first is probably the way to go.
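
As a toy sketch of that kind of weighting (all the types and weights here are invented for illustration, not anything cargo actually tracks):

```rust
// Toy eviction ordering for clean-by-size: prefer deleting entries that are
// old, large, and cheap to recreate (e.g. registry/src/ extractions that can
// be re-extracted from registry/cache/). All types and weights are invented.
struct CacheEntry {
    name: String,
    size_bytes: u64,
    days_since_last_use: u32,
    recreatable_locally: bool, // e.g. a src/ extraction vs. a .crate download
}

fn eviction_score(e: &CacheEntry) -> f64 {
    let age = e.days_since_last_use as f64;
    let size_mb = e.size_bytes as f64 / (1024.0 * 1024.0);
    let recreate_bonus = if e.recreatable_locally { 4.0 } else { 1.0 };
    // Higher score = delete sooner.
    age * size_mb * recreate_bonus
}

fn main() {
    let mut entries = vec![
        CacheEntry { name: "src/serde-1.0.190".into(), size_bytes: 3_000_000, days_since_last_use: 20, recreatable_locally: true },
        CacheEntry { name: "cache/serde-1.0.190.crate".into(), size_bytes: 500_000, days_since_last_use: 20, recreatable_locally: false },
        CacheEntry { name: "src/old-dep-0.1.0".into(), size_bytes: 1_000_000, days_since_last_use: 80, recreatable_locally: true },
    ];
    entries.sort_by(|a, b| eviction_score(b).total_cmp(&eviction_score(a)));
    for e in &entries {
        println!("{:<30} score {:.1}", e.name, eviction_score(e));
    }
}
```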

Should --max-download-size include git caches?

If you look at the docker init command with its Rust-generated Dockerfile, it presently has two cache mounts, one for the git db and another for the registry. Earthly's lib/rust function is more granular IIRC with its cache mounts, and runs cargo-sweep with a max age of 4 days on target/.

I think it makes sense to include the git cache as well, especially now that the registry index is sparse by default? I'm not too familiar with it beyond using git for some dependencies instead of a published release, but there's enough overlap that it makes sense to me to bundle it into that setting.

Should automatic gc support size-based cleaning?

That'd be nice. For personal systems, I often run low on disk space multiple times through the year especially with some issues when using WSL2.

It's often workload-specific, and I don't always notice when something eats up another 10-20GB in a short time, which can be problematic. That's not specifically due to cargo's cache size, but it can help avoid the consequence (if the Windows host runs out of disk space, WSL2 becomes unresponsive and requires a reboot AFAIK; the process won't terminate or recover once you've freed disk space).

You could also reference the systemd journal, which has a similar size limit for triggering a clean (around 4GB by default IIRC). I don't know if this would be suitable as a default; maybe it should be opt-in config, or be aware of available disk space relative to disk size (Docker Desktop can be misleading here, with WSL2 mounts that report a much larger disk size than is actually available, and that'd carry over into Docker IIRC).
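
A rough sketch of a journald-style cap, where the effective limit is the smaller of a configured maximum and a fraction of the disk (the numbers and the disk-size helper are placeholders, not cargo behavior):

```rust
// Sketch of a journald-style cap: the effective limit is the smaller of an
// explicit maximum and a fraction of the filesystem size. The disk-size
// query is stubbed out; a real implementation would use statvfs (or a crate
// such as fs2/sysinfo) to fill it in.
const GIB: u64 = 1024 * 1024 * 1024;

/// Placeholder: total size of the filesystem holding CARGO_HOME.
/// (Assumed value for illustration only.)
fn filesystem_total_bytes() -> u64 {
    256 * GIB
}

fn effective_limit(configured_max: u64) -> u64 {
    // journald defaults to roughly min(4 GiB, 10% of the filesystem);
    // the same shape could work for an opt-in cache cap.
    let ten_percent_of_disk = filesystem_total_bytes() / 10;
    configured_max.min(ten_percent_of_disk)
}

fn main() {
    let limit = effective_limit(4 * GIB);
    println!("auto-clean would trigger above {} GiB", limit / GIB);
}
```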

This may be tricky to implement, since you don't want to delete anything that is being used by the current build, and could potentially cause cache thrashing.

On a personal system, I'm only really concerned about it when I'm at risk of underestimating how much disk will get used (WSL2 uses up disk space based on memory usage for the disk cache buffer too).

You could perhaps defer the cleanup, or on a Linux system it could be run via a systemd timer/cron task when the system appears idle. That sort of thing doesn't technically need official support either, if an external tool/command can get the cache size information and invoke cargo gc.
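
Something like the watchdog below could run from a systemd timer or cron; it assumes the nightly `cargo clean gc` subcommand and the `--max-download-size` flag mentioned above, and the exact spelling of both may well change:

```rust
// External "watchdog" sketch: sum up the cache size and, if it crosses a
// threshold, shell out to the manual gc command. The invocation below is an
// assumption based on the current nightly work in this thread.
use std::fs;
use std::path::Path;
use std::process::Command;

// Naive apparent-size sum (see the du discussion below for its limits).
fn dir_size(path: &Path) -> u64 {
    let mut total = 0;
    if let Ok(entries) = fs::read_dir(path) {
        for entry in entries.flatten() {
            if let Ok(meta) = entry.metadata() {
                if meta.is_dir() {
                    total += dir_size(&entry.path());
                } else {
                    total += meta.len();
                }
            }
        }
    }
    total
}

fn main() {
    let cargo_home = std::env::var("CARGO_HOME")
        .unwrap_or_else(|_| format!("{}/.cargo", std::env::var("HOME").unwrap_or_default()));
    let size = dir_size(Path::new(&cargo_home));
    const THRESHOLD: u64 = 5 * 1024 * 1024 * 1024; // 5 GiB, arbitrary

    if size > THRESHOLD {
        // -Zgc gates the unstable behavior on current nightlies (assumption).
        let status = Command::new("cargo")
            .args(["clean", "gc", "-Zgc", "--max-download-size=1GB"])
            .status();
        println!("cache at {size} bytes, gc exit: {status:?}");
    } else {
        println!("cache at {size} bytes, under threshold");
    }
}
```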

The current du implementation is primitive, and doesn't know about block sizes, and thus vastly undercounts the disk usage of small files.

You may also have reflinks, which are similar to hardlinks but CoW-capable. While the better accuracy would be nice, I wouldn't consider it a blocker for what is otherwise a valuable gc command.
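
For reference, on Unix the allocated size is already available from file metadata (st_blocks is always in 512-byte units), so a more accurate du is mostly a matter of summing that instead of the apparent length. A minimal sketch, which still doesn't detect hardlink/reflink sharing:

```rust
// Unix-only sketch: compare apparent size with allocated size. blocks() is
// the st_blocks count in 512-byte units regardless of the filesystem block
// size, so blocks() * 512 reflects what a file really occupies, including
// the rounding-up of small files that a naive len() sum misses.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).unwrap_or_else(|| ".".into());
    let mut apparent = 0u64;
    let mut allocated = 0u64;

    for entry in fs::read_dir(&path)? {
        let meta = entry?.metadata()?;
        if meta.is_file() {
            apparent += meta.len();
            allocated += meta.blocks() * 512;
        }
    }
    println!("apparent: {apparent} bytes, allocated: {allocated} bytes");
    Ok(())
}
```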

I read something about access-time monitoring (for one of the third-party tools, I think). I have a concern about relying on that, as noatime isn't that uncommon for Linux filesystems AFAIK?


Reference: Docker Prune

Docker offers something similar with:

- `docker image prune`
- `docker container prune`
- `docker system prune`
- etc.

`docker system` is the broader scope vs only pruning disposable data from a specific source. There's also the `prune --all` option to be more aggressive, along with a flexible `--filter` query arg to target by age or other criteria.
epage commented 8 months ago

For CI, 90 days is quite a bit of time for cache to linger. GitHub Actions IIRC has a 10GB size limit and retains an uploaded cache entry for 7 days (possibly longer if it is still being used, I forget). With each change to the cache uploading a new entry, the size accumulates over time. It can be better to keep relevant cache around for as long as possible, but not to wastefully keep extra cache that contributes to that shared 10GB of cache storage and affects upload/download time.

I feel like the best way to handle CI is being able to somehow specify "clear everything that wasn't used within this CI job". So long as we have a way to formulate that query, size mostly doesn't matter.
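
A sketch of that query, under the assumption that last-use timestamps are tracked somewhere (the tracking structure below is invented for illustration): snapshot a timestamp when the job starts, then evict anything whose recorded last use predates it.

```rust
// "Clear everything not used within this CI job": compare each tracked
// entry's last-use time against the job start time. The Tracked struct is
// a stand-in for whatever cargo's gc tracking actually records.
use std::time::{Duration, SystemTime};

struct Tracked {
    name: &'static str,
    last_use: SystemTime,
}

fn main() {
    let job_start = SystemTime::now() - Duration::from_secs(60 * 60); // job began an hour ago
    let entries = vec![
        // used by this build:
        Tracked { name: "serde-1.0.190.crate", last_use: SystemTime::now() },
        // untouched for a month:
        Tracked { name: "old-dep-0.1.0.crate", last_use: job_start - Duration::from_secs(86_400 * 30) },
    ];

    for e in &entries {
        let keep = e.last_use >= job_start;
        println!("{}: {}", e.name, if keep { "keep" } else { "evict" });
    }
}
```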

polarathene commented 8 months ago

clear everything that wasn't used within this CI job

Potentially, but look at the Earthly blog on this topic and how they describe a remote CI runner that doesn't upload/download remote cache for supporting multiple CI build jobs.

You could still use a remote cache that multiple runners could share if it's not too large but has enough common dependencies too?

Perhaps an example of that would be the docker init Rust template, which provides a Dockerfile that uses three cache mounts. Two are related to the CARGO_HOME cache (git/ and registry/), while the third is for target/ (keyed by target and project). That would share common cache across multiple builds in the same CI job, but via containers.

Since it's during a concurrent docker image build, it's a bit difficult to access the cache mount afterwards. I suppose one could force an image build afterwards for cleanup to access the cache mount, might be a little tricky/awkward though? 🤷‍♂️

If using a matrix build to split a job across separate runners, I guess that while they could all pull the same remote cache item, they can't easily upload one with the different writes unique to each build, like they could in the concurrent build job.


Given the above, maybe a time-based eviction policy is still a valid approach, just one that tracks the time at which stale cache should be evicted. Browser caches have quite a few options to manage this on the client side: a resource gets an etag and a cache policy like stale-while-revalidate. Perhaps cargo could do something similar and not clear a cache item early, but put it in a pending-removal state, so long as nothing else uses the cache item before the lock file is released? That might be relevant to the rustc concern, not sure?

On Linux, for memory compression there is also ZRAM. It has a feature with a backing store on disk that stale pages can be moved to. It just takes a command that marks all the pages it currently holds compressed, and nothing happens until the second run: anything still marked for removal is removed, while anything that has been used since has had that marker discarded by that point. After the marked content is dropped, the unmarked content is all marked again and the process repeats.


Would that work?

That way you can have a low cache expiry while still keeping actively used items?
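
A minimal sketch of that two-pass "mark idle, evict if still idle" flow (types invented for illustration):

```rust
// Two-pass sweep modelled on the zram writeback flow described above:
// pass 1 marks everything currently present as idle, any use in between
// clears the mark, pass 2 evicts whatever is still marked and re-marks
// the survivors for the next round.
use std::collections::HashMap;

#[derive(Default)]
struct IdleTracker {
    marked: HashMap<String, bool>, // entry name -> still marked as idle
}

impl IdleTracker {
    // Pass 1 (and after every sweep): mark everything as idle.
    fn mark_all(&mut self, entries: &[&str]) {
        for e in entries {
            self.marked.insert(e.to_string(), true);
        }
    }

    // Any use between passes clears the idle mark.
    fn touch(&mut self, entry: &str) {
        self.marked.insert(entry.to_string(), false);
    }

    // Pass 2: whatever is still marked gets evicted; the rest is re-marked.
    fn sweep(&mut self) -> Vec<String> {
        let evicted: Vec<String> = self
            .marked
            .iter()
            .filter(|(_, idle)| **idle)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &evicted {
            self.marked.remove(name);
        }
        for idle in self.marked.values_mut() {
            *idle = true;
        }
        evicted
    }
}

fn main() {
    let mut tracker = IdleTracker::default();
    tracker.mark_all(&["serde src", "old-dep src", "rand src"]);
    tracker.touch("serde src"); // used since the last pass, so it survives
    println!("evicting: {:?}", tracker.sweep()); // old-dep src and rand src (order may vary)
}
```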


Docker has automatic GC of its own with configurable policies: https://docs.docker.com/build/cache/garbage-collection/