brycenichols opened this issue 1 year ago
Unfortunately, a lot of the work that goes into `dune cache trim` is actually just calculating the size. You can see this by running `dune cache size`. Therefore, in order to enforce an upper limit on the size, this slow work would have to be done anyway.
Rather than adding slow size checks to `dune build` itself, this might be better served by a dedicated shared cache server. You can imagine that once we have a server handling a distributed cache, we could let it manage the size better than we currently do.
cc @rgrinberg
I was curious and took a look at what ccache does. From what I read (https://ccache.dev/manual/4.3.html#_cache_size_management), they maintain counters for size and number of cached files for each of 16 subdirectories in the cache. The stated reason for multiple files is performance and concurrency.
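To make the ccache scheme concrete, here is a small sketch of per-subdirectory size accounting: each of 16 subdirectories keeps a counter file with its total size, so the overall size is the sum of 16 small reads instead of a full tree walk. The layout and file names below are illustrative, not ccache's actual on-disk format.

```shell
# Illustrative ccache-style accounting: 16 subdirectories (0-f), each
# with a counter file recording its total size in bytes.
CACHE_DIR=$(mktemp -d)
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  mkdir -p "$CACHE_DIR/$d"
  echo 0 > "$CACHE_DIR/$d/size.counter"
done

# Storing an entry updates the subdirectory's counter alongside it.
printf 'cached-object' > "$CACHE_DIR/a/entry"
wc -c < "$CACHE_DIR/a/entry" | tr -d ' ' > "$CACHE_DIR/a/size.counter"

# The total size is then the sum of 16 counters, read without
# stat-ing every file in the cache.
total=0
for d in "$CACHE_DIR"/*/; do
  total=$((total + $(cat "$d/size.counter")))
done
echo "total bytes: $total"   # prints "total bytes: 13"
rm -rf "$CACHE_DIR"
```

Keeping the counters per subdirectory rather than in one global file is what gives the concurrency benefit: writers touching different subdirectories never contend on the same counter.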
cc @snowleopard any opinion on this?
We've recently implemented some eager cache trimming in Jenga, which is not exactly what is being asked here but is along the same lines: making trimming happen during builds rather than on a schedule.
Personally, I'd welcome some work in this direction, though ideally it would happen after our internal migration from Jenga to Dune is complete (maybe in a few months), as I expect any changes around caching to be pretty disruptive right now.
I started having a look at this. There are definitely design decisions to make, because the naive solution of calling `dune cache trim` at the end of the build isn't going to cut it.
@emillon On Unix we have `du -sh`, which should do the job nicely. There is a Windows equivalent I can elaborate on if you'd like. Both would give a good estimate of the size of the cache.
If the cache limit is set, we could try running these commands when dune exits (not sure about watch mode). Dune would then do a rough comparison and tell the user that they should run `dune cache trim` (no options).
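A minimal sketch of such a post-build check might look like the following. The cache location, the limit, and the use of a temporary directory as a stand-in are all assumptions for illustration; the real check would read the configured cache root and limit.

```shell
# Hypothetical post-build size check: compare du's estimate against a
# limit and print a hint rather than trimming inline.
CACHE_DIR=$(mktemp -d)                     # stand-in for the real cache root
printf '%102400s' '' > "$CACHE_DIR/blob"   # ~100 KB dummy cache entry

LIMIT_KB=50                                # illustrative limit; the real one would be configurable
used_kb=$(du -sk "$CACHE_DIR" | cut -f1)
if [ "$used_kb" -gt "$LIMIT_KB" ]; then
  echo "cache is ~${used_kb}KB, over the ${LIMIT_KB}KB limit; consider running: dune cache trim"
fi
rm -rf "$CACHE_DIR"
```

Printing a hint instead of trimming keeps the expensive work out of the build itself, which matches the concern raised above about `dune cache trim` being slow.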
We would then make `dune cache trim` do something clever, like trimming the cache down to 75% of the max limit.
Sorry, for that part I meant that there might be FS-specific operations that give estimates without calling `stat` on the whole hierarchy (which is what `du -sh` does). Think `pg_total_relation_size` vs. `select count(*)` in PostgreSQL.
We could cache the `stat` calls that `du` does, which would be a good speedup between runs. Windows also has search indexing, which we could use there.
The right fix is probably to keep track of the size of the cache as we're writing to it. That's an invasive change that we shouldn't undertake at the moment.
In the meantime, I would suggest that we implement eager cache trimming and see how far that gets us.
@pmwhite do you think you could import eager cache trimming?
Yeah, once it is implemented, I can try importing it.
Unless I'm misremembering, I think it's already implemented.
Yes I was about to ask for clarification about what is meant by "eager cache trimming".
We implemented the "eager cache trimming" feature in Jenga internally. When Jenga runs an action, it deletes the previous versions of the action's targets from the cache, if they are unused in other workspaces. It works pretty well in practice, especially if you keep tweaking a test over and over (in which case you often end up with dozens of old versions of the test-runner binary in the cache).
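The behavior described above can be sketched as follows. The `<rule-id>/<content-hash>` layout and the `store` helper are invented for illustration (this is not Jenga's or Dune's actual cache scheme), and the "unused in other workspaces" check is omitted for brevity.

```shell
# Illustrative eager trimming: storing a new version of a rule's target
# first deletes the previously cached versions of that same target.
CACHE_DIR=$(mktemp -d)

store() {  # store <rule-id> <content>
  rule_dir="$CACHE_DIR/$1"
  mkdir -p "$rule_dir"
  # Eagerly delete older cached versions of this rule's target
  # before storing the new one.
  rm -f "$rule_dir"/*
  hash=$(printf '%s' "$2" | cksum | cut -d' ' -f1)
  printf '%s' "$2" > "$rule_dir/$hash"
}

# Re-running the same rule three times leaves only the latest version,
# so tweaking a test over and over cannot pile up stale binaries.
store test_runner "binary v1"
store test_runner "binary v2"
store test_runner "binary v3"
n=$(ls "$CACHE_DIR/test_runner" | wc -l | tr -d ' ')
echo "versions kept: $n"   # prints "versions kept: 1"
rm -rf "$CACHE_DIR"
```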
We plan to implement "eager cache trimming" in Dune too and upstream it in the next month or so.
Desired Behavior
We would like Dune to have a config option for a maximum cache size. The limit would be maintained as builds proceed, so that the user need not worry about how quickly the cache grows or maintain out-of-band processes to trim it periodically.
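As a purely hypothetical illustration of what this could look like in `~/.config/dune/config` (the `cache-max-size` field is invented for this request and is not an existing Dune option):

```
(lang dune 3.0)
(cache enabled)
; hypothetical field, not currently supported by Dune:
(cache-max-size 50GB)
```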
Motivating use case
At Ahrefs, we have a pool of persistent build hosts that handle many build jobs in parallel, taking advantage of a persistent git monorepo clone and of cached build files re-used across build job invocations. We run a buildkite agent on every build host. As well as re-using the state of the environment after some setup steps, we benefit from using the dune cache. The problem is that the cache grows without bound unless it is trimmed periodically. While we could add a build step before or after each build job to run the trim operation, this would eliminate much of the speed benefit of using the cache: trimming to 50GB can take a couple of minutes, and that time just adds to the total time to complete the build job.
To deal with the issue of an ever-expanding cache without introducing extra time-consuming steps in our pipeline, we currently schedule a dedicated trim pipeline to run on all known agents. The issue with this is that there's no way to know the appropriate schedule for these out-of-band processes, and they end up periodically blocking availability of the agents, particularly when load is high and they are most needed. Furthermore, we have found the step of querying and/or maintaining the list of agents to be brittle. Buildkite, like similar build-scheduling tools, is designed to hand work to any one available agent, not to run something across all of them. Agents may be added, disconnected, disabled, or re-enabled at any time, so it doesn't make sense to run any given process across the whole pool.
This brings us to this request. The very-nice-to-have (and in the spirit of the original design of the dune cache) would be for trimming to happen as needed, in real time, so that the user is guaranteed some upper limit on the cache size without having to run the trim process, and incur its cost, every time.