brycenichols opened this issue 1 year ago
Unfortunately, a lot of the work that goes into `dune cache trim` is actually just calculating the size. You can see this by running `dune cache size`. Therefore, in order to enforce an upper limit on the size, this slow work would have to be done anyway.
Rather than adding slow size checks to `dune build` itself, this might be better served by a dedicated shared cache server. You can imagine that once we have a server handling a distributed cache, we could let it manage the size better than we currently do.
cc @rgrinberg
I was curious and took a look at what ccache does. From what I read (https://ccache.dev/manual/4.3.html#_cache_size_management), they maintain counters for size and number of cached files for each of 16 subdirectories in the cache. The stated reason for multiple files is performance and concurrency.
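To make the ccache scheme concrete, here is a small sketch of per-subdirectory size accounting: each of 16 subdirectories keeps a counter file with its total size, so the overall size is the sum of 16 small reads instead of a full tree walk. The layout and file names below are illustrative, not ccache's actual on-disk format.

```shell
# Illustrative ccache-style accounting: 16 subdirectories (0-f), each
# with a counter file recording its total size in bytes.
CACHE_DIR=$(mktemp -d)
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  mkdir -p "$CACHE_DIR/$d"
  echo 0 > "$CACHE_DIR/$d/size.counter"
done

# Storing an entry updates the subdirectory's counter alongside it.
printf 'cached-object' > "$CACHE_DIR/a/entry"
wc -c < "$CACHE_DIR/a/entry" | tr -d ' ' > "$CACHE_DIR/a/size.counter"

# The total size is then the sum of 16 counters, read without
# stat-ing every file in the cache.
total=0
for d in "$CACHE_DIR"/*/; do
  total=$((total + $(cat "$d/size.counter")))
done
echo "total bytes: $total"   # prints "total bytes: 13"
rm -rf "$CACHE_DIR"
```

Keeping the counters per subdirectory rather than in one global file is what gives the concurrency benefit: writers touching different subdirectories never contend on the same counter.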
cc @snowleopard any opinion on this?
We've recently implemented some eager cache trimming in Jenga, which is not exactly what is being asked here but is along the same lines: making trimming happen during builds rather than on a schedule.
Personally, I'd welcome some work in this direction, though ideally it would happen after our internal migration from Jenga to Dune is complete (maybe in a few months), as I expect any changes around caching to be pretty disruptive right now.
I started having a look at this. There are definitely design decisions to make, because the naive solution of calling `dune cache trim` at the end of the build isn't going to cut it.
@emillon On Unix we have `du -sh`, which should do the job nicely. There is a Windows equivalent I can elaborate on if you'd like. Both would give a good estimate of the size of the cache.
If the cache limit is set, we could try running these commands when dune exits (not sure about watch mode). Dune would then do a rough comparison and tell the user that they should run `dune cache trim` (no options).
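A minimal sketch of such a post-build check might look like the following. The cache location, the limit, and the use of a temporary directory as a stand-in are all assumptions for illustration; the real check would read the configured cache root and limit.

```shell
# Hypothetical post-build size check: compare du's estimate against a
# limit and print a hint rather than trimming inline.
CACHE_DIR=$(mktemp -d)                     # stand-in for the real cache root
printf '%102400s' '' > "$CACHE_DIR/blob"   # ~100 KB dummy cache entry

LIMIT_KB=50                                # illustrative limit; the real one would be configurable
used_kb=$(du -sk "$CACHE_DIR" | cut -f1)
if [ "$used_kb" -gt "$LIMIT_KB" ]; then
  echo "cache is ~${used_kb}KB, over the ${LIMIT_KB}KB limit; consider running: dune cache trim"
fi
rm -rf "$CACHE_DIR"
```

Printing a hint instead of trimming keeps the expensive work out of the build itself, which matches the concern raised above about `dune cache trim` being slow.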
We would then make `dune cache trim` do something clever, like trimming the cache down to 75% of the max limit.
Sorry, for that part I meant that there might be FS-specific operations that give estimates without calling `stat` on the whole hierarchy (which is what `du -sh` does). Think `pg_total_relation_size` vs. `select count(*)` in PostgreSQL.
We could cache the `stat` calls that `du` does, which would be a good speedup between runs. Windows also has search indexing, which we could use there.
The right fix is probably to keep track of the size of the cache as we're writing to it. That's an invasive change that we shouldn't undertake at the moment.
In the meantime, I would suggest that we implement eager cache trimming and see how far that gets us.
@pmwhite do you think you could import eager cache trimming?
Yeah, once it is implemented, I can try importing it.
Unless I'm misremembering, I think it's already implemented.
Yes I was about to ask for clarification about what is meant by "eager cache trimming".
We implemented the "eager cache trimming" feature in Jenga internally. When Jenga runs an action, it deletes the previous versions of the action's targets from the cache, if they are unused in other workspaces. It works pretty well in practice, especially if you keep tweaking a test over and over (in which case you often end up with dozens of old versions of the test-runner binary in the cache).
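The behavior described above can be sketched as follows. The `<rule-id>/<content-hash>` layout and the `store` helper are invented for illustration (this is not Jenga's or Dune's actual cache scheme), and the "unused in other workspaces" check is omitted for brevity.

```shell
# Illustrative eager trimming: storing a new version of a rule's target
# first deletes the previously cached versions of that same target.
CACHE_DIR=$(mktemp -d)

store() {  # store <rule-id> <content>
  rule_dir="$CACHE_DIR/$1"
  mkdir -p "$rule_dir"
  # Eagerly delete older cached versions of this rule's target
  # before storing the new one.
  rm -f "$rule_dir"/*
  hash=$(printf '%s' "$2" | cksum | cut -d' ' -f1)
  printf '%s' "$2" > "$rule_dir/$hash"
}

# Re-running the same rule three times leaves only the latest version,
# so tweaking a test over and over cannot pile up stale binaries.
store test_runner "binary v1"
store test_runner "binary v2"
store test_runner "binary v3"
n=$(ls "$CACHE_DIR/test_runner" | wc -l | tr -d ' ')
echo "versions kept: $n"   # prints "versions kept: 1"
rm -rf "$CACHE_DIR"
```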
We plan to implement "eager cache trimming" in Dune too and upstream it in the next month or so.
Desired Behavior
We would like Dune to have a config option for a maximum cache size. The limit would be maintained as builds proceed, so that the user need not worry about how quickly the cache grows or maintain out-of-band processes to trim it periodically.
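As a purely hypothetical illustration of what this could look like in `~/.config/dune/config` (the `cache-max-size` field is invented for this request and is not an existing Dune option):

```
(lang dune 3.0)
(cache enabled)
; hypothetical field, not currently supported by Dune:
(cache-max-size 50GB)
```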
Motivating use case
At Ahrefs, we have a pool of persistent build hosts that handle many build jobs in parallel, taking advantage of a persistent git monorepo clone and of cached build files re-used across build job invocations. We run a buildkite agent on every build host. As well as re-using the state of the environment after some setup steps, we benefit from using the dune cache. The problem is that the cache grows without bound unless it is trimmed periodically. While we could add a build step before or after each build job to run the trim operation, this would eliminate much of the speed benefit of using the cache: trimming to 50GB can take a couple of minutes, and that time just adds to the total time to complete the build job.
To deal with the issue of an ever-expanding cache without introducing extra time-consuming steps in our pipeline, we currently schedule a dedicated trim pipeline to run on all known agents. The issue with this is that there's no way to know the appropriate schedule for these out-of-band processes, and they end up periodically blocking availability of the agents, particularly when load is high and they are most needed. Furthermore, we have found the step of querying and/or maintaining the list of agents to be brittle. Buildkite, like similar build-scheduling tools, is designed to hand work to any one available agent, not to run something across all of them. Agents may be added, disconnected, disabled, or re-enabled at any time, so it doesn't make sense to run any given process across the whole pool.
This brings us to this request. The very-nice-to-have (and in the spirit of the original design of the dune cache) would be for trimming to happen as needed, in real time, so that the user is guaranteed some upper limit on the cache size without having to run the trim process, and incur its cost, every time.