Doing some portion of #11330 might be a prerequisite for this: if we backtracked to retry a `Process` node (for example), the Runner might need to execute differently than it did the first time: i.e., it might need to actually run the `Process` in the case of a `remote_cache::Runner`.

EDIT: And in fact, determining what would happen differently on the second run of an invalidated `Process` is probably a blocker for treating this issue as a solution to the problem, rather than #11330.
The minor caveat that I had added in my previous comment is actually fairly substantial: I've expanded the section of the ticket description below the line with a big TODO. #11330 is probably not a prerequisite for this one, but I'm confident that it is much smaller, and so a good idea to do first.
It would also be good to justify implementing backtracking by confirming that it reduces the total number of bytes downloaded by clients.
> It would also be good to justify implementing backtracking by confirming that it reduces the total number of bytes downloaded by clients.
Based on production experience, this aspect is unlikely to be a significant benefit, since in most cases users see relatively little actually being downloaded. The larger benefit will be reducing the total number of requests, which are concurrency-bounded and introduce round-trip time.

In particular, it looks like the concurrency limit on access to the `Store` does actually impact total runtime for fully cached use cases, because the eager fetch of cache entries ends up taking a non-trivial amount of time.
To gauge the benefit of this change, I set `remote_cache_eager_fetch=false` in pantsbuild/pants CI and in another private repository.

With `eager_fetch=False`:

- each cache hit needed only a single `load_bytes_with` call (to load the `Tree` for the result), which meant ~20% and ~40% fewer calls overall (but this is low for the reason below)
- in the pantsbuild/pants repo, we downloaded 80% fewer bytes from the remote (~84MB vs ~14MB), but in the private repo, we ended up downloading more data, which was strange (see below).

Investigating the strange number of bytes downloaded in the private repo, I determined that the reason for the increased byte count is that we don't dedupe `load_bytes_with` calls (as we did for uploads in #12087). I've opened #15524 about that.
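For illustration, here is a minimal "single-flight" sketch of the kind of deduplication #15524 asks for: concurrent `load_bytes_with` calls for the same digest share one underlying fetch. The `Store`, `Digest`, and `remote_fetch` names are simplified stand-ins for this sketch, not the engine's actual types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

type Digest = u64; // stand-in for a (fingerprint, size_bytes) pair
type Bytes = Vec<u8>;

#[derive(Default)]
struct Store {
    // One shared cell per digest; all concurrent callers clone the same Arc.
    in_flight: Mutex<HashMap<Digest, Arc<OnceLock<Bytes>>>>,
}

impl Store {
    fn load_bytes_with(&self, digest: Digest) -> Bytes {
        let cell = {
            let mut in_flight = self.in_flight.lock().unwrap();
            in_flight.entry(digest).or_default().clone()
        };
        // Only the first caller runs the fetch; the rest block on the cell.
        cell.get_or_init(|| remote_fetch(digest)).clone()
    }
}

fn remote_fetch(digest: Digest) -> Bytes {
    // Stand-in for a network round trip to the remote store.
    digest.to_le_bytes().to_vec()
}
```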
Lazily fetching bytes means that we won't fetch the outputs of processes, but we still need to fetch the inputs of processes in order to speculate... and in cases where lots of processes depend on a large input, that could mean lots of redundant fetching (whereas with eager fetching, all downstream processes wait for the producing process to complete and download its output). It also means more total calls to `load_bytes_with`: the number of unique `load_bytes_with` calls with `eager_fetch=False` is ~50% lower.

EDIT: Oh, and perhaps most importantly: pantsbuild/pants CI completed 15% faster with `eager_fetch=False`.
For the APIs in `interface.rs` which are used in `@goal_rule`s, we will actually need to propagate `MissingDigest` through Python code, and re-catch it at the `@rule` boundary. That should be possible, but I didn't immediately see how to add a field to a custom exception type with `pyo3`.
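As an illustration of the pyo3 wrinkle: pyo3's `create_exception!` macro can define the Python-visible exception type, but it doesn't give the exception structured fields, so one possible workaround (an assumption of this sketch, not necessarily what was implemented) is to encode the digest into the exception message and parse it back out at the `@rule` boundary:

```rust
use pyo3::create_exception;
use pyo3::exceptions::PyException;
use pyo3::prelude::*;

// A Python-visible exception type that Rust intrinsics can raise.
create_exception!(native_engine, MissingDigest, PyException);

// Hypothetical helper: since attaching a structured field is awkward, encode
// the digest's fingerprint and size into the message itself.
fn missing_digest_err(fingerprint: &str, size_bytes: usize) -> PyErr {
    MissingDigest::new_err(format!("{fingerprint}:{size_bytes}"))
}
```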
Ok, the full stack of PRs for backtracking is now posted.

While #15524 will be important for performance, it doesn't affect the safety of `eager_fetch=False`, and it should be cherry-pickable.
When a `Digest` is missing from the `Store`, we don't currently have a way to backtrack far enough to re-compute it.

Backtracking the appropriate amount is challenging to do via exception handling -- which `@rule`s don't currently have, but which is still useful to discuss. No particular `@rule` has enough information on its stack to know the true source of a `Digest` (which might be deeply nested below one of its `await Get`s or arguments), and even if it did determine which dependency was the source of the bad `Digest`, the `@rule` would need an API to invalidate the relevant subgraph.

Instead, we should lean further into the fact that the `NodeOutput` type can report the `Digest`s that it computed, and trigger `Graph::invalidate_from_roots` for the `Node`(s) that produced the missing `Digest`. This will naturally dirty the nodes that depend on those `Digest`s, causing them to be canceled and re-run.

From an implementation perspective, some alternatives (a sketch of the first follows the list):
- Add error handling in the `Node` wrapping code or around calls to Intrinsics that would consume a `struct MissingDigests(Vec<Digest>)` variant of `Failure`, produced inside the relevant intrinsics. The error handling would call `Graph::invalidate_from_roots` to invalidate all `Node`s with matching digests, and then return `Failure::Invalidated` to kill itself.
- Extend `Graph` to natively support invalidating arbitrary other `Node`s when a `Node` fails. This might look like adding a bit more structure to `NodeError` and/or `NodeContext` to decide which errors should trigger additional logic to match and invalidate `Node`s.
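As a concrete illustration of the first alternative, a minimal sketch; `Digest`, `Failure`, `Graph`, and `Node` here are simplified stand-ins for the engine's real types, and the wiring is hypothetical:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Digest(u64); // stand-in for a (fingerprint, size_bytes) pair

enum Failure {
    Invalidated,
    MissingDigests(Vec<Digest>),
}

struct Node {
    // In the real engine, a NodeOutput can report the Digests it computed.
    output_digests: Vec<Digest>,
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    // Stand-in for Graph::invalidate_from_roots: dirtying a Node transitively
    // dirties its dependents, causing them to be canceled and re-run.
    fn invalidate_from_roots<F: Fn(&Node) -> bool>(&self, predicate: F) {
        for node in &self.nodes {
            if predicate(node) {
                // ... mark `node` and its transitive dependents dirty ...
            }
        }
    }
}

// Error handling in the Node-wrapping code: consume MissingDigests, invalidate
// the producing Nodes, and return Invalidated to kill the current attempt.
fn handle_failure(graph: &Graph, failure: Failure) -> Failure {
    match failure {
        Failure::MissingDigests(missing) => {
            graph.invalidate_from_roots(|node| {
                node.output_digests.iter().any(|d| missing.contains(d))
            });
            Failure::Invalidated
        }
        other => other,
    }
}
```

The key property is that the failing attempt doesn't try to recover in place: it invalidates the producers and kills itself, letting the graph's normal dirtying and re-running machinery do the rest.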
TODO: A blocker for this issue being a solution to the "fall back to local execution" problem is determining what will happen differently on the second run of a `Process`, and how that different behavior will be triggered. Simply re-running the original `Node` again is likely to result in the exact same output: for example, if the `Digest` was produced by a cache lookup, re-looking up in the cache will probably produce the same `Digest` again.

So when invalidating, we'd also need to decide how to affect the second run of the `Process`. A few potential options there (a sketch of the first follows the list):

- Record `Process` attempt histories (or even just an "attempt count") somewhere (on the `Session`, or in the `CommandRunner` itself?) most likely, and let `CommandRunner`s interact with the history to decide whether to skip caches, etc, if need be. For example, a cache `CommandRunner` might skip the cache for a second attempt.
- Combine the "stacking of `CommandRunner`s" with the history, meaning that what happens in `CommandRunner::run` would depend both on how they had been stacked/composed, and on the history.
- Move the static stacking of `CommandRunner`s that we currently do during construction into a method that would run per-attempt to run a `Process`. While that method would be complicated, it would be less complicated than thinking about both the interactions of lots of nested runners and an attempt history.
- Re-stack the `CommandRunner`s when we encounter a missing `Digest` to "drop back to local only".