Closed stuhood closed 2 years ago
This showed up in profiles again recently. There is no Node
created for snapshot/digest merge operations (they're not memoized), so when profiling workunits, the time spent here will be blamed to whichever @rule
is requesting the work. So pants.core.util_rules.source_files.determine_source_files
shows up warmer in profiles due to its usage of MergeDigests
.
It also occurs to me that this is potentially related to #13087 and #13492: the Directory
s that we avoid persisting would likely be similar in structure to those recursive dataclasses (although I think that we'll still want to do this on the Rust side).
This showed up again in profiles recently, so I'm going to try tackling it this week. Design sketch below.
A new DigestTree
structure will be added, with Arc
s for structure sharing. It will be very similar in structure to the protobuf Directory
entry, except with Arc
'd references to child nodes. All operations which currently manipulate or create Digest
s by persisting intermediate nodes to the Store
will be updated to instead operate on DigestTree
s. Each DigestTree
entry will eagerly compute its own Digest
, but will not persist it to disk.
The Digest
struct will gain an optional DigestTree
reference which contains its contents. If a Digest
has a DigestTree
reference, then that Digest
might not be persisted to the Store
. If the Digest
does not hold a DigestTree
, then that Digest
must have been persisted to the Store
(either locally or remotely). The field thus acts likes a cache in some cases, but in other cases is an indication that the tree must first be persisted (or loaded) before the Digest
may be operated on.
This intentionally does not change the Python API (yet): we'll continue to have separate Snapshot
and Digest
types, since the tree being optional allows for lazily loading it from a remote, for example. But Snapshot
and DigestEntries
can each begin to contain a DigestTree
rather than a list of PathStats
, which should improve memory usage.
cc @tdyas, @Eric-Arellano, @illicitonion, @jsirois for feedback.
Slight adjustment to the above. We had begun to differentiate "file digests" from "directory digests" in the Python API, but had not yet begun doing that on the Rust side. Rather than modifying hashing::Digest
(which is used interchangeably for files and directories) to add the optional field, I've taken the opportunity to add a DirectoryDigest
type with the optional DigestTree
.
The first of probably three or four PRs is now posted as #14654. The next one or two PRs will port a few of the Digest
/Directory
manipulating operations to consuming a DigestTrie
(MergeDigests
, DigestSubset
, etc).
Ok, the stack of PRs is now:
Benchmarking on #14677 shows that the performance without IO contention is about equal to main
. Given that the changes also significantly simplify all of the snapshot-manipulation implementations, I think we're ready to start landing them.
Preliminary micro-benchmarks of pants.core.util_rules.source_files.determine_source_files
with #14723 show a 56% improvement. Memory usage according to top
drops by ~10%. Will work to stabilize and land #14723 tomorrow.
Ok, https://github.com/pantsbuild/pants/pull/14758 will likely be the last one here for a while. The last two-ish methods (subset
, ensure_remote_has_recursive
) have diminishing returns for performance purposes... and subset
might be a bit of a bear. Let's leave the ticket open until those can be resolved, but I'll unassign to pick up some other 2.11.x
work.
DirectoryDigest::todo_as_digest
, it doesn't block closing this issue.
Exploring profiles of larger runs showed significant amounts of time spent in
intrinsics::merge_digests_request_to_digest
andintrinsics::digest_to_snapshot
(13% and 5% respectively of the total runtime), mostly in database access.After briefly exploring moving to storing
Tree
s rather thanDirectory
s (which would be less space efficient, but more time efficient), it occurred to me that many of theDirectory
s that we persist will never be used in places that actually need persistence (sent to the network, included as the output (not input) of a cache entry, etc), and will instead just be manipulated in memory.This suggests that in the medium term we could use a more efficient in-memory-only structure to represent
Directory
s, and only digest them and commit them to disk when we need to use them in a relevant position (while Files would of course stay digested and on disk). To get to this future, it's possible that a first incremental step would be to introduce a memory-onlyStore
forDirectory
s, and introduce an explicit "persist this and get me aDigest
" step. But: it will be important to avoid increasing the complexity of the code too much... it seems possible that some codepaths would become significantly simpler when implemented on a memory-only structure, and that that would justify using custom non-protobuf types.This relates to Bazel's
depsets
, in that the structure underlying a DAG of in-memoryDirectory
s could be a generic, native "nested set". That native nested set structure could also be used for usecases like #13087 and #13492.