pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.31k stars 635 forks source link

Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

Closed stuhood closed 2 years ago

stuhood commented 3 years ago

Exploring profiles of larger runs showed significant amounts of time spent in intrinsics::merge_digests_request_to_digest and intrinsics::digest_to_snapshot (13% and 5% respectively of the total runtime), mostly in database access.

After briefly exploring moving to storing Trees rather than Directorys (which would be less space efficient, but more time efficient), it occurred to me that many of the Directorys that we persist will never be used in places that actually need persistence (sent to the network, included as the output (not input) of a cache entry, etc), and will instead just be manipulated in memory.

This suggests that in the medium term we could use a more efficient in-memory-only structure to represent Directorys, and only digest them and commit them to disk when we need to use them in a relevant position (while Files would of course stay digested and on disk). To get to this future, it's possible that a first incremental step would be to introduce a memory-only Store for Directorys, and introduce an explicit "persist this and get me a Digest" step. But: it will be important to avoid increasing the complexity of the code too much... it seems possible that some codepaths would become significantly simpler when implemented on a memory-only structure, and that that would justify using custom non-protobuf types.

This relates to Bazel's depsets, in that the structure underlying a DAG of in-memory Directorys could be a generic, native "nested set". That native nested set structure could also be used for usecases like #13087 and #13492.

stuhood commented 2 years ago

This showed up in profiles again recently. There is no Node created for snapshot/digest merge operations (they're not memoized), so when profiling workunits, the time spent here will be blamed to whichever @rule is requesting the work. So pants.core.util_rules.source_files.determine_source_files shows up warmer in profiles due to its usage of MergeDigests.

stuhood commented 2 years ago

It also occurs to me that this is potentially related to #13087 and #13492: the Directorys that we avoid persisting would likely be similar in structure to those recursive dataclasses (although I think that we'll still want to do this on the Rust side).

stuhood commented 2 years ago

This showed up again in profiles recently, so I'm going to try tackling it this week. Design sketch below.

A new DigestTree structure will be added, with Arcs for structure sharing. It will be very similar in structure to the protobuf Directory entry, except with Arc'd references to child nodes. All operations which currently manipulate or create Digests by persisting intermediate nodes to the Store will be updated to instead operate on DigestTrees. Each DigestTree entry will eagerly compute its own Digest, but will not persist it to disk.

The Digest struct will gain an optional DigestTree reference which contains its contents. If a Digest has a DigestTree reference, then that Digest might not be persisted to the Store. If the Digest does not hold a DigestTree, then that Digest must have been persisted to the Store (either locally or remotely). The field thus acts likes a cache in some cases, but in other cases is an indication that the tree must first be persisted (or loaded) before the Digest may be operated on.

This intentionally does not change the Python API (yet): we'll continue to have separate Snapshot and Digest types, since the tree being optional allows for lazily loading it from a remote, for example. But Snapshot and DigestEntries can each begin to contain a DigestTree rather than a list of PathStats, which should improve memory usage.

cc @tdyas, @Eric-Arellano, @illicitonion, @jsirois for feedback.

stuhood commented 2 years ago

Slight adjustment to the above. We had begun to differentiate "file digests" from "directory digests" in the Python API, but had not yet begun doing that on the Rust side. Rather than modifying hashing::Digest (which is used interchangeably for files and directories) to add the optional field, I've taken the opportunity to add a DirectoryDigest type with the optional DigestTree.

stuhood commented 2 years ago

The first of probably three or four PRs is now posted as #14654. The next one or two PRs will port a few of the Digest/Directory manipulating operations to consuming a DigestTrie (MergeDigests, DigestSubset, etc).

stuhood commented 2 years ago

Ok, the stack of PRs is now:

  1. https://github.com/pantsbuild/pants/pull/14654
  2. https://github.com/pantsbuild/pants/pull/14697
  3. https://github.com/pantsbuild/pants/pull/14677
  4. https://github.com/pantsbuild/pants/pull/14749
  5. https://github.com/pantsbuild/pants/pull/14723
  6. https://github.com/pantsbuild/pants/pull/14758

Benchmarking on #14677 shows that the performance without IO contention is about equal to main. Given that the changes also significantly simplify all of the snapshot-manipulation implementations, I think we're ready to start landing them.

stuhood commented 2 years ago

Preliminary micro-benchmarks of pants.core.util_rules.source_files.determine_source_files with #14723 show a 56% improvement. Memory usage according to top drops by ~10%. Will work to stabilize and land #14723 tomorrow.

stuhood commented 2 years ago

Ok, https://github.com/pantsbuild/pants/pull/14758 will likely be the last one here for a while. The last two-ish methods (subset, ensure_remote_has_recursive) have diminishing returns for performance purposes... and subset might be a bit of a bear. Let's leave the ticket open until those can be resolved, but I'll unassign to pick up some other 2.11.x work.

stuhood commented 2 years ago

14889 fixes the last major case here: while there is one other usage of DirectoryDigest::todo_as_digest, it doesn't block closing this issue.