APIs for creating incrementally modified nodes

warpfork commented 2 years ago

We need APIs for creating new nodes which include content from a previous node, along with some additions or subtractions.

The first step is figuring out what pattern we use when placing these APIs. Critical check: the API should possible to implement while using COW (copy-on-write) semantics internally.

There might be more than one form of API matching this general description (depending on if it's mostly additions; mostly subtractions; if it's insertions-with-opinions-on-ordering, etc).

I suspect that, like has been a recurring theme lately, we should both introduce the feature itself via feature detection, and then probably accompany it by helper functions which expose the desired API to end users, and then tries to detect and use the feature, or does fallbacks quietly. (This might turn out a little tricky, though, considering actual objects must be returned for assembly to continue working on, and we don't want that to have to wrap them and end up with more allocations.)

It should be possible to apply an "IPLD Patch" operation (e.g., something very comparable to RFC 6902) onto whatever APIs we come up with. (The library API can focus on single elements, rather than any of the pathing parts of the Patch declarations -- but any of the semantics about e.g. order of insertion should be alignable.)

smrz2001 commented 2 years ago

Thanks for the discussions, @warpfork!

I've captured some initial thoughts here.

smrz2001 commented 2 years ago

I did some more thinking and prototyping, and realized that @warpfork's patch PR provides a much better base to start from than what I'd been thinking. In hindsight, perhaps that should have been the logical sequence of steps anyway: first a patch spec, then an update API to optimize patch application.

The IPLD traversal package seems to already have a lot of the bookkeeping I had in my earlier notes. Incremental updates would then need to focus on optimizing some of the traversal logic/bookkeeping and adding some companion copy-on-write optimizations in a particular Node implementation.

Here are some tweaks off of the patch PR and here's a very crude implementation in go-ipld-adl-hamt.

The latter PR currently uses the patch.Eval method to apply updates. The test I added is panicking due to unimplemented functionality in the HAMT code but I still wanted to put it in there for an idea of I'm trying to do. Next, I'm thinking I could implement something smarter in the HAMT code to apply the patch with more bookkeeping so that Amend can be called multiple times without creating a new Node each time. Alternatively (and I'm leaning towards this), I'll put up a simplified prototype of what I'm thinking by using a new Node implementation instead of messing with HAMT.

At some point, patch.Eval and traversal.FocusedTransform can take into account Nodes with prototypes matching NodePrototypeSupportingAmend. The real copy-on-write optimizations might need to be deferred till the Encode step, which can pull together base Node information and accumulated updates.

I have a strong hunch that there's a path to a one-time copy-on-write amend (for any number of amendments) possible here between patch, traversal, amend, and Encode.

smrz2001 commented 2 years ago

I have a strong hunch that there's a path to a one-time copy-on-write amend (for any number of amendments) possible here between patch, traversal, amend, and Encode.

traversal and patch can continue to be used for generic transformation that copies everything every time.
amend can optimize updates for Nodes within the Node graph being traversed that support amendment (could be none, some, or all of them).
Encode can detect amendment instructions embedded in an amendable Node and copy from the original node where possible.

warpfork commented 2 years ago

Just to describe the space and range of options a bit:

Giving responsibilities to the Encode step towards the end there is possible. However, it would have high implementation cost: it would have to be reimplemented for every codec, and that'd be really gnarly. High volumes of code to produce, maintain, and benchmark. (Being codec agnostic has a high toll for the project as a whole here -- can make a lot of work in areas like this.)

On the upside, shifting responsibilities to the Encode step is (almost certainly?) the only way to reach ultimate zerocopy all the way back to original byte slices (and thus reach absolute maximum fast).

So: I think we should probably design for a pitstop in the middle that gives us a codec agnostic solution, even if slightly less ultimate zerocopy. Otherwise the implementation work volume is just gonna incredible, and I'd worry might land out of practical reach.

If we can figure out yet another feature-detect API that lets us make encoders and nodes that can TRY to do the superfastpath, though, while having a natural fallback to the codec agnostic logic, that'd be 🚀🚀🚀 significant victory.

warpfork commented 2 years ago

I otherwise agree with the strong hunch about the possibilities :)

warpfork commented 2 years ago

I made some comments in the PR exploring using Patch for this -- summary: something in this direction could be very cool, but I wonder if we should make a separate set of types for it just not to get things too tightly coupled.

warpfork commented 2 years ago

One other thought on where this API should attach:

I think it should probably be on the NodeAssembler.

Not the NodeBuilder -- because then amends would only work at the root of some data manipulation, and it's fairly obvious why that would be silly.

Not the NodePrototype either -- for the same reason!

If we're doing a large walk through some data, and trying to build a new tree with updates efficiently, in the middle of that process, we're almost entirely handling NodeAssembler.

(The implementation of some functions in the traversal package may kind of obscure this, because they give you a Node and ask for a Node back, rather than exposing you the assembler. Looking at the internals of traversal.FocusedTransform gives a better view of what I mean. (And it's possible that we should make traversal functions that do expose the assembler more!))

It's probably possible to do this on the NodePrototype as well, considering you can always get one of those when you've got a NodeAssembler... but that really only gives you a way to do a builder, too; which means you'd still end up with new memory and need to use an AssignNode call later to put it into the intended destination tree, which would presumably be another memcopy, etc. So approaching from this way seems rough and hard to make fast.

(I think I've myself speculated that it might feel clean to put some of these amend feature detections on the NodePrototype, previously. I think I'm updating away from that idea, based on the above reasoning. Doublechecks, of course, welcome.)

smrz2001 commented 2 years ago

One other thought on where this API should attach:

I think it should probably be on the NodeAssembler.

If we're doing a large walk through some data, and trying to build a new tree with updates efficiently, in the middle of that process, we're almost entirely handling NodeAssembler.

Excellent points about NodeAssembler (and other interfaces), @warpfork. The relationships between the various IPLD abstractions have been quite hard for me to wrap my head around, but things are starting to get a little less hazy 😅

I'm going to study traversal.FocusedTransform more deeply and come back with questions.

smrz2001 commented 2 years ago

Based on my discussion with @warpfork and his further recommendations, here's a summary of where we are and where to go next. Please correct me if I misquoted anything, Eric, or missed any tasks I can add here. I'm sure I could break these tasks down further, and I probably will as I work through the list.

[ ] ~~@warpfork's patch implementation does wholesale, history-less Node updates but needs further optimization.~~
[ ] ~~Add NodeAssemblerAmending in datamodel to have a new, decoupled amendment interface hierarchy that does not create tight dependencies with existing code/packages.~~
[ ] ~~Apply new interface to patch implementation.~~
[ ] ~~Optimize patch implementation to accumulate updates and internal state (i.e. more "history") for efficient copy-on-write at Build time.~~
[ ] ~~Investigate future possibility of feature-detected application of patch instructions at Encode time.~~
[ ] ~~Investigate future possibility of feature-detected storage of internal state starting at Decode time, and flowing through to patch and Encode.~~

smrz2001 commented 2 years ago

Prototyping a different (and better, I feel) approach in my draft PR. Deprecated the notes above and added fresh ones directly in the PR.

I've kept all changes isolated in the traversal/amend folder for now. I also copied @warpfork's test code from his patch PR, and tweaked it to work with the prototype.

I've also only tested against TestSpecFixtures/adding-a-map-entry to start with so that I could demonstrate my thinking. I'm working on adding more support, fixing bugs, and testing against the other fixtures.

There are 3 main aspects of this approach:

JSON patch operations will transparently use Amendment operations, whose cost is amortized across multiple operations. All JSON Patch modification ops (add/remove/replace/move/copy) are reduced to add/remove Amendment ops.
No intermediate Node objects are constructed when applying amendments, only Amendment instructions embedded in new AmenderNode objects.
AmenderNode provides a "lens" to Encode, which sees a Node with all the modifications on read. No builder/assembler/prototype - just a Node that can be constructed using an existing Node.

This feels closer to the copy-on-write ideal we've been aiming for - Encode pulls from the original Node wherever possible, but overrides the original content wherever amendment instructions are present and applicable.

Currently, only "add" is implemented, and with some assumptions - the Node being added didn't already exist in the source Node, it's being added to the top-level of a map-type Node, etc. but these were all in the interest of getting something out sooner. All of this will improve as I continue fleshing out the logic.

Given that AmenderNode is a "lens", I was also wondering if it could be an ADL with the substrate being the original Node and the synthesized-view being the accumulated result of the instructions applied, but that seemed too much to bite off right now. Not even sure that's a good idea....

smrz2001 commented 2 years ago

Another advantage of accumulating updates this way is that expensive operations like recalculating hashes can be applied directly on the end result of a series of updates instead of on each update.

BigLep commented 2 years ago

Relevant PR: https://github.com/smrz2001/go-ipld-prime/pull/1

BigLep commented 1 year ago

@smrz2001 : just checking in to see if this is something you're planning to drive forward.

smrz2001 commented 1 year ago

@smrz2001 : just checking in to see if this is something you're planning to drive forward.

Hey @BigLep! Yes, I've been waiting on input on my PR for a little bit. After discussion, I re-implemented my code in a different way that's cleaner but, at the same time, more intrusive so I would understand hesitation to put that in without a lot more review. The code from the previous version would have been simpler and standalone but not super generalized.

I can do either but was hoping to have someone chime in on the PR so I can pick a direction and finalize it.

At this point, my preference would be to switch back to the previous approach - the only open question there was around the cleanest way to package the code.

BigLep commented 1 year ago

2023-01-24 maintainer conversation: Thanks @smrz2001 - ack. @rvagg will look and respond.

ipld / go-ipld-prime

APIs for creating incrementally modified nodes #320