warpfork opened 2 years ago
Thanks for the discussions, @warpfork!
I've captured some initial thoughts here.
I did some more thinking and prototyping, and realized that @warpfork's patch PR provides a much better base to start from than what I'd been thinking. In hindsight, perhaps that should have been the logical sequence of steps anyway: first a patch spec, then an update API to optimize patch application.
The IPLD `traversal` package seems to already have a lot of the bookkeeping I had in my earlier notes. Incremental updates would then need to focus on optimizing some of the `traversal` logic/bookkeeping and adding some companion copy-on-write optimizations in a particular `Node` implementation.

Here are some tweaks off of the patch PR and here's a very crude implementation in `go-ipld-adl-hamt`.
The latter PR currently uses the `patch.Eval` method to apply updates. The test I added is panicking due to unimplemented functionality in the HAMT code, but I still wanted to put it in there to give an idea of what I'm trying to do. Next, I'm thinking I could implement something smarter in the HAMT code to apply the patch with more bookkeeping, so that `Amend` can be called multiple times without creating a new `Node` each time. Alternatively (and I'm leaning towards this), I'll put up a simplified prototype of what I'm thinking by using a new `Node` implementation instead of messing with HAMT.
At some point, `patch.Eval` and `traversal.FocusedTransform` can take into account `Node`s with prototypes matching `NodePrototypeSupportingAmend`. The real copy-on-write optimizations might need to be deferred till the `Encode` step, which can pull together base `Node` information and accumulated updates.

I have a strong hunch that there's a path to a one-time copy-on-write amend (for any number of amendments) possible here between `patch`, `traversal`, `amend`, and `Encode`.
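To make the feature-detection idea above concrete, here is a minimal sketch of the pattern. The `NodePrototypeSupportingAmend` name comes from the discussion; everything else (the trimmed stand-in interfaces, `applyUpdate`, the string return values) is hypothetical scaffolding, not go-ipld-prime's actual API.

```go
package main

import "fmt"

// Stand-in for go-ipld-prime's datamodel.Node; the details are
// irrelevant here, only the detection pattern matters.
type Node interface{}

// Stand-in for datamodel.NodePrototype, trimmed to one placeholder method.
type NodePrototype interface {
	NewBuilder() string
}

// Hypothetical optional interface, after the NodePrototypeSupportingAmend
// idea above: prototypes that can start a build from an existing node
// with copy-on-write semantics instead of copying it wholesale.
type NodePrototypeSupportingAmend interface {
	NodePrototype
	AmendingBuilder(base Node) string
}

// applyUpdate feature-detects the amend fast path via a type assertion,
// and quietly falls back to the generic copy-everything path.
func applyUpdate(np NodePrototype, base Node) string {
	if a, ok := np.(NodePrototypeSupportingAmend); ok {
		return a.AmendingBuilder(base) // COW fast path
	}
	return np.NewBuilder() // generic fallback: full copy
}

type plainProto struct{}

func (plainProto) NewBuilder() string { return "full-copy" }

type amendProto struct{}

func (amendProto) NewBuilder() string               { return "full-copy" }
func (amendProto) AmendingBuilder(base Node) string { return "amend" }

func main() {
	fmt.Println(applyUpdate(plainProto{}, nil)) // full-copy
	fmt.Println(applyUpdate(amendProto{}, nil)) // amend
}
```

The point of the sketch is that callers never need to know which path was taken; `patch.Eval` or `traversal.FocusedTransform` could do this detection internally.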
> I have a strong hunch that there's a path to a one-time copy-on-write amend (for any number of amendments) possible here between `patch`, `traversal`, `amend`, and `Encode`.
- `traversal` and `patch` can continue to be used for generic transformation that copies everything every time.
- `amend` can optimize updates for `Node`s within the `Node` graph being traversed that support amendment (could be none, some, or all of them).
- `Encode` can detect amendment instructions embedded in an amendable `Node` and copy from the original node where possible.

Just to describe the space and range of options a bit:
Giving responsibilities to the Encode step towards the end there is possible. However, it would have high implementation cost: it would have to be reimplemented for every codec, and that'd be really gnarly. High volumes of code to produce, maintain, and benchmark. (Being codec agnostic has a high toll for the project as a whole here -- can make a lot of work in areas like this.)
On the upside, shifting responsibilities to the Encode step is (almost certainly?) the only way to reach ultimate zerocopy all the way back to original byte slices (and thus reach absolute maximum fast).
So: I think we should probably design for a pitstop in the middle that gives us a codec-agnostic solution, even if slightly less ultimate zerocopy. Otherwise the implementation work volume is just gonna be incredible, and I'd worry it might land out of practical reach.
If we can figure out yet another feature-detect API that lets us make encoders and nodes that can TRY to do the superfastpath, though, while having a natural fallback to the codec agnostic logic, that'd be 🚀🚀🚀 significant victory.
I otherwise agree with the strong hunch about the possibilities :)
I made some comments in the PR exploring using Patch for this -- summary: something in this direction could be very cool, but I wonder if we should make a separate set of types for it just not to get things too tightly coupled.
One other thought on where this API should attach: I think it should probably be on the `NodeAssembler`.

Not the `NodeBuilder` -- because then amends would only work at the root of some data manipulation, and it's fairly obvious why that would be silly.

Not the `NodePrototype` either -- for the same reason!

If we're doing a large walk through some data, and trying to build a new tree with updates efficiently, in the middle of that process, we're almost entirely handling `NodeAssembler`.
(The implementation of some functions in the `traversal` package may kind of obscure this, because they give you a `Node` and ask for a `Node` back, rather than exposing the assembler to you. Looking at the internals of `traversal.FocusedTransform` gives a better view of what I mean. (And it's possible that we should make `traversal` functions that do expose the assembler more!))
It's probably possible to do this on the `NodePrototype` as well, considering you can always get one of those when you've got a `NodeAssembler`... but that really only gives you a way to get a builder, too; which means you'd still end up with new memory and need an `AssignNode` call later to put it into the intended destination tree, which would presumably be another memcopy, etc. So approaching it from this direction seems rough and hard to make fast.

(I think I've previously speculated myself that it might feel clean to put some of these amend feature detections on the `NodePrototype`. I'm updating away from that idea, based on the above reasoning. Doublechecks, of course, welcome.)
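A rough sketch of what attaching amend support to the assembler could look like, contrasted with the generic `AssignNode` copy path. All names besides `NodeAssembler`/`AssignNode` are hypothetical, and the stand-in interfaces are trimmed far below what go-ipld-prime actually defines.

```go
package main

import "fmt"

// Minimal stand-in; the real datamodel.Node interface is much larger.
type Node interface{}

// Stand-in for datamodel.NodeAssembler, keeping only the method
// relevant to this comparison.
type NodeAssembler interface {
	AssignNode(n Node) error // generic path: copies n into the destination
}

// Hypothetical optional interface on the *assembler* (not the builder
// or prototype), so that mid-walk a transform can hand over a base node
// plus future deltas without building fresh memory and memcopying it in.
type NodeAssemblerSupportingAmend interface {
	NodeAssembler
	AmendNode(base Node) error // COW: reference base, record changes
}

// placeNode is what a transform step might do at its focus point.
func placeNode(na NodeAssembler, base Node) string {
	if a, ok := na.(NodeAssemblerSupportingAmend); ok {
		_ = a.AmendNode(base)
		return "amended in place"
	}
	_ = na.AssignNode(base) // fallback: another copy
	return "copied"
}

type copyingAssembler struct{}

func (copyingAssembler) AssignNode(Node) error { return nil }

type amendingAssembler struct{ copyingAssembler }

func (amendingAssembler) AmendNode(Node) error { return nil }

func main() {
	fmt.Println(placeNode(copyingAssembler{}, nil))  // copied
	fmt.Println(placeNode(amendingAssembler{}, nil)) // amended in place
}
```

Because the detection happens on the assembler the walk is already holding, the fast path composes at any depth of the tree, which is the crux of the argument above.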
> One other thought on where this API should attach: I think it should probably be on the `NodeAssembler`. If we're doing a large walk through some data, and trying to build a new tree with updates efficiently, in the middle of that process, we're almost entirely handling `NodeAssembler`.
Excellent points about `NodeAssembler` (and other interfaces), @warpfork. The relationships between the various IPLD abstractions have been quite hard for me to wrap my head around, but things are starting to get a little less hazy 😅

I'm going to study `traversal.FocusedTransform` more deeply and come back with questions.
Based on my discussion with @warpfork and his further recommendations, here's a summary of where we are and where to go next. Please correct me if I misquoted anything, Eric, or missed any tasks I can add here. I'm sure I could break these tasks down further, and I probably will as I work through the list.
- `patch` implementation does wholesale, history-less `Node` updates but needs further optimization.
- `NodeAssemblerAmending` in `datamodel` to have a new, decoupled amendment interface hierarchy that does not create tight dependencies with existing code/packages.
- `patch` implementation.
- Implementation to accumulate updates and internal state (i.e. more "history") for efficient copy-on-write at `Build` time.
- `patch` instructions at `Encode` time.
- `Decode` time, and flowing through to `patch` and `Encode`.

Prototyping a different (and better, I feel) approach in my draft PR. Deprecated the notes above and added fresh ones directly in the PR.
I've kept all changes isolated in the `traversal/amend` folder for now. I also copied @warpfork's test code from his `patch` PR, and tweaked it to work with the prototype.

I've also only tested against `TestSpecFixtures/adding-a-map-entry` to start with, so that I could demonstrate my thinking. I'm working on adding more support, fixing bugs, and testing against the other fixtures.
There are 3 main aspects of this approach:

- `Amendment` operations, whose cost is amortized across multiple operations. All JSON Patch modification ops (add/remove/replace/move/copy) are reduced to add/remove `Amendment` ops.
- No `Node` objects are constructed when applying amendments, only `Amendment` instructions embedded in new `AmenderNode` objects.
- `AmenderNode` provides a "lens" to `Encode`, which sees a `Node` with all the modifications on read. No builder/assembler/prototype - just a `Node` that can be constructed using an existing `Node`.

This feels closer to the copy-on-write ideal we've been aiming for - `Encode` pulls from the original `Node` wherever possible, but overrides the original content wherever amendment instructions are present and applicable.
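The "lens" idea can be sketched roughly as a read-time overlay. The `AmenderNode` name comes from the PR; the tiny map-node stand-in, the `adds`/`removes` bookkeeping, and the method names below are all illustrative only, not the PR's actual code.

```go
package main

import "fmt"

// Minimal map-like stand-in for datamodel.Node.
type Node interface {
	Lookup(key string) (string, bool)
}

type mapNode map[string]string

func (m mapNode) Lookup(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// amenderNode is a read-time overlay ("lens"): the base node is never
// copied; adds and removes are recorded and consulted on every read,
// falling through to the original node otherwise.
type amenderNode struct {
	base    Node
	adds    map[string]string
	removes map[string]bool
}

func amend(base Node) *amenderNode {
	return &amenderNode{base: base, adds: map[string]string{}, removes: map[string]bool{}}
}

func (a *amenderNode) Add(key, val string) { a.adds[key] = val; delete(a.removes, key) }
func (a *amenderNode) Remove(key string)   { a.removes[key] = true; delete(a.adds, key) }

func (a *amenderNode) Lookup(key string) (string, bool) {
	if a.removes[key] {
		return "", false
	}
	if v, ok := a.adds[key]; ok {
		return v, ok
	}
	return a.base.Lookup(key) // copy-on-write: read through to the original
}

func main() {
	base := mapNode{"a": "1", "b": "2"}
	am := amend(base)
	am.Add("c", "3")
	am.Remove("b")
	fmt.Println(am.Lookup("a")) // base value survives: 1 true
	fmt.Println(am.Lookup("c")) // amendment visible: 3 true
}
```

An `Encode` walking such a node sees the amended view without any intermediate `Node` construction, which is the copy-on-write property described above.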
Currently, only "add" is implemented, and with some assumptions - the `Node` being added didn't already exist in the source `Node`, it's being added to the top level of a map-type `Node`, etc. - but these were all in the interest of getting something out sooner. All of this will improve as I continue fleshing out the logic.
Given that `AmenderNode` is a "lens", I was also wondering if it could be an ADL, with the substrate being the original `Node` and the synthesized view being the accumulated result of the instructions applied, but that seemed too much to bite off right now. Not even sure that's a good idea....
Another advantage of accumulating updates this way is that expensive operations like recalculating hashes can be applied directly on the end result of a series of updates instead of on each update.
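The hash-amortization point can be illustrated with a toy model. The `hashState` helper below is a stand-in for the expensive re-encode + hash step of a real IPLD node (it is not how IPLD hashing actually works); the comparison just shows that batching updates yields the same final digest with one hash computation instead of one per update.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// hashState hashes a canonicalized (sorted-key) view of the map,
// standing in for the expensive encode-and-hash step.
func hashState(m map[string]string) string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + m[k] + ";")
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	updates := []struct{ k, v string }{{"a", "1"}, {"b", "2"}, {"a", "3"}}

	// Eager: rehash after every update -- len(updates) hash computations.
	eagerMap := map[string]string{}
	var eager string
	for _, u := range updates {
		eagerMap[u.k] = u.v
		eager = hashState(eagerMap)
	}

	// Amortized: accumulate all updates first, hash once at the end.
	lazyMap := map[string]string{}
	for _, u := range updates {
		lazyMap[u.k] = u.v
	}
	lazy := hashState(lazyMap)

	fmt.Println(eager == lazy) // true: same final hash, 1 computation vs 3
}
```

For a DAG, the savings compound: every intermediate hash of an updated node would also force rehashing its ancestors, so deferring to the end of a batch avoids a whole cascade per update.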
Relevant PR: https://github.com/smrz2001/go-ipld-prime/pull/1
@smrz2001 : just checking in to see if this is something you're planning to drive forward.
> @smrz2001 : just checking in to see if this is something you're planning to drive forward.
Hey @BigLep! Yes, I've been waiting on input on my PR for a little bit. After discussion, I re-implemented my code in a different way that's cleaner but, at the same time, more intrusive, so I would understand any hesitation to put it in without a lot more review. The code from the previous version would have been simpler and standalone, but not super generalized.
I can do either but was hoping to have someone chime in on the PR so I can pick a direction and finalize it.
At this point, my preference would be to switch back to the previous approach - the only open question there was around the cleanest way to package the code.
2023-01-24 maintainer conversation: Thanks @smrz2001 - ack. @rvagg will look and respond.
We need APIs for creating new nodes which include content from a previous node, along with some additions or subtractions.
The first step is figuring out what pattern we use when placing these APIs. Critical check: the API should be possible to implement while using COW (copy-on-write) semantics internally.
There might be more than one form of API matching this general description (depending on if it's mostly additions; mostly subtractions; if it's insertions-with-opinions-on-ordering, etc).
I suspect that, as has been a recurring theme lately, we should both introduce the feature itself via feature detection, and probably accompany it with helper functions which expose the desired API to end users, try to detect and use the feature, and quietly fall back otherwise. (This might turn out a little tricky, though, considering actual objects must be returned for assembly to continue working on, and we don't want to have to wrap them and end up with more allocations.)
It should be possible to apply an "IPLD Patch" operation (e.g., something very comparable to RFC 6902) onto whatever APIs we come up with. (The library API can focus on single elements, rather than any of the pathing parts of the Patch declarations -- but any of the semantics about e.g. order of insertion should be alignable.)
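For reference on the single-element semantics mentioned above, here is a sketch of RFC 6902's per-member rules applied to one map entry, ignoring pathing entirely. The `applyOp` helper and its error strings are hypothetical; only the semantics ("add" upserts, "remove" and "replace" require the member to exist) come from the RFC.

```go
package main

import "fmt"

// applyOp applies a single RFC-6902-style operation to one map entry.
// Pathing is deliberately out of scope, matching the note above that
// the library API can focus on single elements.
func applyOp(m map[string]string, op, key, val string) error {
	switch op {
	case "add":
		m[key] = val // RFC 6902: add to an existing member replaces it
	case "remove":
		if _, ok := m[key]; !ok {
			return fmt.Errorf("remove: %q does not exist", key)
		}
		delete(m, key)
	case "replace":
		if _, ok := m[key]; !ok {
			return fmt.Errorf("replace: %q does not exist", key)
		}
		m[key] = val
	default:
		return fmt.Errorf("unknown op %q", op)
	}
	return nil
}

func main() {
	m := map[string]string{"a": "1"}
	fmt.Println(applyOp(m, "add", "b", "2"))           // <nil>
	fmt.Println(applyOp(m, "replace", "a", "9"))       // <nil>
	fmt.Println(applyOp(m, "remove", "zz", "") != nil) // true: must exist
}
```

"move" and "copy" then decompose into these primitives (a "move" is a "remove" followed by an "add"), which is why an add/remove core is enough to keep an amend API alignable with full IPLD Patch semantics.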