Open hannahhoward opened 2 years ago
To complete the picture of the challenge we're facing here (for those following on here): what we end up with is 3 different layers of decoding, and in the first two, we have edges that are datamodel.Node
holding a full basicnode
-based structure (when we get to an Any
, bindnode will resort to basicnode
, even using basicnode.Prototype.Any.NewBuilder()
for BeginList()
and BeginMap()
). So graphsync starts off with these extensions as basicnode
structures, which can be quite large, then data-transfer comes along and decodes those into new Go types that also have datamodel.Node
fields hanging off them for vouchers, which have the same problem (I'm not 100% sure but I think it might even double-up on the building of new maps and lists and the rest in the process of AssignNode()
used to perform the conversion of the datamodel.Node
to the concrete extension type). Then go-fil-markets follows up ontop of the extension structure to decode its Any
fields into concrete Go types, finishing the picture.
And, on top of this, do we even reject unknown extensions in graphsync messages? So there could be additional dangling datamodel.Node
structures that aren't even touched? Maybe not a big deal, just messy.
So, when we see memory dumps, we get to see a lot of basicnode
elements that are just hanging around.
Ideally, when the codec/dagcbor
is pushing out nodes into a builder, that builder isn't relying on basicnode
at all, or it's doing so minimally, and that the nodes are landing directly into their final destination Go structs fronted by bindnode
.
a custom node builder for a given go-type
Yeah, this is certainly an option, and maybe it'd get us close, but we'd still end up wiring all of the pieces together very manually at the mid-point somewhere. It'd get very verbose and maybe more error-prone than we really want to take responsibility for that far from the Node
interface. One of the wins we currently have is that we're pushing down all of that responsibility into go-ipld-prime, accepting minimal responsibility at the top layer so we're not reinventing stuff.
Here's a rough sketch of what's in my head for this at the moment:
{ String: Any }
where the map has a single key. This would be a breaking message format in both cases. We can simulate this with a single-element kinded union prior to knowing what the proper union members are:type KeyedUnionPlaceholder union {
Any string
} representation kinded
# for graphsync:
type GraphSyncExtensions [KeyedUnionPlaceholder]
# and for data-transfer's messages:
type TransferRequest struct {
Vouch KeyedUnionPlaceholder
}
TypedPrototype
by a string key, or, only accepting a TypedPrototype
that is a keyed union with a single member (or, again a kinded union with a single string
member).TypedPrototype
for graphsync messages, and then again in data-transfer, where we can pluck out KeyedUnionPlaceholder
from the tree and replace it with a new, runtime-built TypedPrototype
for a keyed union that combines all of the registered extensions or vouchers into a single union.Building a custom TypedPrototype
that does this combining would be straightforward, and doesn't depend on messing with the guts of anything that exists, it could even be done externally.
The tricky pieces are:
bindnode
TypedPrototype
to insert custom logic at a specific point - how do you even specify this? We don't really have a way of identifying a specific point in a schema as it is, we'd want to be able to identify something like GraphSyncRequest#extensions[*]
as our insertion point for our custom delegated TypePrototype
but I don't think we want to burden ourselves with a new DSL for this kind of thing. Perhaps the easy path here is doing it by schema type name - whenever you hit ATypeName
, delegate to this custom builder .. we'd then have to keep track of schema type names in bindnode.interface{}
in these Any
positions, and properly accepting whatever the delegated builder says to put there. This probably isn't as big a problem as (1) is though.In terms of the runtime-constructed keyed union, here's a possible API for something that could do that:
builder := unionbuilder.NewKeyedUnionPrototype()
builder.AddElement("dt/extension1", DataTransferExtension1TypedPrototype)
builder.AddElement("dt/extension2", DataTransferExtension2TypedPrototype)
builder.AddElement("foo", FooTypedPrototype)
node, err := ipld.DecodeUsingPrototype(byts, dagcbor.Decoder, builder.Prototype)
Would make something that can do:
type KeyedUnion union {
Ext1 "dt/extension1"
Ext2 "dt/extension2"
Foo "foo"
} representation keyed
... where Ext1
, Ext2
and Foo
are delegated to their relevant TypedPrototype
implementations, which could be backed by bindnode, or codegen or something else.
The next bit would be taking builder.Prototype
and inserting it to those Any
positions in the bindnode schema. Which could be done with a new option.
bindnode.Wrap(&graphSyncMsg{}, gsMsgProto.Type(), bindnode.CustomAnyPrototype("SomeNamedAnyType", builder.Prototype)
(assuming type SomeNamedAnyType any
works and we can wire up these type names all the way to the position they're used, which might be a stretch since we only really use the schema to check correctness of the Go types at the moment and then discard it!)
I've been thinking through optimization of decoding of Graphsync Extensions & data transfer vouchers & IPLD's type system, as well as how they relate to bindnode & #437
The basic structure of both of Graphsync Extensions and Data Transfer Vouchers is as follows:
The primary difference between Graphsync Extensions & data transfer vouchers is a single graphsync request or response can contain multiple extensions where a data transfer message contains only a single voucher
The key complication for both of these systems is:
The current schema used for Graphsync extensions is:
The abbrievated current schema used for a data transfer message is:
First, let's look at compatibility between IPLD Schemas & Graphsync Extensions. Let's say we had a list of extension names to known extension types. A better schema for graphsync extensions would be something like
Assuming the schema builders provide APIs for building this at run time (not sure if they do), we could theoretically assemble such a type once we had a stable list of extensions known. However, we still have the problem of data from the network, and what to do values we don't recognize.
The optional allows us to figure out what to do with missing fields. It doesn't really say what to do with surplus fields in a representation. the current bindnode code will error -- https://github.com/ipld/go-ipld-prime/blob/master/node/bindnode/node.go#L809
Now let's look at data transfer vouchers. . Let's say we had a list of voucher type identifiers to known voucher types. A better schema for data transfer vouchers would be (cause there is only one present in a message):
This gets us to the same conditions that we had with GraphSync vouchers -- we can properly type the known vouchers but we can't do anything about the network.
There's one major additional problem also: the TransferRequest & TransferResponse in go-data-transfer currently contain these two fields directly in the larger struct, not wrapped in an envelop (and to be clear, the "inline" representation does not match either -- we aren't inlining the voucher fields, we would be inline the whole message based on the voucher type).
We probably will just need to break the message structure to get to schema compatibility.
What about bindnode compatibility?
Let's say we get things to schema compatibility, what about bindnode compatibility.
the current golang structs are as follows:
Bindnode expects struct types for both TypeStruct and TypeUnion. While on the one hand we could theoretically use reflect.StructOf to construct these types at run time, we'd then have to use a blank interface everywhere that struct was embedded to make it assignable.
I think we might at this point be able to extend our converters for structs somehow -- making maybe a conversion with "assignField" / "extractField" or something of that nature.
Long and short, there's a path but are a lot of things that need to happen, including a breaking wire change for data transfer, to allow this kind of fast deserialization with bindnode.
Escape hatch
Perhaps this is the moment where we have to say: maybe these structures are valid IPLD, but they really aren't valid schemas, or at least not valid, in practice, over the wire, as a schema any more specific than the ones we have now.
So perhaps we just need custom node implementations for these message types.
However, one final addition to the conversion system that would be helpful is being able to use a custom node builder for a given go-type. This would allow for example a custom implementation of GraphsyncExtensions without a complete custom implementation for the message format. I think this ultimately where I stand on the best past forward.