I've been thinking through optimization of decoding of Graphsync Extensions & data transfer vouchers & IPLD's type system, as well as how they relate to bindnode & #437

The basic structure of both of Graphsync Extensions and Data Transfer Vouchers is as follows:

a key that identifies the kind of data
a value that contains the data, whose type is determined by the key

The primary difference between Graphsync Extensions & data transfer vouchers is a single graphsync request or response can contain multiple extensions where a data transfer message contains only a single voucher

The key complication for both of these systems is:

the set of known key -> value type conversions are set at run time (by higher level libraries using go-data-transfer or go-graphsync)
there data is transmitted over the network, implying that we may receive types we simply don't know about

The current schema used for Graphsync extensions is:

# Extensions as a name:data map where the data is any arbitrary, valid Node
type GraphSyncExtensions { String : nullable Any }

The abbrievated current schema used for a data transfer message is:

type TypeIdentifier string

# ...

type TransferRequest struct {
# ... other fields ...
    VoucherPtr            nullable Any            (rename "Vouch")
    VoucherTypeIdentifier          TypeIdentifier (rename "VTyp")
}

First, let's look at compatibility between IPLD Schemas & Graphsync Extensions. Let's say we had a list of extension names to known extension types. A better schema for graphsync extensions would be something like

type GraphSyncExtensions struct {
   KnownKind1 optional nullable KnownKind1 (rename "KnownExtensionName1")
   KnownKind2 optional nullable KnownKind2 (rename "KnownExtensionName2")
} representation map

Assuming the schema builders provide APIs for building this at run time (not sure if they do), we could theoretically assemble such a type once we had a stable list of extensions known. However, we still have the problem of data from the network, and what to do values we don't recognize.

The optional allows us to figure out what to do with missing fields. It doesn't really say what to do with surplus fields in a representation. the current bindnode code will error -- https://github.com/ipld/go-ipld-prime/blob/master/node/bindnode/node.go#L809

Now let's look at data transfer vouchers. . Let's say we had a list of voucher type identifiers to known voucher types. A better schema for data transfer vouchers would be (cause there is only one present in a message):


```ipldsch
type Voucher union {
  | KnownVoucherKind1 "KnownVoucherTypeName1"
  | KnownVoucherKind2 "KnownVoucherTypeName2"
} representation envelope {
  discriminantKey "Vouch"
  contentKey "VType"
}

This gets us to the same conditions that we had with GraphSync vouchers -- we can properly type the known vouchers but we can't do anything about the network.

There's one major additional problem also: the TransferRequest & TransferResponse in go-data-transfer currently contain these two fields directly in the larger struct, not wrapped in an envelop (and to be clear, the "inline" representation does not match either -- we aren't inlining the voucher fields, we would be inline the whole message based on the voucher type).

We probably will just need to break the message structure to get to schema compatibility.

What about bindnode compatibility?

Let's say we get things to schema compatibility, what about bindnode compatibility.

the current golang structs are as follows:

// GraphSyncExtensions is a container for representing extension data for
// bindnode, it's converted to a graphsync.ExtensionData list by
// ToExtensionsList()
type GraphSyncExtensions struct {
    Keys   []string
    Values map[string]*datamodel.Node
}|

// data - transfer
// TransferRequest1_1 is a struct for the 1.1 Data Transfer Protocol that fulfills the datatransfer.Request interface.
// its members are exported to be used by cbor-gen
type TransferRequest1_1 struct {
    // ... other fields ...
    VoucherTypeIdentifier datatransfer.TypeIdentifier
    TransferId            uint64
}

Bindnode expects struct types for both TypeStruct and TypeUnion. While on the one hand we could theoretically use reflect.StructOf to construct these types at run time, we'd then have to use a blank interface everywhere that struct was embedded to make it assignable.

I think we might at this point be able to extend our converters for structs somehow -- making maybe a conversion with "assignField" / "extractField" or something of that nature.

Long and short, there's a path but are a lot of things that need to happen, including a breaking wire change for data transfer, to allow this kind of fast deserialization with bindnode.

Escape hatch

Perhaps this is the moment where we have to say: maybe these structures are valid IPLD, but they really aren't valid schemas, or at least not valid, in practice, over the wire, as a schema any more specific than the ones we have now.

So perhaps we just need custom node implementations for these message types.

However, one final addition to the conversion system that would be helpful is being able to use a custom node builder for a given go-type. This would allow for example a custom implementation of GraphsyncExtensions without a complete custom implementation for the message format. I think this ultimately where I stand on the best past forward.

To complete the picture of the challenge we're facing here (for those following on here): what we end up with is 3 different layers of decoding, and in the first two, we have edges that are datamodel.Node holding a full basicnode-based structure (when we get to an Any, bindnode will resort to basicnode, even using basicnode.Prototype.Any.NewBuilder() for BeginList() and BeginMap()). So graphsync starts off with these extensions as basicnode structures, which can be quite large, then data-transfer comes along and decodes those into new Go types that also have datamodel.Node fields hanging off them for vouchers, which have the same problem (I'm not 100% sure but I think it might even double-up on the building of new maps and lists and the rest in the process of AssignNode() used to perform the conversion of the datamodel.Node to the concrete extension type). Then go-fil-markets follows up ontop of the extension structure to decode its Any fields into concrete Go types, finishing the picture.

And, on top of this, do we even reject unknown extensions in graphsync messages? So there could be additional dangling datamodel.Node structures that aren't even touched? Maybe not a big deal, just messy.

So, when we see memory dumps, we get to see a lot of basicnode elements that are just hanging around.

Ideally, when the codec/dagcbor is pushing out nodes into a builder, that builder isn't relying on basicnode at all, or it's doing so minimally, and that the nodes are landing directly into their final destination Go structs fronted by bindnode.

a custom node builder for a given go-type

Yeah, this is certainly an option, and maybe it'd get us close, but we'd still end up wiring all of the pieces together very manually at the mid-point somewhere. It'd get very verbose and maybe more error-prone than we really want to take responsibility for that far from the Node interface. One of the wins we currently have is that we're pushing down all of that responsibility into go-ipld-prime, accepting minimal responsibility at the top layer so we're not reinventing stuff.

Here's a rough sketch of what's in my head for this at the moment:

Unions are meant to help solve a lot of the problems of efficiently selecting between sub-schemas, the ideal is that we'd have a union that bundles everything together and it choose the right path for further decoding the correct piece.
Keyed unions are the obvious choice here I think, they're obvious and simple and you can use them with string identifiers, envelope unions are kind of the same, but have a messy representation that mainly exists to support existing data that looks like that.
We could implement both extensions and vouchers as keyed unions, so { String: Any } where the map has a single key. This would be a breaking message format in both cases. We can simulate this with a single-element kinded union prior to knowing what the proper union members are:

type KeyedUnionPlaceholder union {
  Any string
} representation kinded

# for graphsync:

type GraphSyncExtensions [KeyedUnionPlaceholder]

# and for data-transfer's messages:

type TransferRequest struct {
  Vouch KeyedUnionPlaceholder
}

There would have to be a process of "registering" an extension or voucher as a TypedPrototype by a string key, or, only accepting a TypedPrototype that is a keyed union with a single member (or, again a kinded union with a single string member).
The special-sauce would be in having an API that lets us mess with the parent TypedPrototype for graphsync messages, and then again in data-transfer, where we can pluck out KeyedUnionPlaceholder from the tree and replace it with a new, runtime-built TypedPrototype for a keyed union that combines all of the registered extensions or vouchers into a single union.

Building a custom TypedPrototype that does this combining would be straightforward, and doesn't depend on messing with the guts of anything that exists, it could even be done externally.

The tricky pieces are:

Picking apart the bindnode TypedPrototype to insert custom logic at a specific point - how do you even specify this? We don't really have a way of identifying a specific point in a schema as it is, we'd want to be able to identify something like GraphSyncRequest#extensions[*] as our insertion point for our custom delegated TypePrototype but I don't think we want to burden ourselves with a new DSL for this kind of thing. Perhaps the easy path here is doing it by schema type name - whenever you hit ATypeName, delegate to this custom builder .. we'd then have to keep track of schema type names in bindnode.
Allowing bindnode to hold an interface{} in these Any positions, and properly accepting whatever the delegated builder says to put there. This probably isn't as big a problem as (1) is though.

In terms of the runtime-constructed keyed union, here's a possible API for something that could do that:

builder := unionbuilder.NewKeyedUnionPrototype()
builder.AddElement("dt/extension1", DataTransferExtension1TypedPrototype)
builder.AddElement("dt/extension2", DataTransferExtension2TypedPrototype)
builder.AddElement("foo", FooTypedPrototype)

node, err := ipld.DecodeUsingPrototype(byts, dagcbor.Decoder, builder.Prototype)

Would make something that can do:

type KeyedUnion union {
  Ext1 "dt/extension1"
  Ext2 "dt/extension2"
  Foo "foo"
} representation keyed

... where Ext1, Ext2 and Foo are delegated to their relevant TypedPrototype implementations, which could be backed by bindnode, or codegen or something else.

The next bit would be taking builder.Prototype and inserting it to those Any positions in the bindnode schema. Which could be done with a new option.

bindnode.Wrap(&graphSyncMsg{}, gsMsgProto.Type(), bindnode.CustomAnyPrototype("SomeNamedAnyType", builder.Prototype)

(assuming type SomeNamedAnyType any works and we can wire up these type names all the way to the position they're used, which might be a stretch since we only really use the schema to check correctness of the Go types at the moment and then discard it!)

ipld / go-ipld-prime

Node deserialization of Graphsync Extensions & data transfer vouchers #443

What about bindnode compatibility?

Escape hatch