ipld / go-ipld-prime

Golang interfaces for the IPLD Data Model, with core Codecs included, IPLD Schemas support, and some handy functional transforms tools.
MIT License
132 stars 50 forks source link

Node deserialization of Graphsync Extensions & data transfer vouchers #443

Open hannahhoward opened 2 years ago

hannahhoward commented 2 years ago

I've been thinking through optimization of decoding of Graphsync Extensions & data transfer vouchers & IPLD's type system, as well as how they relate to bindnode & #437

The basic structure of both of Graphsync Extensions and Data Transfer Vouchers is as follows:

The primary difference between Graphsync Extensions & data transfer vouchers is a single graphsync request or response can contain multiple extensions where a data transfer message contains only a single voucher

The key complication for both of these systems is:

The current schema used for Graphsync extensions is:

# Extensions as a name:data map where the data is any arbitrary, valid Node
type GraphSyncExtensions { String : nullable Any }

The abbrievated current schema used for a data transfer message is:

type TypeIdentifier string

# ...

type TransferRequest struct {
# ... other fields ...
    VoucherPtr            nullable Any            (rename "Vouch")
    VoucherTypeIdentifier          TypeIdentifier (rename "VTyp")
}

First, let's look at compatibility between IPLD Schemas & Graphsync Extensions. Let's say we had a list of extension names to known extension types. A better schema for graphsync extensions would be something like

type GraphSyncExtensions struct {
   KnownKind1 optional nullable KnownKind1 (rename "KnownExtensionName1")
   KnownKind2 optional nullable KnownKind2 (rename "KnownExtensionName2")
} representation map

Assuming the schema builders provide APIs for building this at run time (not sure if they do), we could theoretically assemble such a type once we had a stable list of extensions known. However, we still have the problem of data from the network, and what to do values we don't recognize.

The optional allows us to figure out what to do with missing fields. It doesn't really say what to do with surplus fields in a representation. the current bindnode code will error -- https://github.com/ipld/go-ipld-prime/blob/master/node/bindnode/node.go#L809

Now let's look at data transfer vouchers. . Let's say we had a list of voucher type identifiers to known voucher types. A better schema for data transfer vouchers would be (cause there is only one present in a message):


```ipldsch
type Voucher union {
  | KnownVoucherKind1 "KnownVoucherTypeName1"
  | KnownVoucherKind2 "KnownVoucherTypeName2"
} representation envelope {
  discriminantKey "Vouch"
  contentKey "VType"
}

This gets us to the same conditions that we had with GraphSync vouchers -- we can properly type the known vouchers but we can't do anything about the network.

There's one major additional problem also: the TransferRequest & TransferResponse in go-data-transfer currently contain these two fields directly in the larger struct, not wrapped in an envelop (and to be clear, the "inline" representation does not match either -- we aren't inlining the voucher fields, we would be inline the whole message based on the voucher type).

We probably will just need to break the message structure to get to schema compatibility.

What about bindnode compatibility?

Let's say we get things to schema compatibility, what about bindnode compatibility.

the current golang structs are as follows:

// GraphSyncExtensions is a container for representing extension data for
// bindnode, it's converted to a graphsync.ExtensionData list by
// ToExtensionsList()
type GraphSyncExtensions struct {
    Keys   []string
    Values map[string]*datamodel.Node
}|

// data - transfer
// TransferRequest1_1 is a struct for the 1.1 Data Transfer Protocol that fulfills the datatransfer.Request interface.
// its members are exported to be used by cbor-gen
type TransferRequest1_1 struct {
    // ... other fields ...
    VoucherTypeIdentifier datatransfer.TypeIdentifier
    TransferId            uint64
}

Bindnode expects struct types for both TypeStruct and TypeUnion. While on the one hand we could theoretically use reflect.StructOf to construct these types at run time, we'd then have to use a blank interface everywhere that struct was embedded to make it assignable.

I think we might at this point be able to extend our converters for structs somehow -- making maybe a conversion with "assignField" / "extractField" or something of that nature.

Long and short, there's a path but are a lot of things that need to happen, including a breaking wire change for data transfer, to allow this kind of fast deserialization with bindnode.

Escape hatch

Perhaps this is the moment where we have to say: maybe these structures are valid IPLD, but they really aren't valid schemas, or at least not valid, in practice, over the wire, as a schema any more specific than the ones we have now.

So perhaps we just need custom node implementations for these message types.

However, one final addition to the conversion system that would be helpful is being able to use a custom node builder for a given go-type. This would allow for example a custom implementation of GraphsyncExtensions without a complete custom implementation for the message format. I think this ultimately where I stand on the best past forward.

rvagg commented 2 years ago

To complete the picture of the challenge we're facing here (for those following on here): what we end up with is 3 different layers of decoding, and in the first two, we have edges that are datamodel.Node holding a full basicnode-based structure (when we get to an Any, bindnode will resort to basicnode, even using basicnode.Prototype.Any.NewBuilder() for BeginList() and BeginMap()). So graphsync starts off with these extensions as basicnode structures, which can be quite large, then data-transfer comes along and decodes those into new Go types that also have datamodel.Node fields hanging off them for vouchers, which have the same problem (I'm not 100% sure but I think it might even double-up on the building of new maps and lists and the rest in the process of AssignNode() used to perform the conversion of the datamodel.Node to the concrete extension type). Then go-fil-markets follows up ontop of the extension structure to decode its Any fields into concrete Go types, finishing the picture.

And, on top of this, do we even reject unknown extensions in graphsync messages? So there could be additional dangling datamodel.Node structures that aren't even touched? Maybe not a big deal, just messy.

So, when we see memory dumps, we get to see a lot of basicnode elements that are just hanging around.

Ideally, when the codec/dagcbor is pushing out nodes into a builder, that builder isn't relying on basicnode at all, or it's doing so minimally, and that the nodes are landing directly into their final destination Go structs fronted by bindnode.

a custom node builder for a given go-type

Yeah, this is certainly an option, and maybe it'd get us close, but we'd still end up wiring all of the pieces together very manually at the mid-point somewhere. It'd get very verbose and maybe more error-prone than we really want to take responsibility for that far from the Node interface. One of the wins we currently have is that we're pushing down all of that responsibility into go-ipld-prime, accepting minimal responsibility at the top layer so we're not reinventing stuff.

Here's a rough sketch of what's in my head for this at the moment:

type KeyedUnionPlaceholder union {
  Any string
} representation kinded

# for graphsync:

type GraphSyncExtensions [KeyedUnionPlaceholder]

# and for data-transfer's messages:

type TransferRequest struct {
  Vouch KeyedUnionPlaceholder
}

Building a custom TypedPrototype that does this combining would be straightforward, and doesn't depend on messing with the guts of anything that exists, it could even be done externally.

The tricky pieces are:

  1. Picking apart the bindnode TypedPrototype to insert custom logic at a specific point - how do you even specify this? We don't really have a way of identifying a specific point in a schema as it is, we'd want to be able to identify something like GraphSyncRequest#extensions[*] as our insertion point for our custom delegated TypePrototype but I don't think we want to burden ourselves with a new DSL for this kind of thing. Perhaps the easy path here is doing it by schema type name - whenever you hit ATypeName, delegate to this custom builder .. we'd then have to keep track of schema type names in bindnode.
  2. Allowing bindnode to hold an interface{} in these Any positions, and properly accepting whatever the delegated builder says to put there. This probably isn't as big a problem as (1) is though.
rvagg commented 2 years ago

In terms of the runtime-constructed keyed union, here's a possible API for something that could do that:

builder := unionbuilder.NewKeyedUnionPrototype()
builder.AddElement("dt/extension1", DataTransferExtension1TypedPrototype)
builder.AddElement("dt/extension2", DataTransferExtension2TypedPrototype)
builder.AddElement("foo", FooTypedPrototype)

node, err := ipld.DecodeUsingPrototype(byts, dagcbor.Decoder, builder.Prototype)

Would make something that can do:

type KeyedUnion union {
  Ext1 "dt/extension1"
  Ext2 "dt/extension2"
  Foo "foo"
} representation keyed

... where Ext1, Ext2 and Foo are delegated to their relevant TypedPrototype implementations, which could be backed by bindnode, or codegen or something else.

The next bit would be taking builder.Prototype and inserting it to those Any positions in the bindnode schema. Which could be done with a new option.

bindnode.Wrap(&graphSyncMsg{}, gsMsgProto.Type(), bindnode.CustomAnyPrototype("SomeNamedAnyType", builder.Prototype)

(assuming type SomeNamedAnyType any works and we can wire up these type names all the way to the position they're used, which might be a stretch since we only really use the schema to check correctness of the Go types at the moment and then discard it!)