ipfs / notes

IPFS Collaborative Notebook for Research

flexible object mapping for go-ipld #185

Open warpfork opened 7 years ago

warpfork commented 7 years ago

A better object mapping and serialization library in golang would make our lives better, faster, happier.

Let's do it! :tada:

Why?

We could really use better object mapping and serializers around go-ipld. Flexibility and ease of use are key. There are some missing features, and some present-but-overly-constrained features, in the serialization and object marshal/unmarshal libraries in use right now. Let's try again.

Some examples of things that are hard, and shouldn't be:

These things are all possible with existing libraries (well, ok, the third one isn't, and wellll ok the fourth one isn't either), but not graceful. For example, doing unmarshal polymorphism usually requires attaching custom behaviors to the containing type rather than to the one field and type they actually apply to. That creates a lot of incidental complexity, which (in my personal history, anyway) means much (much!) more boilerplate code on tangentially related types, and a much higher likelihood of bugs and code-consistency problems because of the friction.
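To make the polymorphism complaint concrete, here's roughly what the stdlib encoding/json route forces on you today (the Doc/Shape types are invented for illustration): the dispatch logic has to live in an UnmarshalJSON method on the containing type, even though it's really about one field.

package main

import (
	"encoding/json"
	"fmt"
)

// Shape is the polymorphic field's type.
type Shape interface {
	Area() float64
}

type Circle struct{ R float64 }

func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }

// Doc is the containing type. Note that the kind-switching logic below has
// to live here, not on Shape, Circle, or Square -- this is the incidental
// complexity being complained about.
type Doc struct {
	Kind  string
	Shape Shape
}

func (d *Doc) UnmarshalJSON(data []byte) error {
	var probe struct {
		Kind  string
		Shape json.RawMessage
	}
	if err := json.Unmarshal(data, &probe); err != nil {
		return err
	}
	d.Kind = probe.Kind
	switch probe.Kind {
	case "circle":
		var c Circle
		if err := json.Unmarshal(probe.Shape, &c); err != nil {
			return err
		}
		d.Shape = c
	case "square":
		var s Square
		if err := json.Unmarshal(probe.Shape, &s); err != nil {
			return err
		}
		d.Shape = s
	default:
		return fmt.Errorf("unknown kind %q", probe.Kind)
	}
	return nil
}

func main() {
	var d Doc
	if err := json.Unmarshal([]byte(`{"Kind":"circle","Shape":{"R":2}}`), &d); err != nil {
		panic(err)
	}
	fmt.Println(d.Shape.Area()) // 12.56636
}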

Key Design Guidelines:

What really compelling, central choices will make a better system?

Design details (prospective, subject to change):

isn't that neat?

slightly further into the weeds...

Detail: what an Atlas actually looks like

I don't want to get completely committal on this until some more design spikes are done, because this is super important to get right and to make pleasant to use.

That said, here was an early tech draft: https://gist.github.com/heavenlyhash/b0cf495f94cebb4f1de366e86447b8ec (Mind the comment -- this is definitely not what we want to present to the end user; it's far too fragile. But when we use the "autoatlas" feature, this is what that generates, because it's the most efficient info to use when doing reflective field traversal. Our user-facing API will look different, but generate the same thing.)

The major key is to make an atlas declarative -- so it's in some ways similar to declaring a companion struct for your real struct with serialization tags on each field -- but make it a more intelligent API, so we have the option of doing...

You can think of the whole atlas thing as being a way to map in-memory go values to map[string]interface{} and back again; it's separate from any serialized encoding and decoding details entirely. It just also happens to have enough information to make that process significantly more efficient, operate streamingly, nearly zero-copy, etc, etc.
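For concreteness, here's that mapping written out by hand for one invented type. An atlas would declare this once, and the library would derive both directions from it -- and, as noted, skip materializing the intermediate map at all when streaming.

// Hand-rolled version of the transform an atlas declares, for an invented
// example type. No atlas machinery here -- this is just the conceptual
// "go value <-> map[string]interface{}" round trip.
package example

type Dentry struct {
	Name string
	Size int64
}

func dentryToMap(d Dentry) map[string]interface{} {
	return map[string]interface{}{
		"name": d.Name,
		"size": d.Size,
	}
}

func dentryFromMap(m map[string]interface{}) Dentry {
	return Dentry{
		Name: m["name"].(string),
		Size: m["size"].(int64),
	}
}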

Detail: constructing atlases for your whole protocol

To handle marshal and unmarshal of complex objects, which have many types with fields referring to yet more types, we need more than one atlas.

This is pretty simple: we just rack up atlases along with a type thunk indicating what they apply to:

multiatlas := translate.NewMultiAtlas(Binder{
    {AA{}, atlasForAA},                    // hand-written atlas for type AA
    {BB{}, translate.AutoAtlasFor(BB{})},  // reflect-generated atlas for type BB
    [...]
})

We should also expose some magic helper methods to let the user say they want this handled and kept out of their sight -- we can have a VarTokenSource default to looking at each new type and auto-atlas'ing it as it's encountered, for example. (This is pretty much what the stdlib does within encoding/json.)
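A minimal sketch of that lazy auto-atlasing, under the assumption of a placeholder Atlas type (the real atlas would carry the traversal info from the gist above, not just field names):

package translate

import (
	"reflect"
	"sync"
)

// Atlas is a stand-in: imagine it carries the field-traversal info
// described in the earlier draft, not just the field names.
type Atlas struct {
	Fields []string
}

// AutoAtlasFor builds a trivial placeholder atlas by reflecting over
// field names. Assumes v is a struct; the real thing would read struct
// tags, handle embedding, and so on.
func AutoAtlasFor(v interface{}) Atlas {
	t := reflect.TypeOf(v)
	a := Atlas{}
	for i := 0; i < t.NumField(); i++ {
		a.Fields = append(a.Fields, t.Field(i).Name)
	}
	return a
}

// atlasCache auto-atlases each new type the first time it's seen,
// much like encoding/json caches per-type encoders.
type atlasCache struct {
	mu    sync.Mutex
	known map[reflect.Type]Atlas
}

func (c *atlasCache) atlasFor(v interface{}) Atlas {
	t := reflect.TypeOf(v)
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.known == nil {
		c.known = make(map[reflect.Type]Atlas)
	}
	if a, ok := c.known[t]; ok {
		return a
	}
	a := AutoAtlasFor(v)
	c.known[t] = a
	return a
}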

Detail: what if I really need wild custom marshal logic?

Declaring an atlas should provide enough flexibility for the majority of situations. The whole job of the atlas system is to make all common custom behaviors possible, without needing to manually code up MarshalJSON method bodies! However, if you need behaviors like handling a set of fields that's unknown based on type info alone (and seriously, what are you doing there? But ah well), we can handle that too.

(Sidebar: ok, so there's the possibility of constructing the token-emitting and token-consuming functions yourself, but that's almost never what an end-user of this library should do (see the following section for a glimpse of how deep the fiddliness gets, then close your eyes, shake your head, and come back up here). tl;dr you want to define a mapping; you almost never care to re-implement the details of making sure that mapping can accept its inputs in any order. So, instead: we have a more intermediate layer.)

The user can declare an atlas generator, instead of an atlas. This path is a slight performance hit because of the complexity (and the sheer number of mallocs implied), but not unduly so: it's basically the equivalent of declaring a MarshalJSON method that uses an inline anonymous struct{/*...*/}{} declaration when using stdlib encoding/json.
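For reference, this is the stdlib pattern that comparison points at -- a MarshalJSON body building an inline anonymous struct on every call (the Event type here is invented):

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type Event struct {
	Name string
	At   time.Time
}

// MarshalJSON declares an anonymous struct inline, per call -- functional,
// but it costs an extra allocation and a second marshal pass each time,
// roughly the overhead an atlas generator would accept.
func (e Event) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		Name string `json:"name"`
		At   int64  `json:"at_unix"`
	}{
		Name: e.Name,
		At:   e.At.Unix(),
	})
}

func main() {
	out, _ := json.Marshal(Event{Name: "pin", At: time.Unix(1500000000, 0)})
	fmt.Println(string(out)) // {"name":"pin","at_unix":1500000000}
}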

Detail: TokenSource and TokenSink shall be implemented by CPS step funcs.

TokenSource and TokenSink implementations should both lean heavily on continuation-passing style. (I've done a spike already where the token source would walk an object and call the token sink directly as it went; this was simpler to write because it allows freely using the goroutine stack as the parse state stack, but it turned out poorly overall: it's difficult to be flexible with this approach, and I found it very anti-robust, since each new TokenSource would need to correctly wield the TokenSink with no nanny code in-between that could catch obvious missteps.) The json.scanner type in the standard lib is a good example of the CPS approach -- there's a state struct with a step func pointer, and a custom "stack" -- follow that pattern in all components.
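Here's a minimal sketch of that shape, not the real code: a state machine holding a step func pointer plus an explicit stack, advanced one token at a time (the Token type and the states are invented for illustration).

package tokens

import "fmt"

// Token is an illustrative stand-in for whatever token type the real
// TokenSource/TokenSink interfaces end up carrying.
type Token struct {
	Kind  rune // e.g. '{', '}', 's' for string...
	Value interface{}
}

// machine follows the json.scanner shape: a step func pointer for the
// current state, plus an explicit stack instead of the goroutine call stack.
type machine struct {
	step  func(*machine, Token) error
	stack []func(*machine, Token) error
}

func newMachine() *machine {
	return &machine{step: stepExpectMapOpen}
}

// Step feeds one token to the current state. The pump that calls this is
// where the "nanny code" can live, checking errors from both sides.
func (m *machine) Step(t Token) error {
	return m.step(m, t)
}

func stepExpectMapOpen(m *machine, t Token) error {
	if t.Kind != '{' {
		return fmt.Errorf("expected map open, got %q", t.Kind)
	}
	m.stack = append(m.stack, stepDone) // state to resume once this map closes
	m.step = stepExpectKeyOrClose
	return nil
}

func stepExpectKeyOrClose(m *machine, t Token) error {
	if t.Kind == '}' {
		// pop the explicit stack and resume the enclosing state
		m.step = m.stack[len(m.stack)-1]
		m.stack = m.stack[:len(m.stack)-1]
		return nil
	}
	// ... key/value/nesting handling elided ...
	return nil
}

func stepDone(m *machine, t Token) error {
	return fmt.Errorf("unexpected token after end of value")
}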

With that done, the top level setup can look like this:

translator := NewCodecPair(NewJsonDecoder(r), NewCborEncoder(w))

The engine code in NewCodecPair that glues the token source/sink together pumps the step functions of both components, and thus can check to make sure they're both operating correctly, and in case of panics, stack traces are quite clear.
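A rough sketch of what that pump loop could look like, reusing the illustrative Token type from the sketch above (these TokenSource/TokenSink signatures are assumptions, not the actual API):

// Invented step-pumping interfaces; the point is only that the engine sits
// between the two machines and can check both of them on every step.
type TokenSource interface {
	Step() (tok Token, done bool, err error) // yield the next token; done when the value is complete
}

type TokenSink interface {
	Step(tok Token) (done bool, err error) // consume one token; done when a complete value was received
}

func pump(src TokenSource, sink TokenSink) error {
	for {
		tok, srcDone, err := src.Step()
		if err != nil {
			return fmt.Errorf("token source: %v", err)
		}
		sinkDone, err := sink.Step(tok)
		if err != nil {
			return fmt.Errorf("token sink: %v", err)
		}
		// "nanny code": both sides must agree on where the value ends.
		if srcDone != sinkDone {
			return fmt.Errorf("source and sink disagree on end of value")
		}
		if srcDone {
			return nil
		}
	}
}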

putting a :bowtie: on it

This whole document has been a deep dive on how the internals of a good object mapping system should come together. But the happy path is much simpler:

translate.NewCborMarshaller(stdout).MustMarshal(&obj)

Nothing needs to look too fancy, up here.

Inside, we know that's going to create a token source and token sink, bind the pair together, have a default atlas in the VarTokenSource that generates new atlases using the "autoatlas" struct tag reflector and caches them as it goes along and hits new types, and so on. But that's everything the stdlib already does when you marshal json -- and on the outside, we can definitely make this just as simple.

jbenet commented 7 years ago

Hey @heavenlyhash -- i agree with a significant fraction of the above, except that this isn't fully caught up with the formulation of IPLD (with tree and resolve functions, walking them to find values), and with various important constraints:

Other:

Also, i think Atlas is a clever name for "a mapping", though i think you just want "morphism" -- the actual mathematical name of what you mean. That said, i don't think you need it here, save as a shortcut or useful intermediate (hidden) representation.