ipfs / notes

IPFS Collaborative Notebook for Research

flexible object mapping for go-ipld #185

Open warpfork opened 7 years ago

warpfork commented 7 years ago

A better object mapping and serialization library in golang would make our lives better, faster, happier.

Let's do it! :tada:

Why?

We could really use better object mapping and serializers around go-ipld. Flexibility and ease of use are key. There are some missing features, and some present-but-overly-constrained features, in the serialization and object marshal/unmarshal libraries in use right now. Let's try again.

Some examples of things that are hard, and shouldn't be:

These things are all possible with existing libraries (well, ok, the third one isn't, and wellll ok the fourth one isn't either), but not graceful. For example, doing unmarshal polymorphism usually requires attaching custom behaviors to the containing type rather than to the one field and type they actually apply to. That creates a lot of incidental complexity, which (in my personal history, anyway) means much (much!) more boilerplate code on tangentially related types, and a much higher likelihood of bugs and code-consistency problems because of the friction.
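To make the polymorphism complaint concrete, here's roughly what the stdlib encoding/json route forces on you today (the Doc/Shape types are invented for illustration): the dispatch logic has to live in an UnmarshalJSON method on the containing type, even though it's really about one field.

package main

import (
	"encoding/json"
	"fmt"
)

// Shape is the polymorphic field's type.
type Shape interface {
	Area() float64
}

type Circle struct{ R float64 }

func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }

// Doc is the containing type. Note that the kind-switching logic below has
// to live here, not on Shape, Circle, or Square -- this is the incidental
// complexity being complained about.
type Doc struct {
	Kind  string
	Shape Shape
}

func (d *Doc) UnmarshalJSON(data []byte) error {
	var probe struct {
		Kind  string
		Shape json.RawMessage
	}
	if err := json.Unmarshal(data, &probe); err != nil {
		return err
	}
	d.Kind = probe.Kind
	switch probe.Kind {
	case "circle":
		var c Circle
		if err := json.Unmarshal(probe.Shape, &c); err != nil {
			return err
		}
		d.Shape = c
	case "square":
		var s Square
		if err := json.Unmarshal(probe.Shape, &s); err != nil {
			return err
		}
		d.Shape = s
	default:
		return fmt.Errorf("unknown kind %q", probe.Kind)
	}
	return nil
}

func main() {
	var d Doc
	if err := json.Unmarshal([]byte(`{"Kind":"circle","Shape":{"R":2}}`), &d); err != nil {
		panic(err)
	}
	fmt.Println(d.Shape.Area()) // 12.56636
}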

Key Design Guidelines:

What really compelling, central choices will make a better system?

Design details (prospective, subject to change):

isn't that neat?

slightly further into the weeds...

Detail: what an Atlas actually looks like

I don't want to get completely committal on this until some more design spikes are done, because this is super important to get right and to make pleasant to use.

That said, here was an early tech draft: https://gist.github.com/heavenlyhash/b0cf495f94cebb4f1de366e86447b8ec (Mind the comment -- this is definitely not what we want to present to the end user; it's far too fragile. But when we use the "autoatlas" feature, this is what that generates, because it's the most efficient info to use when doing reflective field traversal. Our user-facing API will look different, but generate the same thing.)

The major key is to make an atlas declarative -- so it's in some ways similar to declaring a companion struct for your real struct with serialization tags on each field -- but make it a more intelligent API, so we have the option of doing...

You can think of the whole atlas thing as being a way to map in-memory go values to map[string]interface{} and back again; it's separate from any serialized encoding and decoding details entirely. It just also happens to have enough information to make that process significantly more efficient, operate streamingly, nearly zero-copy, etc, etc.
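For concreteness, here's that mapping written out by hand for one invented type. An atlas would declare this once, and the library would derive both directions from it -- and, as noted, skip materializing the intermediate map at all when streaming.

// Hand-rolled version of the transform an atlas declares, for an invented
// example type. No atlas machinery here -- this is just the conceptual
// "go value <-> map[string]interface{}" round trip.
package example

type Dentry struct {
	Name string
	Size int64
}

func dentryToMap(d Dentry) map[string]interface{} {
	return map[string]interface{}{
		"name": d.Name,
		"size": d.Size,
	}
}

func dentryFromMap(m map[string]interface{}) Dentry {
	return Dentry{
		Name: m["name"].(string),
		Size: m["size"].(int64),
	}
}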

Detail: constructing atlases for your whole protocol

To handle marshal and unmarshal of complex objects, which have many types with fields referring to yet more types, we need more than one atlas.

This is pretty simple: we just rack up atlases along with a type thunk indicating what they apply to:

multiatlas := translate.NewMultiAtlas(Binder{
    {AA{}, atlasForAA},                    // hand-written atlas for type AA
    {BB{}, translate.AutoAtlasFor(BB{})},  // reflect-generated atlas for type BB
    [...]
})

We should also expose some magic helper methods to let the user say they want this handled and kept out of their sight -- we can have a VarTokenSource default to looking at each new type and auto-atlas'ing it as it's encountered, for example. (This is pretty much what the stdlib does within encoding/json.)
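A minimal sketch of that lazy auto-atlasing, under the assumption of a placeholder Atlas type (the real atlas would carry the traversal info from the gist above, not just field names):

package translate

import (
	"reflect"
	"sync"
)

// Atlas is a stand-in: imagine it carries the field-traversal info
// described in the earlier draft, not just the field names.
type Atlas struct {
	Fields []string
}

// AutoAtlasFor builds a trivial placeholder atlas by reflecting over
// field names. Assumes v is a struct; the real thing would read struct
// tags, handle embedding, and so on.
func AutoAtlasFor(v interface{}) Atlas {
	t := reflect.TypeOf(v)
	a := Atlas{}
	for i := 0; i < t.NumField(); i++ {
		a.Fields = append(a.Fields, t.Field(i).Name)
	}
	return a
}

// atlasCache auto-atlases each new type the first time it's seen,
// much like encoding/json caches per-type encoders.
type atlasCache struct {
	mu    sync.Mutex
	known map[reflect.Type]Atlas
}

func (c *atlasCache) atlasFor(v interface{}) Atlas {
	t := reflect.TypeOf(v)
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.known == nil {
		c.known = make(map[reflect.Type]Atlas)
	}
	if a, ok := c.known[t]; ok {
		return a
	}
	a := AutoAtlasFor(v)
	c.known[t] = a
	return a
}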

Detail: what if I really need wild custom marshal logic?

Declaring an atlas should provide enough flexibility for the majority of situations. The whole job of the atlas system is to make all common custom behaviors possible, without needing to manually code up MarshalJSON method bodies! However, if you need behaviors like handling a set of fields that's unknown based on type info alone (and seriously, what are you doing there? But ah well), we can handle that too.

(Sidebar: ok, so there's the possibility of constructing the token-emitting and token-consuming functions yourself, but that's almost never what an end-user of this library should do (see the following section for a glimpse of how deep the fiddliness gets, then close your eyes, shake your head, and come back up here). tl;dr you want to define a mapping; you almost never care to re-implement the details of making sure that mapping can accept its inputs in any order. So, instead: we have a more intermediate layer.)

The user can declare an atlas generator, instead of an atlas. This path is a slight performance hit because of the complexity (and the sheer number of mallocs implied), but not unduly so: it's basically the equivalent of declaring a MarshalJSON method that uses an inline anonymous struct{/*...*/}{} declaration when using stdlib encoding/json.
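For reference, this is the stdlib pattern that comparison points at -- a MarshalJSON body building an inline anonymous struct on every call (the Event type here is invented):

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type Event struct {
	Name string
	At   time.Time
}

// MarshalJSON declares an anonymous struct inline, per call -- functional,
// but it costs an extra allocation and a second marshal pass each time,
// roughly the overhead an atlas generator would accept.
func (e Event) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		Name string `json:"name"`
		At   int64  `json:"at_unix"`
	}{
		Name: e.Name,
		At:   e.At.Unix(),
	})
}

func main() {
	out, _ := json.Marshal(Event{Name: "pin", At: time.Unix(1500000000, 0)})
	fmt.Println(string(out)) // {"name":"pin","at_unix":1500000000}
}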

Detail: TokenSource and TokenSink shall be implemented by CPS step funcs.

TokenSource and TokenSink implementations should both lean heavily on continuation-passing style. (I've done a spike already where the token source would walk an object and call the token sink directly as it went; this was simpler to write because it allows freely using the goroutine stack as the parse state stack, but it turned out poorly overall: it's difficult to be flexible with this approach, and I found it very anti-robust, since each new TokenSource would need to correctly wield the TokenSink with no nanny code in-between that could catch obvious missteps.) The json.scanner type in the standard lib is a good example of the CPS approach -- there's a state struct with a step func pointer, and a custom "stack" -- follow that pattern in all components.
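Here's a minimal sketch of that shape, not the real code: a state machine holding a step func pointer plus an explicit stack, advanced one token at a time (the Token type and the states are invented for illustration).

package tokens

import "fmt"

// Token is an illustrative stand-in for whatever token type the real
// TokenSource/TokenSink interfaces end up carrying.
type Token struct {
	Kind  rune // e.g. '{', '}', 's' for string...
	Value interface{}
}

// machine follows the json.scanner shape: a step func pointer for the
// current state, plus an explicit stack instead of the goroutine call stack.
type machine struct {
	step  func(*machine, Token) error
	stack []func(*machine, Token) error
}

func newMachine() *machine {
	return &machine{step: stepExpectMapOpen}
}

// Step feeds one token to the current state. The pump that calls this is
// where the "nanny code" can live, checking errors from both sides.
func (m *machine) Step(t Token) error {
	return m.step(m, t)
}

func stepExpectMapOpen(m *machine, t Token) error {
	if t.Kind != '{' {
		return fmt.Errorf("expected map open, got %q", t.Kind)
	}
	m.stack = append(m.stack, stepDone) // state to resume once this map closes
	m.step = stepExpectKeyOrClose
	return nil
}

func stepExpectKeyOrClose(m *machine, t Token) error {
	if t.Kind == '}' {
		// pop the explicit stack and resume the enclosing state
		m.step = m.stack[len(m.stack)-1]
		m.stack = m.stack[:len(m.stack)-1]
		return nil
	}
	// ... key/value/nesting handling elided ...
	return nil
}

func stepDone(m *machine, t Token) error {
	return fmt.Errorf("unexpected token after end of value")
}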

With that done, the top level setup can look like this:

translator := NewCodecPair(NewJsonDecoder(r), NewCborEncoder(w))

The engine code in NewCodecPair that glues the token source/sink together pumps the step functions of both components, and thus can check to make sure they're both operating correctly, and in case of panics, stack traces are quite clear.
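A rough sketch of what that pump loop could look like, reusing the illustrative Token type from the sketch above (these TokenSource/TokenSink signatures are assumptions, not the actual API):

// Invented step-pumping interfaces; the point is only that the engine sits
// between the two machines and can check both of them on every step.
type TokenSource interface {
	Step() (tok Token, done bool, err error) // yield the next token; done when the value is complete
}

type TokenSink interface {
	Step(tok Token) (done bool, err error) // consume one token; done when a complete value was received
}

func pump(src TokenSource, sink TokenSink) error {
	for {
		tok, srcDone, err := src.Step()
		if err != nil {
			return fmt.Errorf("token source: %v", err)
		}
		sinkDone, err := sink.Step(tok)
		if err != nil {
			return fmt.Errorf("token sink: %v", err)
		}
		// "nanny code": both sides must agree on where the value ends.
		if srcDone != sinkDone {
			return fmt.Errorf("source and sink disagree on end of value")
		}
		if srcDone {
			return nil
		}
	}
}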

putting a :bowtie: on it

This whole document has been a deep dive on how the internals of a good object mapping system should come together. But the happy path is much simpler:

translate.NewCborMarshaller(stdout).MustMarshal(&obj)

Nothing needs to look too fancy, up here.

Inside, we know that's going to create a token source and token sink, bind the pair together, have a default atlas in the VarTokenSource that generates new atlases using the "autoatlas" struct tag reflector and caches them as it goes along and hits new types, and so on. But that's everything the stdlib already does when you marshal json -- and on the outside, we can definitely make this just as simple.

jbenet commented 7 years ago

Hey @heavenlyhash -- i agree with a significant fraction of the above, except that this isn't fully caught up with the formulation of IPLD (with tree and resolve functions, walking them to find values), and with various important constraints:

Other:

Also, i think Atlas is a clever name for "a mapping", though i think you just want "morphism" -- the actual mathematical name of what you mean. That said, i don't think you need it here, save as a shortcut or useful intermediate (hidden) representation.