lilnasy / es-codec

An efficient, compact, extensible, zero-dependency binary serialization library for browser-server communication.

Extensions API: A way for users to broaden supported types #16

Closed: lilnasy closed this issue 1 year ago

lilnasy commented 1 year ago

API

interface Extension<T> {
    when   : (x      : unknown)     => x is T
    encode : (x      : T)           => ArrayBuffer
    decode : (buffer : ArrayBuffer) => T
}

function createCodec(extensions : Extension<any>[]): { encode : typeof EsCodec.encode, decode : typeof EsCodec.decode }

Usage

import * as EsCodec from "es-codec"

const { encode, decode } = EsCodec.createCodec([
    {
        when  (x : unknown)        : x is URL    { return x.constructor === URL },
        encode(x : URL)            : ArrayBuffer { return EsCodec.encode(x.href) },
        decode(buffer: ArrayBuffer): URL         { return new URL(EsCodec.decode(buffer) as string) }
    }
])
lilnasy commented 1 year ago

@jeff-hykin I want to know what you think about this

jeff-hykin commented 1 year ago

I was actually thinking about this earlier this morning.

This feature seems like a good way to avoid the complexity of having a class-method based encoding/decoding.

There are a few problems I see though.

  1. I think the ordering being important is really painful. Imagine running experiments and saving them to disk using es_serial because the results are too large to save in a json file. Code bases change, and types come and go frequently. So now imagine having this O(n^2) list of lists, where the first element is all the encoders/decoders in the order they were in two weeks ago, the second element is all the encoders/decoders in the order they were in one week ago, and the third element is all the encoders/decoders in the current/latest order. All that boilerplate (and dangling old versions of classes/encoders/decoders) just so that experimental results from two weeks ago can still be loaded/analyzed. I analyze results from 8-month-old 120MB serialized files; it could quickly get gruesome. Instead of a list, it's possible to just expose the tag mapping as an enumeration of some kind, literally an object with numbers as keys, so that the same encoder/decoder can keep the same number over time while new types come and go. But I think a better approach is to just expand the tag size, have named encoders/decoders, and hash each name to a UUID, which is used directly as a 64- or 128-bit tag.
  2. I don't think this is too hard to solve, but I think the 128-type limit is too strict because people would have to fork the repo to get around it. I don't mean this in a negative way, but it feels less like an "extension" and more like a hack to fit some extra types in. I'd probably rather have a "pure" serialization than one that supports a hard limit of 128 types. Now, what I'm about to propose is separate from my UUID recommendation, but I think an easy backwards-compatible solution is to create a setTag and a getTag. For getTag, do a getUint8, and if the last bit of that uint8 equals 1, get another uint8, and keep repeating until the last bit isn't 1. Then it's easy to combine the other bits into a number; I've got a handy function in my deno binaryify module that does this in both directions. It can take a uint8 and return an "escaped" uint8, where after every 7 bits a "spare" bit is added for encoding reasons (e.g. 11111111_11111111 => 11111110_11111110_11000000, as well as the "unescape" direction 11111110_11111110_11000000 => 11111111_11111111). A rough sketch of this escaping idea appears after this list.
  3. The order of the "when" checks could be a problem, as could error handling for a "when" check that throws an error. I have a similar API for custom encoding of python json, but there I exposed the "when" functions as keys in an override table (https://github.com/jeff-hykin/json_fix#override-table). I think it is important to expose how many checks there are and what order they're executed in, so that devs aren't surprised by the behavior. And I think the most intuitive default is for the most recently added "when" to be the first one checked.
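
The escaping idea from point 2 could be sketched roughly as below. This is only an illustration under my own assumptions, not es-codec or binaryify code, and it uses the common convention of the high bit as the continuation bit (the example bytes above lay the bits out differently):

function setTag(view: DataView, offset: number, tag: number): number {
    // write 7 bits of the tag per byte; the high bit says "another byte follows"
    let written = 0
    do {
        let byte = tag & 0b0111_1111
        tag >>>= 7
        if (tag !== 0) byte |= 0b1000_0000
        view.setUint8(offset + written, byte)
        written++
    } while (tag !== 0)
    return written // how many bytes the tag occupied
}

function getTag(view: DataView, offset: number): { tag: number, read: number } {
    // read bytes until one arrives without the continuation bit set
    let tag = 0
    let read = 0
    let byte: number
    do {
        byte = view.getUint8(offset + read)
        tag |= (byte & 0b0111_1111) << (7 * read)
        read++
    } while (byte & 0b1000_0000)
    return { tag, read }
}

With this, a tag below 128 still fits in a single byte, so small payloads don't grow.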
lilnasy commented 1 year ago
> 1. ...have named encoder/decoders and hash...

@jeff-hykin That's a great idea! We could check, when `createCodec` is called, that all names/IDs are unique, and maybe use the strings directly as tags so we don't have to worry about hashing or hash collisions.

> 2. ...if the last bit of the uint8 equals 1 then, get another uint8...

(Regarding type tags longer than one byte.) Varint encoding is how array/string lengths are encoded now, and I'm considering it for numbers as well: #17. I guess numbers and other primitives could be allowed as the ID of an extension for cases where strings are seen as too wasteful.

> 3. ...as well as the error handling for a "when" check that throws an error...

I'm not sure if recovering from an error there should be `es-codec`'s responsibility.

> 3. ...expose the list of how many checks there are...

Unlike `json.override_table` in json_fix, the entire extensions array is provided by the user. They should have complete control over it (and no control over how base types are processed).

This is the revised API:

const { encode, decode } =
    EsCodec.createCodec([
        {
            name: "URL",
            when  (x    : unknown) : x is URL         { return x.constructor === URL },
            encode(url  : URL    ) : BaseSerializable { return url.href },
            decode(href : string ) : URL              { return new URL(href) }
        }
    ])
lilnasy commented 1 year ago

After having used this API, I believe effectful encoding/decoding needs to be made easier. The use case I'm concerned with is where I need to "register" a virtual ReadableStream within a websocket.

It is trivial to create effects in the global scope, but to use a websocket connection in an encode/decode function, I had to call createCodec for each connection such that it captures the websocket in scope and becomes a closure.
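
For illustration, the closure-per-connection workaround I'm describing looks roughly like this (only a sketch; the ReadableStream extension body is hypothetical and the interesting parts are elided):

function createCodecForConnection(socket: WebSocket) {
    // the extension closes over `socket`, so a fresh codec is needed per connection
    return EsCodec.createCodec([
        {
            name: "ReadableStream",
            when  (x: unknown): x is ReadableStream { return x instanceof ReadableStream },
            encode(stream: ReadableStream) {
                const streamId = crypto.randomUUID()
                // forward the stream's chunks over the captured socket (elided)
                return streamId
            },
            decode(streamId: string) {
                // reassemble a stream from messages arriving on the captured socket (elided)
                return new ReadableStream()
            }
        }
    ])
}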

Inspired by the official msgpack javascript package, context for extensions might look like this:

const { encode, decode } =
    EsCodec.createCodec<{ socket : WebSocket }>([
        {
            name: "URL",
            when  (x    : unknown, context) : x is URL         { return x.constructor === URL },
            encode(url  : URL    , context) : BaseSerializable { console.log(context.socket); return url.href },
            decode(href : string , context) : URL              { console.log(context.socket); return new URL(href) }
        }
    ])

const buffer  = encode(data,   { socket : new WebSocket(url) })
const decoded = decode(buffer, { socket : new WebSocket(url) })
jeff-hykin commented 1 year ago

Sorry I've been gone for a bit. I think the new proposed API looks good. There is still some ambiguity about when the "when" clauses are run (are they run in order from top to bottom, or bottom to top?). I think supporting native types is a fantastic addition that makes it so much easier to use.

I don't quite understand the websocket issue (why do the console logs need to be in the encode/decode?)

lilnasy commented 1 year ago

> There is still some ambiguity about when the "when" clauses are run (are they run in order from top to bottom, or bottom to top?)

@jeff-hykin They are run top to bottom. If the 1st extension's when returns true, the 2nd extension's when is not executed.

In practical terms, this means that more "specific" extensions should be placed near the start of the array you pass to createCodec.
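
A contrived example of that rule (the extensions here are my own, assuming numbers and null are base-serializable): a Circle is also a Shape, so the Circle extension has to come first or it would never be reached.

class Shape {}
class Circle extends Shape { constructor(public radius = 1) { super() } }

const { encode, decode } = EsCodec.createCodec([
    {
        // more specific check first
        name: "Circle",
        when  (x: unknown): x is Circle { return x instanceof Circle },
        encode(circle: Circle)          { return circle.radius },
        decode(radius: number)          { return new Circle(radius) }
    },
    {
        // general fallback, only reached when the Circle check returned false
        name: "Shape",
        when  (x: unknown): x is Shape  { return x instanceof Shape },
        encode(shape: Shape)            { return null },
        decode(nothing: null)           { return new Shape() }
    }
])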


> I don't quite understand the websocket issue (why do the console logs need to be in the encode/decode?)

The console.log serves as an example for how you would affect the outside world.

The MessagePack readme provides a longer example where the context is something that keeps track of everything being encoded/decoded.

It's for cases where the object you're serializing is not an unchanging static object (ReadableStream, for my use case).

It's needed because it's easy to affect a variable declared at the top-level, but it gets tricky when you need to affect something that's passed in as an argument somewhere (a WebSocket connection, for my use case).

jeff-hykin commented 1 year ago

> In practical terms, this means that more "specific" extensions should be placed near the start of the array you pass to createCodec.

Great, so long as this is mentioned prominently in the docs, I don't see any problem with the API.

Even better, this design makes it easy to define a codec in a standalone file, publish it on Deno.land, and then import it wherever it is needed (e.g. sending someone a serialized file along with an `import codec from "somewhere on deno.land"`).
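
For example (my own sketch; the module name and extension are hypothetical), a shared extension could live in a standalone module and be passed to createCodec wherever it is imported:

// url-extension.ts, published anywhere (e.g. on deno.land)
export const urlExtension = {
    name: "URL",
    when  (x: unknown): x is URL { return x instanceof URL },
    encode(url: URL)             { return url.href },
    decode(href: string)         { return new URL(href) }
}

// wherever the codec is needed:
// import { urlExtension } from "<somewhere on deno.land>/url-extension.ts"
import * as EsCodec from "es-codec"

const { encode, decode } = EsCodec.createCodec([urlExtension])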

> It's for cases where the object you're serializing is not an unchanging static object (ReadableStream, for my use case).

~Hmm, I'm still not understanding when it would be desirable to serialize with side effects. If I were to serialize a websocket I'd imagine it something like:~

// sorry, I'm simplifying to JS, it's a bit easier for pseudocode
const { encode, decode } =
    EsCodec.createCodec([
        {
            name: "Websocket",
            when: (x)=> x instanceof WebSocket,
            encode: (websocket) => websocket.url,
            decode: (websocketUrl) => new WebSocket(websocketUrl)
        }
    ])

~So I don't really know what kind of logic would go into a context.track() function. For a readable stream I would imagine maybe saving/loading an index, but I don't really understand why there would need to be side effects.~

jeff-hykin commented 1 year ago

Actually, I think I see what you mean. If encode is partially encoding a value (e.g. iteratively encoding an iterable), it would be nice for it to pick up where it leaves off:

const socket = new WebSocket("stuff")
const localFileStream = new ReadableStream({ /* pulls chunks of local stuff */ })

const { encode, decode } = /* ... */

// server.js: normal way
for await (const chunk of localFileStream) {
    socket.send(encode(chunk))
}

// client.js: normal way
const chunks = []
socket.addEventListener("message", event => {
    chunks.push(decode(event.data))
})

// ^ kind of having to manually write encode/decode logic

I think there is a lot of value in having a pure-function encoder/decoder, so I feel like a streaming encoder/decoder is a different problem (e.g. it seems like a feature beyond es-codec@1.0.0 to me, and/or it deserves its own stream-encoding API that maybe internally utilizes the pure-function encode/decode).

In terms of the streaming API, hear me out: I think iterables might cover all cases (meaning managing a context could be unnecessary). Iterables are pretty much just a streamUUID & metadata plus individually serializable chunks (even if chunks are themselves iterables, i.e. needing recursive serialization). Let's say there are two async readable streams, stream1 and stream2. Their chunks won't necessarily be sent in a strict order (e.g. it might be stream1.chunk1, stream1.chunk2, stream2.chunk1, stream1.chunk3 instead of stream1.chunk1, stream2.chunk1, stream1.chunk2, stream2.chunk2), so the deserializer will need to handle processing the ID while making sure all the chunks end up back in the right stream. Variable ordering alone seems to me like a case for having two different APIs (e.g. normal encode should be a deterministic process, while stream-encoding isn't necessarily deterministic).

That encode/decode logic above (the encoder labelling chunks with a streamID, then the decoder sorting chunks using that streamID) seems very general to me, so having each dev re-implement it with their own custom Context object seems suboptimal. Maybe I'm wrong, especially for "connection" things, since they're more than just a streamID + metadata.
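
To make that concrete, here's a minimal sketch of the generic logic I mean (all the names here, like StreamChunk and receiveChunk, are made up, and the chunk payloads would still go through the normal encode/decode): the sender tags every chunk with its stream's ID, and the receiver routes chunks back to the right stream no matter how they interleave.

type StreamChunk = { streamId: string, done: boolean, value?: unknown }

// sender: tag every chunk of a stream with the same ID
async function sendStream(stream: ReadableStream, send: (chunk: StreamChunk) => void) {
    const streamId = crypto.randomUUID()
    for await (const value of stream) {      // ReadableStream is async-iterable in Deno
        send({ streamId, done: false, value })
    }
    send({ streamId, done: true })
}

// receiver: keep one controller per streamId and route incoming chunks to it
const controllers = new Map<string, ReadableStreamDefaultController>()

function receiveChunk(chunk: StreamChunk): ReadableStream | undefined {
    let newStream: ReadableStream | undefined
    if (!controllers.has(chunk.streamId)) {
        // first chunk of a stream we haven't seen: create it and remember its controller
        newStream = new ReadableStream({
            start(controller) { controllers.set(chunk.streamId, controller) }
        })
    }
    const controller = controllers.get(chunk.streamId)!
    if (chunk.done) {
        controller.close()
        controllers.delete(chunk.streamId)
    } else {
        controller.enqueue(chunk.value)
    }
    return newStream // defined only when this chunk opened a new stream
}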

I'll need more time to think about this though, I will be facing this problem myself soon enough.

lilnasy commented 1 year ago

@jeff-hykin This is the use case I made es-codec to enable. The highlighted line is a "server function" that's going to be called from a browser. https://github.com/lilnasy/astro-server-functions/blob/streams/example/src/serverfunctions.ts#L5

> Iterables are pretty much just a streamUUID & metadata plus individually serializable chunks

Exactly! That is almost verbatim how I've implemented it here (the code hasn't been updated to use context.) https://github.com/lilnasy/astro-server-functions/blob/streams/client-runtime.ts#L28

> I think there is a lot of value in having a pure-function encoder/decoder

I agree with you, although JavaScript doesn't have a way to enforce this, certainly not from a library. Besides, you can always choose to stick with context-free pure encoding (the URL example still works).