ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Allow strongly typed arrays to contain strongly typed arrays #43

Closed edgar-bonet closed 10 years ago

edgar-bonet commented 10 years ago

Hi!

The current (2014-05-23) version of the specification seems to only allow primitive types (called “value” types) inside strongly typed arrays. As an enhancement, I am requesting that strongly typed arrays be acceptable as elements of strongly typed arrays. This would provide optimization for multidimensional arrays, without resorting to any new syntactic construct.

Example

I have (I really do) a few huge JSON files, each holding, among other things, a thousand arrays that look like this:

[
    [1.23, 4.56],  // this is one datapoint
    [7.89, 0.12],  // another datapoint
    // a few thousand more datapoints...
]

I am looking for a more efficient, JSON-compatible, binary representation of the same data. “Unoptimized” UBJSON yields:

[[]
    [[] [d][1.23] [d][4.56] []]
    [[] [d][7.89] [d][0.12] []]
    // a few thousand more datapoints...
[]]

With this representation, each datapoint costs 8 bytes of data (float32 is enough precision for me), plus 4 bytes of overhead. That's 50% overhead, not so good.

The same array as “optimized” UBJSON is:

[[]
    [[][$][d][#][i][2] [1.23] [4.56]
    [[][$][d][#][i][2] [7.89] [0.12]
    // a few thousand more datapoints...
[]]

Now we have 8 bytes of data + 6 bytes of overhead per datapoint. That's 75% overhead, so the optimization is obviously not good for these small inner arrays.

Per the current proposal, the outer array can also be optimized, which yields the following “recursively optimized” UBJSON:

[[]                 // this is an array
    [$][[]          // of arrays
        [$][d]      // of float32
        [#][i][2]   // inner arrays have length 2
    [#][I][3200]    // outer array has length 3200
    [1.23] [4.56]   // first datapoint
    [7.89] [0.12]   // second datapoint
    // a few thousand more datapoints...

Now we have a really optimized layout with zero overhead.

And importantly, we are not introducing any new syntax, but only specifying that the “type marker” of a strongly typed array is:

[type of array] = [[][$][type of elements][#][length of array]

In the above example, the type marker of the outer array ("[$[$d#i<2>#I<3200>" for short) would be recursively parsed as:

level 0: [$ ┐                   ┌→ #I<3200> = array of length 3200
level 1:    └→ [$ ┐    ┌→ #i<2> ┘           = arrays of length 2
level 2:          └→ d ┘                    = float32
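
To make that concrete, here is a minimal sketch (my own illustration, not part of the proposal itself) that emits exactly the header bytes of the recursively optimized example above; the function name and the use of std::ostream are assumptions:

#include <cstdint>
#include <ostream>

// Emit the header of the "recursively optimized" example above: an outer array
// of 3200 inner arrays, each itself a fixed-type array of 2 float32 values.
void write_example_header(std::ostream &out) {
    out << "[$";   // outer container is a strongly typed array...
    out << "[$d";  // ...whose element type is itself a typed array of float32 ('d')
    out << "#i";   // inner length, int8-encoded...
    out.put(2);    // ...= 2
    out << "#I";   // outer length, int16-encoded (big endian, as UBJSON requires)...
    const std::uint16_t n = 3200;
    out.put(static_cast<char>(n >> 8));    // 0x0C
    out.put(static_cast<char>(n & 0xFF));  // 0x80
    // ...followed by 3200 * 2 raw big-endian float32 values, with no per-element markers.
}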

Regards, and thanks for the good work.

ghost commented 10 years ago

@edgar-bonet I like the win, in this particular case, but I am having a hard time wrapping my head around the recursive implications here... for example a 3D or 4D array?

I understand the request in a 2D case:

<define header for outer and inner array>
  <data payload>

When I think of allowing 0+ definitions of type-size pairs ($-#), each corresponding to a lower and lower level of array, it puts more and more burden on the parser or generator to be context-aware.

I'm not against this, I'm just thinking through it out loud... would like others to weigh in as well.

(Restating examples so we can reference them for further discussion...)

1. JSON (40 bytes)

[
    [1.23, 4.56],
    [7.89, 0.12],
    [3.14, 0.08]
]

2. Plain UBJSON (38 bytes)

[[]
    [[][d][1.23][d][4.56][]]
    [[][d][7.89][d][0.12][]]
    [[][d][3.14][d][0.08][]]
[]]

3. Optimized Container UBJSON (47 bytes)

[[]
    [[][$][d][#][i][2][1.23][4.56][]]
    [[][$][d][#][i][2][7.89][0.12][]]
    [[][$][d][#][i][2][3.14][0.08][]]
[]]
edgar-bonet commented 10 years ago

Well, the definition

[type of array] = [[][$][type of elements][#][length of array]

being recursive, it allows for arrays of any dimension. I am guessing that any UBJSON implementation already relies heavily on recursion. This proposal just requires a little bit more of it.

I've written a not-so-clean Qt-based implementation, but I do not have access to it now (it's in my office). Anyhow, for the parser, the general idea is to have one recursive function for parsing the “type signatures”, and a second recursive function for parsing the actual data. In JavaScript-style pseudo-code:

function parse_type_signature(byte_stream)
{
    if (byte_stream starts with "[$") {  // fixed type array
        advance byte_stream by 2;  // consume "[$"
        var inner_type = parse_type_signature(byte_stream);  // recurse!
        assert(byte_stream starts with "#");
        advance byte_stream by 1;  // consume "#"
        var count_type = parse_type_signature(byte_stream);
        assert(count_type is some integer type);
        var count = parse_data(count_type, byte_stream);
        return {
                type: FIXED_TYPE_ARRAY,
                element_type: inner_type,
                element_count: count
            };
    }
    else {
        // other types...
    }
}

function parse_data(type, byte_stream)
{
    if (type.type == FIXED_TYPE_ARRAY) {
        var data = new Array of length type.element_count;
        for (var i = 0; i < type.element_count; i++)
            data[i] = parse_data(type.element_type, byte_stream);
        return data;
    }
    else {
        // other types...
    }
}
ghost commented 10 years ago

@edgar-bonet I was letting this proposal marinate a little bit in my brain - the space-optimization win is too big to ignore and I've since come around on the upfront definition of the container for an X-dimension set of containers... it is the nature of the strongly typed container anyway, so there is no divergence from what we already have.

I'm asking for feedback from the community on this, but overall I think it's a great proposal.

meisme commented 10 years ago

Since strongly typed arrays can only contain one type of data, these cases can always be "unnested" for serialization, moving the complexity from the (de)serializer to the application. The previous example then becomes

[[][$][d][#][i][6]
    [1.23][4.56][7.89][0.12][3.14][0.08]

This is just 30 bytes (of which 24 are the floating-point data itself). Personally I don't see the need for introducing the suggested complexity, since it will inevitably add more overhead than this. There might be a practical use case for nested strongly typed objects, though; I haven't thought that all the way through yet.
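
For illustration, a sketch of what that unnesting would look like on the application side (my own example, not meisme's code): the data model keeps its pairs, and only the serialization boundary sees the flat float32 array.

#include <cstddef>
#include <utility>
#include <vector>

// Flatten an array of (x, y) datapoints into one strongly typed float32 array...
std::vector<float> flatten(const std::vector<std::pair<float, float> > &pts) {
    std::vector<float> flat;
    flat.reserve(pts.size() * 2);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        flat.push_back(pts[i].first);
        flat.push_back(pts[i].second);
    }
    return flat;
}

// ...and rebuild the pairs on the application side after deserialization.
std::vector<std::pair<float, float> > unflatten(const std::vector<float> &flat) {
    std::vector<std::pair<float, float> > pts(flat.size() / 2);
    for (std::size_t i = 0; i < pts.size(); ++i)
        pts[i] = std::make_pair(flat[2 * i], flat[2 * i + 1]);
    return pts;
}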

kxepal commented 10 years ago

While the idea looks good, I fear that we'll make UBJSON too complex and ugly by attempting to handle every possible case with "optimized" constructions. UBJSON should provide plain and simple primitives to describe data structures easily and effectively while remaining resistant to errors at the same time. I feel we've been drifting away from that original idea since Draft 10 (I'll open an issue to discuss this a bit later).

Actually, there is only one case you would really want optimized: an array of pairs. It's a very common case for representing geo data, money, node relations (A->B), ip-port pairs, and lists of properties (strings). This could easily be done with a new container P (pair), which expects the following byte to be the element type marker (primitives only) and which is the only container allowed in an array type description:

Plain array - 44 bytes: 2 + N(4 + 2S)

[[]
    [[][d][1.23][d][4.56][]]
    [[][d][7.89][d][0.12][]]
    [[][d][3.14][d][0.08][]]
[]]

Plain array of pairs - 38 bytes: 2 + N(2 + 2S)

[[]
    [P][d][1.23][4.56]
    [P][d][7.89][0.12]
    [P][d][3.14][0.08]
[]]

Optimized array of pairs - 34 bytes: 4 + 2NS

[[][$][P][d]
    [1.23][4.56]
    [7.89][0.12]
    [3.14][0.08]
[]]

For N=1000 the difference becomes more significant: 14002, 12002 and 10004 bytes respectively. Good old JSON comes in at around 15001 bytes (as an array of strings, since we have to preserve accuracy). As for the other cases, I don't feel good about overengineering UBJSON to handle X-dimensional arrays. Let's make it suitable for real-world problems and let it do simple things in the easiest and most effective way possible.

UPDATE: Yes, you may argue that there are many other cases where nested arrays are applicable. And you'd be right: RDF describes data in triples, so pairs wouldn't fit it well. However, that's the only exception as popular as pairs that comes to mind, and you actually cannot optimize RDF triples with UBJSON anyway, since their elements are strings of variable length.

kxepal commented 10 years ago

But again, all nested-array optimizations require either a two-pass serializer or a predefined data schema: you cannot apply these optimizations until you are sure that all of an array's elements share a common data type. That gives you an O(2N) serialization cost, and the size benefit has to be worth it. On real-world high-load tasks nobody will do that, just as everybody uses their own line-per-element protocol for JSON data instead of a read-and-decode strategy.

edgar-bonet commented 10 years ago

@meisme: Unnesting would not work for me. I have an application that is already in production and works well with JSON, except that the generated files can occasionally get too big to be practical. Users are familiar with the data model, which is “natural” for their use case. I just want to provide a more efficient serializing format in a way that is transparent to the users, thus without changing the data model. And this is a major selling point of UBJSON: it is supposed to work transparently with your current JSON data model.

@kxepal: Most of the time I will indeed have long arrays of pairs. However, occasionally these can be triplets. Typically, when acquiring data through a lock-in detector I have things like

[external_parameter, signal_in_phase, signal_in_quadrature]

The arrays could even be longer if the user sweeps several parameters simultaneously.

About the complexity: I do not think the complexity for the serializer should be an issue. As with any optimization, it is important that the format stays simple for the parser if the parser has to remain universal. The serializer, even a universal serializer, is free to ignore any optimizations... or implement whatever may make sense for the application.

edgar-bonet commented 10 years ago

@kxepal: More on the complexity of the serializer...

I just realized that the O(2N) complexity has nothing to do with this proposal: it's already in the current version of the standard if one wants to leverage the optimization afforded by fixed-type arrays. Here is the algorithm I use for deciding what type to serialize some data into:

/*
 * Find a common supertype of a and b.
 */
function merge_types(a, b)
{
    if (a == b) return either;
    if (either a or b is a subset of the other)
        return the "biggest" type;
    if (a == INT8 and b == UINT8 or vice-versa)
        return INT16;

    // The only part specific to this proposal:
    if (both are FIXED_TYPE_ARRAY with same length) {
        var inner_type = merge_types(a.element_type, b.element_type);
        if (inner_type == INVALID) return INVALID;
        else return FIXED_TYPE_ARRAY of inner_type;
    }

    otherwise return INVALID;
}

/*
 * Get the proper type for serializing this piece of data.
 */
function get_type(data)
{
    if (data.isArray()) {
        var inner_type = NOP;  // compatible with all other
        for each (element of data) {
            var element_type = get_type(element);
            inner_type = merge_types(inner_type, element_type);
            if (inner_type == INVALID) return PLAIN_ARRAY;
        }
        return FIXED_TYPE_ARRAY of inner_type;
    }
    else {
        // other types...
    }
}

As stated in the comments, the only part specific to this proposal is where I try to decide whether two fixed-type arrays have compatible types. I do so by having merge_types() call itself recursively. The recursion is not expected to go deep: only as deep as the dimensionality of the multi-dimensional array.

This is for the universal serializer. It is intended to serialize whatever the user throws at it. I also have a situation where the data is generated by the application itself (a data acquisition process, actually) and I know beforehand what kind of data to expect. In this case I can output the known type marker of the multidimensional array and then just stream the data in real time with no size overhead and a very low processing cost (cast to float, convert to big endian, output).
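
A rough sketch of that streaming step (my own reconstruction under the stated assumptions: IEEE-754 floats on the host, with bytes reordered to UBJSON's big-endian order):

#include <cstdint>
#include <cstring>
#include <ostream>

// Stream one float32 datum of a pre-declared strongly typed array:
// cast to float32, convert to big endian, output; no per-element marker is needed.
void write_float32_be(std::ostream &out, double value) {
    float f = static_cast<float>(value);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    char be[4] = {
        static_cast<char>(bits >> 24), static_cast<char>((bits >> 16) & 0xFF),
        static_cast<char>((bits >> 8) & 0xFF), static_cast<char>(bits & 0xFF)
    };
    out.write(be, sizeof be);
}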

meisme commented 10 years ago

Just to throw it in there: for most strictly statically typed languages (e.g. C++, which I'm in the process of writing a (de)serialiser for), types (including nested ones) can often be known at compile time, meaning that the only increase in complexity will be for interpreting the data.
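
For illustration only (my own sketch, not meisme's code, with int8 lengths assumed for brevity), a C++ serializer can derive the nested header of the original proposal purely from the static type:

#include <cstddef>
#include <string>

template <typename T> struct TypeSig;  // primary template: only specializations below

// Primitive case: float maps to UBJSON's float32 marker 'd'.
template <> struct TypeSig<float> {
    static std::string value() { return "d"; }
};

// Recursive case: a fixed-size array T[N] maps to "[$<element signature>#i<N>".
template <typename T, std::size_t N> struct TypeSig<T[N]> {
    static std::string value() {
        return std::string("[$") + TypeSig<T>::value() + "#i" + static_cast<char>(N);
    }
};

// TypeSig<float[3][2]>::value() yields the bytes of the nested strongly typed
// header "[$[$d#i<2>#i<3>" with no runtime inspection of the data.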

Steve132 commented 10 years ago

N-D arrays of raw binary data DO have a pretty significant use in scientific computing and many other data sets....but I'm honestly VERY concerned with the amount of extra logical complexity in the parser and reader this proposal will introduce. It also sorta starts to lose 1-1 parity with javascript.

One 'middle ground' proposal that I would prefer dramatically would fix all the use cases here while simultaneously getting rid of the complexity of the parser and the writer and give us all the performance benefits is as follows:

Call them 'Multi-dimensional Strongly-typed Arrays'. What we do is simply make it so that after the size information in the STC array header there are 0 or more '@' patterns that optionally specify additional array dimensions for this container. (We can't reuse '#' because it introduces an ambiguity in the parser with a one-D STC array of sized objects.)

For example, here's the OP's proposal, extended to a 3D array (like a 3200x3200x3 bitmap):

[[]                 // this is an array
    [$][[]          // of arrays
    [$][[]      // of arrays
         [$][d]      // of float32
         [#][i][3]   // inner arrays have length 3
    [#][I][3200]    // middle array has length 3200
    [#][I][4223]    // outer array has length 4223

    [1.23] [4.56] [4.56]    // first datapoint
    [7.89] [0.12] [4.56]  // second datapoint
    // a few thousand more datapoints...    

That's fine, but recursively parsing and validating that is a LOT of complexity and a total rewrite of the parser, and it doesn't work if ANY of the recursive elements don't parse properly. It's 18 bytes for the header.

In contrast, here's the MSTA version:

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223][@][I][3200][@][i][3]  // first dim is 4223, second dim is 3200, third dim is 3

    [1.23] [4.56] [4.56]    // first datapoint
    [7.89] [0.12] [4.56]  // second datapoint
    // a few thousand more datapoints...    

It's stupid easy to integrate into the parser AND the writer with no recursion or multiple-pass parsing, it has EVEN MORE storage savings than the proposal we are discussing (the header is 14 bytes), and it gives us ALL the performance benefits. Parsing is a LOT easier, and the data structure for the type does not need to be recursive or recursively parsed.

kxepal commented 10 years ago

More fun: array of arrays of objects with arrays of array of object with array of strings. Solution?

[[{"foo": [[{"bar": ["baz"], "boo": [...], ...}, ...], ...], "...": [...]}, ...], ...]

I'm starting to feel that the next step after the strong header section will be a full schema definition that describes the data structure, while the parser just reads it without needing to look up markers, lengths, etc. But aren't we injecting application-level logic into the data format?

kxepal commented 10 years ago

TBH, I'm +1 for the idea of having a "data schema map". It opens a lot of doors for a more compact format without making it ugly and complex.

Steve132 commented 10 years ago

kxepal: I'm highly against a schema. It's really, really complex. Nobody wants to parse and validate a schema before loading something, and trying to maintain the proper schema while writing tools is horrible. I hate that idea. I can't even imagine writing a parser.

kxepal commented 10 years ago

@Steve132 what you proposed is a schema too, just for a specific data type (:

Steve132 commented 10 years ago

Arrays of arrays of objects with arrays of arrays of objects with arrays of strings are already supported perfectly fine in Draft 10. Any subset of them can be sized or typed. Not a problem.

Steve132 commented 10 years ago

@kxepal: I know that but, in a sense, ALL 'formats' are schemas. I'm only addressing the very real performance need for a multi-dimensional array by adding exactly that: a multi-dimensional array. It's a small change as opposed to a massively complicated DSEL

edgar-bonet commented 10 years ago

@Steve132: I have two issues with your proposal.

The minor issue is that I am not convinced that recursion is all that complex, especially given that any parser is probably already completely recursive. Is that more complex than introducing a new syntax (the [@]) for a new concept (Multi-dimensional Strongly-typed arrays)?

The major issue is that your proposal makes the format ambiguous. How can you tell apart this one-dimensional array:

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223]    // length is 4223
    [3.14138794]    // first datum
    [1.23]          // second datum
    // etc...

from this 2D-array:

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223]    // first dim is 4223
    [@][I][3200]    // second dim is 3200
    [1.23]          // first datum
    // etc...

given that [@][I][3200] and [3.14138794] have the same binary representation?
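
The collision is easy to verify (a quick check of my own, assuming an IEEE-754 host):

#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    // The four bytes of [@][I][3200]: '@' (0x40), 'I' (0x49), then 3200 = 0x0C80.
    const unsigned char bytes[4] = {0x40, 0x49, 0x0C, 0x80};
    // Read them big-endian, exactly as a float32 payload would be read...
    std::uint32_t bits = (std::uint32_t)bytes[0] << 24 | (std::uint32_t)bytes[1] << 16
                       | (std::uint32_t)bytes[2] << 8  | (std::uint32_t)bytes[3];
    float f;
    std::memcpy(&f, &bits, sizeof f);
    std::printf("%.8f\n", f);  // prints 3.14138794: indistinguishable from a datum
    return 0;
}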

meisme commented 10 years ago

I much, much prefer this way of doing it. @edgar-bonet is right, though: the definition needs to be

[[]
    [$]<type>
    [@]<x>
    [#]<y>
    <data>

in order to be unique.

kxepal commented 10 years ago

Why use multiple markers to define the same thing? What if you need a third dimension? A fourth? What if the data types on the X and Y axes were different? How would you define complex nested containers?

So far I see solutions that are too complex for too specific a case. A data format should be flexible and cover more common cases. Otherwise no one will use these optimizations, because they create too many restrictions and can easily cause unpredictable behaviour: for one dataset the optimizations are applicable and the resulting binary data is compact, transfers fast and decodes fast, while for the same kind of dataset with slightly different values it suddenly requires more space to store and more time to transfer and process. That's not a good situation.

meisme commented 10 years ago

@kxepal You have a point; however, different data types on different axes are not supported with the current recursive strategy either. If that were to be supported you'd almost certainly need to introduce a tuple type. That might not be such a bad idea, though. Something like

[T]
    [#]<n>
    [$]<first type><second type>...<n-th type>
    <data>

where a tuple is considered a primitive value. You could then use it in an array like so:

[[]
    [$][T]
        [#][i][2][$][d][d]
    [#][i][3]
    [1.23][4.56]
    [7.89][0.12]
    [3.14][0.08]

Naturally you'd also be allowed to specify a tuple as one of the types in a tuple.

edit: For converting UBJSON to JSON, a tuple would always expand to an array with n elements.

edgar-bonet commented 10 years ago

OK, now that I have an implementation that I like, I think I can say more about the complexity. But before, a question for @kxepal: Since your argument against the optimized encodings applies also to the strongly typed arrays we already have, are you proposing to drop them?

Back to the issue: The C++ implementation of the parser and serializer is not substantially more complex than the algorithms I already posted here in my first and third comments... Except for one thing: the parsed type signature is now a recursive structure, which means I had to deal with pointers. In C++, I did it more or less along these lines:

struct UbjsonType {

    // The values of the enum match the markers for easier parsing.
    enum Type {
        NOP   = 'N',
        FALSE = 'F',
        TRUE  = 'T',
        UINT7 = '7',  // internal use only
        INT8  = 'i',
        UINT8 = 'U',
        // ...
        ARRAY = '[',
        FIXED_TYPE_ARRAY,
        OBJECT = '{'
    } type;

    // The fields below are only relevant for FIXED_TYPE_ARRAY.
    UbjsonType *element_type;
    int element_count;

    // Constructor.
    UbjsonType(enum Type t, UbjsonType *subtype = 0, int count = 0)
    : type(t), element_type(subtype), element_count(count) {}

    // Destructor.
    ~UbjsonType() { delete element_type; }
};

Also, since the struct owns the pointer to the nested type, I had to implement The Big Three, which I did by means of copy and swap. This was for me the hardest part, mostly because I am not very proficient in C++ and thus I had to learn about some subtleties such as copy constructors and copy elision optimization. I guess all this should sound trivial to a more seasoned C++ programmer.
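
For reference, a minimal self-contained sketch of the copy-and-swap idiom described above (my own reconstruction, not the actual implementation; the members repeat those of UbjsonType so the example stands alone):

#include <algorithm>  // std::swap

// Standalone sketch: the recursive type descriptor with the "Big Three"
// implemented via copy-and-swap (the enum is simplified to a char marker for brevity).
struct UbjsonType {
    char type;                 // marker, e.g. 'd', 'i', '[', ...
    UbjsonType *element_type;  // owned; only set for FIXED_TYPE_ARRAY
    int element_count;

    UbjsonType(char t = 'N', UbjsonType *subtype = 0, int count = 0)
    : type(t), element_type(subtype), element_count(count) {}

    // Copy constructor: deep-copies the nested type, if any.
    UbjsonType(const UbjsonType &o)
    : type(o.type),
      element_type(o.element_type ? new UbjsonType(*o.element_type) : 0),
      element_count(o.element_count) {}

    // Copy assignment via copy-and-swap: the by-value parameter is already a copy.
    UbjsonType &operator=(UbjsonType o) {
        swap(*this, o);
        return *this;
    }

    ~UbjsonType() { delete element_type; }

    // Non-throwing member-wise swap used by the assignment operator.
    friend void swap(UbjsonType &a, UbjsonType &b) {
        std::swap(a.type, b.type);
        std::swap(a.element_type, b.element_type);
        std::swap(a.element_count, b.element_count);
    }
};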

Once this struct works, the rest is just a straightforward, almost verbatim translation of the algorithms above in C++.

Ah, just a word about the UINT7 type above, although it's unrelated to this proposal. I defined this type for numbers that can be stored as either UINT8 or INT8. When one such number ends up in a strongly typed array, it gets serialized into whatever type is also good for the other elements of the same array. Marking those numbers as UINT7 is a way to defer the final decision until we know more about the other elements of the array.

Just for fun, I also wrote a special parser function that can only read optimized 2D arrays of FLOAT32. This function saves me two levels of recursion when reading such arrays, and it also saves many file reads as it slurps the whole 2D array into memory in one read. Well, these optimizations only got me about 11% speedup, which for me means that all this recursive stuff is really not that expensive.

Steve132 commented 10 years ago

@meisme and @edgar-bonet are right: In order to not be ambiguous, we have to have the @ come BEFORE the #. I didn't realize that.

Additionally, instead of each @ being a DIM, the @ could specify the NUMBER of dimensions incoming... but that's optional, I don't care either way.

@kxepal: I'll answer your questions:

Why use multiple markers to define the same thing?

Because if you don't, then a multi-dimensional array forms a number of parser ambiguities... including the one @edgar-bonet pointed out, where the binary form of multiple # in a multi-dimensional array specifier is indistinguishable from a number in a one-dimensional array specifier.

What if you need a third dimension? A fourth?

A four-dimensional array of floats (12x24x4x10):

[[][$][d][@][i][12][@][i][24][@][i][4][#][i][10]
      [3.21][323.23] .. more data

What if the data types on the X and Y axes were different?

This is a nonsensical question. Even the recursive version of strongly typed arrays containing strongly typed arrays means that, eventually, there is exactly ONE type. You can't have different types along different axes. If I'm wrong, please provide a multidimensional strongly typed array that shows what you mean.

How would you define complex nested containers?

What complex strongly-typed containers could you NOT define with this proposal? Draft 10 already defines complex nested containers of dynamic size; the only advantage of this proposal is the performance boost we can get by defining multiple statically-sized statically-typed recursive containers. Can you please provide an example of the kind of container you believe you could not represent? Because I can't think of one.

@edgar-bonet: Please correct me if I'm wrong, but can you think of ANY data structure that you could represent by implementing the 'recursive' version of the proposal that is NOT semantically equivalent to a multi-dimensional array? Because the way I see it, the ONLY power this proposal adds is the ability to represent a statically-typed multi-dimensional array and the additional performance that gives us. It gives us no additional power unless I'm mistaken.

GIVEN that argument, then we can look and see what the simplest and cheapest way to implement a multi-dimensional array is in the spec, that gives us the fewest number of bytes and the simplest parser and writer...and given that, it seems to me that my proposal for a multi-dimensional array is FAR simpler.

Consider: in your implementation example, implementing the multi-dimensional array requires changing the 'TYPE' marker from a simple enum to a complex parsed type-marker structure. This would dramatically change the reader and writer to do two passes of recursion, add context-sensitivity to the parser and a TON of complexity for context sensitivity to the writer, and require every place a TYPE marker is used or checked in the code to check whether it is one of the base type enums or a data structure containing the recursive fingerprint for the array.

In contrast, my C implementation of the parser can incorporate this change EASILY by changing:

static inline void priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);

    if (nextchar == '$')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar = priv_ubjr_context_peek(ctx);
    }
    else
    {
        *typout = UBJ_MIXED;
    }

    if (nextchar == '#')
    {
        *sizeout = priv_ubjr_read_integer(ctx);
    }
    else
    {
        *sizeout = 0;
    }
}

to

static inline int priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);
    int cdim;
    if (nextchar == '$')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar = priv_ubjr_context_peek(ctx);
    }
    else
    {
        *typout = UBJ_MIXED;
    }

    for(cdim=0;nextchar=='@';nextchar=priv_ubjr_context_peek(ctx))
    {
        sizeout[cdim++]=priv_ubjr_read_integer(ctx);
    }

    if (nextchar == '#')
    {
        sizeout[cdim++] = priv_ubjr_read_integer(ctx);
    }
    return cdim;
}

By adding only a few lines grabbing the dimensions, the parser now works without changing anything about the API... the amount of data to be read in one call to fread is simple to calculate linearly, by taking the product of the array sizes from 0 to cdim.
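
For instance (an illustrative sketch of my own, reusing the hypothetical sizeout/cdim output of the function above), the total element count is just the product of the parsed sizes:

#include <cstddef>

// Total number of elements in the N-dimensional strongly typed array:
// the product of the cdim sizes filled in by the header-parsing function above.
std::size_t total_elements(const std::size_t *sizeout, int cdim) {
    std::size_t k = 1;
    for (int i = 0; i < cdim; ++i)
        k *= sizeout[i];
    return k;  // e.g. read k * sizeof(float) bytes in a single fread() for float32 data
}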

The writer is similarly trivially updated by simply outputting the appropriate number of '@' symbols in the header... a single for loop.

Again, this is a MUCH simpler way of representing a multi-dimensional array that doesn't require implementing a context-aware recursive descent parser and representing types as a dynamic data structure that has to be validated and parsed in two passes. It's also got better space performance, as it uses fewer bytes.

Considering that we don't get any additional power out of the recursive type proposal over the multi-dimensional array, it seems better to do it this way.

kxepal commented 10 years ago

@kxepal: Since your argument against the optimized encodings applies also to the strongly typed arrays we already have, are you proposing to drop them?

Yes, that's the ugliest part of UBJSON. I'm working on an RFC that attempts to fix it, returning the format to the Draft 9 days when it was simple, clean and beautiful, yet effective. Sorry, but having one type which acts in 3 different ways (unsized array, sized array, strongly typed array), and now trying to add another one (multidimensional), is the biggest mistake. Better to use MessagePack or BED in this case if you want compact data.

How would you define complex nested containers?

What complex strongly-typed containers could you NOT define with this proposal? Draft 10 already defines complex nested containers of dynamic size; the only advantage of this proposal is the performance boost we can get by defining multiple statically-sized statically-typed recursive containers. Can you please provide an example of the kind of container you believe you could not represent? Because I can't think of one.

"The performance boost we can get by defining multiple statically-sized statically-typed recursive containers" is the key phrase of this issue. Noted. We're on different pages. You're trying to solve a narrow, specific case while I'm trying to find a more general solution. Never mind. Sorry for wasting your time.

Steve132 commented 10 years ago

I agree with you that this performance boost is the key point of this issue and I agree that it is desirable enough to warrant proposing a change for. The point I'm making is that the same performance boost can be gotten from ANY n-d array semantic, and furthermore that the proposed recursive type stuff doesn't provide ANY more semantic power than any other N-d array semantic. Therefore what we are actually discussing is "an Nd array is desirable. Therefore what/how should we implement it". The original proposal is one solution, but it is a particularly complex and inefficient way of representing an Nd array compared to my proposal

meisme commented 10 years ago

Depending on what we want accomplished, my vote goes for either the introduction of a tuple type more or less as I outlined, or @Steve132's multi-dimensional array with @ specifying the number of dimensions followed by one # marker per dimension. I'm fine with either of these choices.

edgar-bonet commented 10 years ago

@kxepal: I agree that the specification would be simpler and leaner with a single array type. And I would find it really attractive if I did not have all these big matrices to deal with. I also fail to see the use for the fixed-size-variable-type arrays. Maybe you are right, simplicity is the main selling point for UBJSON against its competitors...

Better to use MessagePack or BED in this case if you want compact data.

Thanks for the link to BED! I missed this one when searching for a binary alternative to JSON. BED's dictionary of symbols looks really attractive for when you have lots of small objects sharing the same structure. However, BED does not fit the bill for me, as it is not JSON-compatible (no nulls, nor booleans). It also lacks float32 and does not provide optimization for fixed-type arrays, not even in 1D.

MessagePack came a close second in my list of options. It generally does a better job than UBJSON at optimizing for size, at the cost of increased complexity. I settled on UBJSON because it is simpler than MessagePack and because, at the cost of a simple (in my mind at least) amendment, it is even more compact for the kind of data I have. The amended UBJSON was then both simpler and more compact than MessagePack, almost too good to be true. ;-)

I have to admit I am completely focused on my specific use case. In my application, I documented the format as being UBJSON (link to ubjson.org) with one amendment (link to this thread). If the proposal gets rejected, it will stay like this, which is fine for me. However, if the variation proposed by @Steve132 gets accepted, I will probably change my code to be standard-compliant.

@Steve132: Of course your variation on my proposal is semantically equivalent to mine. BTW, just like @meisme, I like your last iteration better, where [@] introduces the number of dimensions (default = 1) and [#] the size along one dimension. I find it more regular than [@] for each size but [#] for the last one of them.

we can look and see what the simplest and cheapest way to implement a multi-dimensional array is in the spec, that gives us the fewest number of bytes and the simplest parser and writer

Completely agree, although I would strongly prioritize the goals:

  1. Simplicity of the parser is the most important because a universal parser has to know about all the variations of the data format.
  2. Simplicity of the serializer. This is less important because a universal serializer is free to ignore the optimization option, and because an application-specific serializer would know the data structure beforehand, making either version of the proposal trivial to implement. I also think that an application that generates huge data would almost always want a zero-copy, application-specific serializer in order to avoid handing copies of the data to the serializer.
  3. Size of the header. This is the least important because the optimization is most useful for big matrices of data, and then the cost of the matrix header is largely amortized.

As for the simplicity, I do not understand your argument about context-sensitivity. I do not even understand what you mean by this word. As I understand it, the parser is always context-sensitive, as at any point in the stream it has to know whether it is expecting a type marker or a piece of data of a particular type.

Regarding recursion, you seem to imply that it is always intrinsically more “complex” than iteration. I do not see why. Actually I see recursion as a very simple and elegant way to solve some problems. And here we are dealing with the JSON data model, which allows for containers nested at any level. Recursion is really the most natural way of traversing such data structures.

sizeout[cdim++]=priv_ubjw_read_integer(ctx);

This array, provided by the caller and filled by the callee, is calling for a buffer overflow. If you do not want to impose arbitrary restrictions on the dimensionality, you should malloc()/realloc() the array inside the callee, and return both the pointer and the size. Then you end up returning (maybe as separate pieces) something that starts looking like:

struct {
    UBJ_TYPE typ;
    size_t* sizes;
    int cdim;
};

Admittedly, this is still simpler than my linked list of UbjsonType structs, but not by much. The equivalent of your code for parsing the matrix header looks like this inside my parse_type_signature() function (it's a case inside a switch):

if (byte_stream.peek() != '$') return UbjsonType(UbjsonType::ARRAY);
byte_stream.getChar();  // consume '$'
UbjsonType inner_type = parse_type_signature(byte_stream);
if (byte_stream.getChar() != '#') throw "'#' expected";
UbjsonType count_type = parse_type_signature(byte_stream);
if (!count_type.isInt()) throw "array length should be an integer";
int count = parse_data(byte_stream, &count_type);
if (count < 0) throw "array length should be positive";
return UbjsonType(
    UbjsonType::FIXED_TYPE_ARRAY,  // type
    new UbjsonType(inner_type),    // element_type
    count                          // element_count
);

If you remove the few sanity checks I've thrown for good measure, this does not look to me more complex than your code.

Also, this is just for parsing the array header, but how do you read the actual array data? We have a multidimensional array to fill. Normally I would do it like this:

QScriptValue array0 = QScriptValue::newArray(size[0]);
for (int i0 = 0; i0 < size[0]; i0++) {
    QScriptValue array1 = QScriptValue::newArray(size[1]);
    for (int i1 = 0; i1 < size[1]; i1++) {
       .
        .
         .
            QScriptValue arrayn = QScriptValue::newArray(size[n]);
            for (int in = 0; in < size[n]; in++) {
                QScriptValue value = parse_data(byte_stream, type);
                arrayn.setProperty(in, value);
            }
          .
         .
        .
        array1.setProperty(i1, array2);
    }
    array0.setProperty(i0, array1);
}
return array0;

where QScriptValue is a generic type that can hold any JSON-compatible data (a kind of a union, actually a “value” for a JavaScript engine I am using). However, this approach does not work if you do not know the number of dimensions at compile-time. Recursion makes this a lot simpler. Here is how I read the array in parse_data() once the type is known:

QScriptValue array = QScriptValue::newArray(type->element_count);
for (int i = 0; i < type->element_count; i++) {
    QScriptValue value = parse_data(byte_stream, type->element_type);
    array.setProperty(i, value);
}
return array;

Regards,

Steve132 commented 10 years ago

So I agree that we should change it to @n specifying an n-d array followed by n integers.

In order to address the buffer-overrun issue, I don't have a problem saying that the maximum value of n is 8 or some other reasonable number (I've never seen anything higher than a 5-D array used for anything in the wild). This allows an implementation to be protected from attacks where someone puts @L500000000 in the stream.

My argument against the recursive thing is that it's just a fancy way to parse/represent the number and size of the dimensions. It doesn't really represent anything else, but it's a lot more complex to validate and serialize. Even if you did want an n-d array of dynamic arrays you can still do:

[[][$][[]                   // 3d array of dynamic arrays
    [@][3][i][5][i][6][i][5]    // 5x6x5
    [[] ...                     // start of first array

You asked how you parse it... it's easy. You represent an array as an n-d array by default in the data structure, but with only 1 dimension by default. Then, when you calculate how much data to read, you compute k = d1*d2*d3*...*dn and read k elements using the same code you use to read a linear array. If your code to read a linear array is already optimal (meaning it reads sizeof(type)*k bytes for static-sized types and recursively calls read() k times otherwise), then you don't have to change any code at all for the n-d case.

By storing your n-d array as a linear array you get performance benefits too. It's how almost all n-d array implementations in scientific software are already designed to work.

Steve132 commented 10 years ago

@edgar-bonet: BTW, I wasn't arguing that recursion is ALWAYS more complex than iteration, just that a recursive specification of an N-D array descriptor in UBJSON seems to be. For example, I personally think the recursive version of the "linear index" computation for N-D arrays (that I mentioned in my last post) is simpler than the iterative one.

  linear_index = x1 + d1*x2 + d1*d2*x3 + d1*d2*d3*x4 + ... + (d1*d2*...*d_{n-1})*x_n

//multi-dimensional array to linear array lookup
size_t linear_index(size_t* indices,size_t* dims,size_t num_dims)
{
    size_t cstride=1;
    size_t cdex=0;
    for(size_t i=0;i<num_dims;i++)
    {
        cdex+=cstride*indices[i];
        cstride*=dims[i];
    }
    return cdex;
}

//recursive version
size_t linear_index(size_t* indices,size_t* dims,size_t num_dims)
{
    if(num_dims==0)
    {
        return 0;
    }
    else
    {
        return indices[0]+dims[0]*linear_index(indices+1,dims+1,num_dims-1);
    }
}
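
A quick sanity check of either version (example values of my own): for indices {2, 3, 1} in a 5x6x5 array, the linear index is 2 + 5*3 + 5*6*1 = 47.

#include <cstddef>
#include <cstdio>

// Iterative linear_index() repeated here so the example compiles on its own.
static std::size_t linear_index(const std::size_t *indices, const std::size_t *dims,
                                std::size_t num_dims)
{
    std::size_t cstride = 1, cdex = 0;
    for (std::size_t i = 0; i < num_dims; i++) {
        cdex += cstride * indices[i];
        cstride *= dims[i];
    }
    return cdex;
}

int main() {
    const std::size_t indices[3] = {2, 3, 1};
    const std::size_t dims[3]    = {5, 6, 5};
    std::printf("%u\n", (unsigned)linear_index(indices, dims, 3));  // prints 47
    return 0;
}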
AnyCPU commented 10 years ago

I agree with kxepal.

At its current state, it seems the UBJSON spec is starting to look like a Swiss Army knife.

Simple tools - a hammer and nails - are good.

ghost commented 10 years ago

As always, thank you all for the detailed discussion - I've been digging through the thread and the different approaches discussed; I've been working through them and some alternatives in long-hand to see what the experience is like, and I have a few thoughts and questions...

Point 1

I originally was for @edgar-bonet's suggestion as-is, but as I started going through what that actually ended up looking like, it was less intuitive than I expected, and I somewhat agree with @Steve132's point...

More specifically, your "header" ends up looking like this:

[{]                         // this is an object...
    [$][[]                  // of arrays...
        [$][{]              // of objects...
            [$][[]          // of arrays...
                [$][i]      // of int8s...
            [#][i][3]       // (array) of size 3
        [#][i][3]           // (object) of size 3
    [#][i][3]               // (array) of size 3
[#][i][3]                   // (object) of size 3

and I feel pretty strongly that it should look more like this:

[{]                             // this is an object
    [$][[][#][i][3]                 // of arrays (3 of them)
        [$][{][#][i][3]             // of objects (3 of them)
            [$][[][#][i][3]         // of arrays (3 of them... again)
                [$][i][#][i][3]     // of int8s (3 of them)

I think this parsing of pairs is mentally easier to track and when written down in this fashion, easier to understand/explain/follow.

Point 2 (a lot of examples)

Consider the following JSON as our baseline example for this proposal

JSON

646 bytes

{
    "array1": [
        {
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        }
    ],
    "array2": [
        {
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        }
    ],
    "array3": [
        {
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        },{
            "vals1": [16, 32, 64],
            "vals2": [16, 32, 64],
            "vals3": [16, 32, 64]
        }
    ]
}

Now we write it in normal UBJSON

UBJSON

425 bytes (34.2% smaller than JSON)

[{]
    [i][6][array1][[]
        [{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}]
    []]
    [i][6][array2][[]
        [{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}]
    []]
    [i][6][array3][[]
        [{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}][{]
            [i][5][vals1][[][i][16][i][32][i][64][]]
            [i][5][vals2][[][i][16][i][32][i][64][]]
            [i][5][vals3][[][i][16][i][32][i][64][]]
        [}]
    []]
[}]

Not bad, 34% size reduction just for getting out of bed in the morning (not even any optimizations).

Now let's rewrite it as UBJSON Draft 11 optimized container:

(Draft 11) Strongly Typed UBJSON

434 bytes (32.8% smaller than JSON, 1.4% BIGGER than standard UBJSON)

[{]
    [i][6][array1][[]
        [{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}]
    []]
    [i][6][array2][[]
        [{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}]
    []]
    [i][6][array3][[]
        [{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}][{]
            [i][5][vals1][[][$][i][#][i][3][16][32][64]
            [i][5][vals2][[][$][i][#][i][3][16][32][64]
            [i][5][vals3][[][$][i][#][i][3][16][32][64]
        [}]
    []]
[}]

Our optimized containers, while faster to read, are 1.4% bigger... that's a bummer.

Now let's take @edgar-bonet's proposal mixed with my modification, rewrite this, and see where we get...

(Draft 12) Proposed Strongly Typed UBJSON

135 bytes (79.1% smaller than JSON, holy @#%$!)

[{]                                 // this is an object
    [$][[][#][i][3]                 // of arrays (3 of them)
        [$][{][#][i][3]             // of objects (3 of them)
            [$][[][#][i][3]         // of arrays (3 of them... again)
                [$][i][#][i][3]     // of int8s (3 of them)
    [i][6][array1]                  // in {}, array1 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
    [i][6][array2]                  // in {}, array2 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
    [i][6][array3]                  // in {}, array3 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values

What I like about this approach:

What I don't like about this approach:

The size reduction of this is a significant win... I don't think this can be ignored -- that said, I am not eager to make modifications to the spec that make it impossibly hard to implement in some languages.

What I would appreciate from you guys is to double-check my logic and, using your expertise in languages I am weak in, eyeball this and see how the proposal might look implemented in your language of choice... not bad? horribly difficult? etc.

Also any general thoughts/comments always welcome.

AnyCPU commented 10 years ago

vals { row:30, column:3, data: { 16, 32, 64, 16, 32, 64 ... ... ... 16, 32, 64 } } ;)

edgar-bonet commented 10 years ago

This last iteration of the proposal looks simpler on paper but I am not sure it will be simpler to implement. In my original version, a strongly typed array has a “fully qualified type” that could be written BNF-style as:

<STA_type> ::= [ $ <type> # <count>

where the inner type is also a fully qualified type. This is a very simple derivation rule, and all its power stems from the fact that, since <STA_type> is a <type>, the rule can be recursively applied to any dimension.

Now, if I try to formalize this last version of the proposal as BNF-style rules, I get something like:

<STA_type> ::= [ <type_count_list>
<type_count_list> ::= $ <primitive_type> # <count>
                    | $ <container_type> # <count> <type_count_list>

where the inner types are not fully qualified; they are just single characters. Frankly, this looks more complex to me than the single rule above. And since the rules are more complex, I suspect it will be more complex to implement.

Also, if you want to know what kind of things you have in your container, you have to combine pieces of information that are not contiguous in this syntax. Namely the first <type> in the <type_count_list> and the trailing <type_count_list> after the first <count>.

OK, here comes a counter-proposal: in order to have a syntax that is both easy for humans to read and still relies on a simple derivation rule, just exchange the positions of $<type> and #<count> in the first proposal. Thus:

<STA_type> ::= [ # <count> $ <type>

where again, the inner type is fully qualified. Following this rule, the header of your example becomes:

{ # i 3                // this is an object of length 3
  $ [ # i 3            // of arrays of length 3
      $ { # i 3        // of objects of length 3
          $ [ # i 3    // of arrays of length 3
              $ i      // of int8s

the problem I haven't worked through yet (that I am hoping you guys can weigh in on) is the complexity around all the implied transitioned state that is defined by the header -- specifically all the steps above labeled "enter" where the scope of the parsing has silently moved into a new scope by virtue of the parser keeping track.

There is a magic thing called “recursion” that makes tracking all those transitions absolutely trivial.
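
For instance, here is a self-contained sketch (mine, purely illustrative, not normative) of parsing the count-before-type header above with a single recursive function; all names are invented, and for brevity every container is assumed to be strongly typed with an int8 count:

#include <istream>
#include <stdexcept>

// Parsed form of <STA_type> ::= [ # <count> $ <type> (and the { variant).
struct StaType {
    char marker;            // 'd', 'i', '[', '{', ...
    int count;              // number of elements; -1 for primitive types
    StaType *element_type;  // fully qualified inner type, or 0 for primitives
};

static void expect(std::istream &in, char wanted) {
    char c;
    if (!in.get(c) || c != wanted)
        throw std::runtime_error("unexpected byte in header");
}

// One recursive function tracks all the "enter container" transitions for free.
StaType *parse_header(std::istream &in) {
    char c;
    if (!in.get(c)) throw std::runtime_error("unexpected end of header");
    StaType *t = new StaType();
    t->marker = c;
    if (c == '[' || c == '{') {              // strongly typed container
        expect(in, '#');
        expect(in, 'i');                     // int8 count is enough for this sketch
        char n;
        in.get(n);
        t->count = n;
        expect(in, '$');
        t->element_type = parse_header(in);  // the inner type is fully qualified: recurse
    } else {                                 // primitive marker such as 'd' or 'i'
        t->count = -1;
        t->element_type = 0;
    }
    return t;
}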

Steve132 commented 10 years ago

Your example doesn't speak at all to @edgar-bonet's original use case, however, which was to have arrays of arrays of arrays of arrays which would then allow largely contiguous data elements after.

That seems to be the primary 'purpose' of this proposal, not the example given here. Yes, if we change UBJSON into a pre-schema language then sure, we can save some bytes, but it's an absolutely incredible amount of complexity for not a whole lot of savings, except in somewhat pathological cases like this one that I'm not convinced really occur in the real world.

In contrast, I absolutely understand @edgar-bonet's need for an n-dimensional array of contiguous data, which absolutely shows up in the real world...ALL THE TIME. I myself also have this need which is why I was interested in this proposal.

ghost commented 10 years ago

@edgar-bonet count-before-type - I like this suggestion - I thought there was a reason I had defined it in the other order originally but that reasoning either doesn't exist or is slipping my mind. Either way I think it makes sense.

@Steve132 I don't understand the seemingly aggressive tone of your reply - my examples above are a visualization of @kxepal's suggestion of a nightmare scenario, to show how the proposal would work. Applying it to @edgar-bonet's original example makes it look near identical to his 3rd example in the original post. I didn't feel it necessary to restate that again as it was so similar.

To your 2nd and 3rd paragraphs - help me understand the distinction you are making - you find the proposal I added above too complex, but if I remove support for objects from the proposal such that it allows all types and just arrays (to support n-dimensional arrays), I more or less end up with what @edgar-bonet proposed... which it sounds like you were quite happy with -- so is it just the inclusion of support for objects in this spec that upsets you?

ghost commented 10 years ago

For the sake of keeping the conversation as neatly organized as possible (Data-heavy conversations like this are easy to get lost in)...

Combining my proposal above with @edgar-bonet's modification would yield my final example of this:

[{]                                 // this is an object
    [#][i][3][$][[]                 // of arrays (3 of them)
        [#][i][3][$][{]             // of objects (3 of them)
            [#][i][3][$][[]         // of arrays (3 of them... again)
                [#][i][3][$][i]     // of int8s (3 of them)
    [i][6][array1]                  // in {}, array1 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
    [i][6][array2]                  // in {}, array2 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
    [i][6][array3]                  // in {}, array3 label
                                    // enter {} > [], type implied
                                    // enter {} > [] > {}, type implied
        [i][5][vals1]               // in {} > [] > {}, vals1 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals2]               // in {} > [] > {}, vals2 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values
        [i][5][vals3]               // in {} > [] > {}, vals3 label
                                    // enter {} > [] > {} > [], type implied
            [16][32][64]            // in {} > [] > {} > [], int8 values

This would be my final proposed form, which addresses the two changes to STCs (a small writer sketch follows this list):

  1. Change the ordering of type-count to count-type for strongly typed containers.
  2. Allow strongly typed containers to contain other strongly typed containers.
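
As a concrete illustration of the two changes, here is a minimal, non-normative C++ sketch of a header writer under the proposed rules; the function and its name are invented for this example, and the count is limited to the int8 form for brevity:

#include <cstdint>
#include <ostream>

// Sketch only: emits the header that follows a '[' or '{' marker under the
// proposed Draft 12 rules -- count (#) first, then the element type ($),
// which is now allowed to be another container marker ('[' or '{').
void write_stc_header(std::ostream& out, std::int8_t count, char element_type) {
    out.put('#');                       // proposed ordering: count first...
    out.put('i');                       // ...stored as an int8...
    out.put(static_cast<char>(count));
    out.put('$');                       // ...then the element type
    out.put(element_type);              // e.g. 'i', 'd', or '[' / '{' (change 2)
}

For the nested headers at the top of the example above, a writer would call out.put('{') followed by write_stc_header(out, 3, '['), write_stc_header(out, 3, '{'), write_stc_header(out, 3, '[') and write_stc_header(out, 3, 'i'), then emit the labels and int8 payloads.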

One more thing...

In parallel to this, @kxepal has been working on an alternative proposal to STCs in general which has one big advantage over what we have in UBJSON currently -- chunked/mixed-type containers.

The underlying premise behind @kxepal's proposal is more flexible and more aligned with JSON (mixed-type containers are entirely legal)... distilled down to its core, the idea is to allow the definition of a "header" (like what I have at the top of the example above) at any point in a container.

For example a single array may have 4 headers defined in it as a server is streaming content to the client... the types might even change during that streaming.

Before ratifying what I have above into Draft 12, I wanted to get a chance to consider that one big distinction from @kxepal's proposal and see if we wanted to go down that path with Draft 12 before defining UBJSON 1.0 -- the flexibility is a huge win, but as with everything in computing, flexibility means complexity... I don't think it is a horrific amount of complexity beyond what I have above (and it might be worth it), but I wanted a chance for us to think about it.

Alex, if you don't get a chance, I can try to crystallize my interpretation of your proposal, following the basic structure above, and formalize it into a proposal issue.

Your RFC (as it is now) is a significant change to UBJSON in notation and not something I would try and tackle in a single Draft revision, but I think the crux of your proposal is brilliant.

Steve132 commented 10 years ago

I'd like to clarify I'm not trying to sound aggressive. Just passionate, I guess :)

@Steve132 I don't understand the seemingly aggressive tone of your reply - my examples above are a visualization of @kxepal's nightmare-scenario suggestion, to show how the proposal would work. Applying it to @edgar-bonet's original example makes it look nearly identical to his 3rd example in the original post. I didn't feel it necessary to restate it, as it was so similar.

No aggression intended. I understand that your example was intended as an edge case.

To your 2nd and 3rd paragraphs - help me understand the distinction you are making. You find the proposal I added above too complex, but if I remove support for objects from it, so that it allows all value types plus nested arrays (to support n-dimensional arrays), I more or less end up with what @edgar-bonet proposed... which it sounds like you were quite happy with -- so is it just the inclusion of objects in this proposal that upset you?

No, my point is that @edgar-bonet's proposal is itself fundamentally too complex because semantically the only thing it communicates is n-dimensionality. It is semantically only as powerful as an n-dimensionality specifier (I'm fine with n-dimensional objects, btw). The problem that I have is that there is no reason to do a context-sensitive recursive parse in order to support n-dimensionality when there is a significantly simpler and easier method to support n-dimensionality.

The way I see this conversation structured is the following:

OP: "I have an n-dimensional container. It would be awesome for performance and for readability if I could pre-specify the n-dimensional container and then load all of it all at once with no intermediate markers"

(I agree so far)

Then: "Therefore, here's a complicated way we could completely redesign the parser to detect n-dimensional markers by splitting every container into a full-container schema specification and validation pass and then holding that schema in memory while we walk the data"

(Whoa! I have problems with this! Thats a radical change and incredibly complex)

Me: "Counter-proposal: Why don't we just specify n-dimensional markers directly? Seems easier to me"

That's basically where I still am now, because I haven't yet been shown any case where the 'pre-parsed schema' version provides any semantic expressive power over just a dimensions header, and although I have seen cases (like your example) where the pre-parsed schema version provides some byte savings, I'm not convinced those byte savings are worth the dramatically increased complexity.

Steve132 commented 10 years ago

Another thing I'd like to point out: it seems like at least part of the motivation for this is to allow STCs inside other STCs even when it doesn't provide a performance benefit. By my reading, that is not explicitly disallowed by the current spec. You can currently have an STC array of STC objects like

[[][#][i][3][$][{]             //an array of 3 dictionaries
    [#][i][2][$][i]            //first dictionary is STC dict of 2 8-bit ints
        [i][4][key1][6]
        [i][4][key2][10]
    [#][i][2][$][s]          
        [i][4][key1][i][3][abc]
        [i][4][key2][i][3][efg]
    [#][i][2][$][I]
        [i][4][key1][2334]
        [i][4][key2][2332]

This case can't benefit from increased performance anyway because of the dynamic length of the keys in the dict, so the schema parsing doesn't really help much here.

ghost commented 10 years ago

No, my point is that @edgar-bonet's proposal is itself fundamentally too complex

Ahhh, that's where I was getting confused. We're on the same page now :)

Then: "Therefore, here's a complicated way we could completely redesign the parser to detect n-dimensional markers by splitting every container into a full-container schema specification and validation pass and then holding that schema in memory while we walk the data"

Haha, point taken - I guess you probably shouldn't read #48 then...

The radical flexibility proposed a little in this discussion and formalized in #48, which is significantly more complex than just adding simple serial headers, does bring UBJSON more in line with the realities of JSON itself (as far as functionality) and also provides some significant space wins in certain cases.

I'm not convinced those byte savings are worth the dramatically increased complexity.

This is the big hang-up and I have a feeling arguing this will be a lot like arguing religion... so here's my thinking...

  1. Currently, as of Draft 11, UBJSON imposes (unnecessary?) restrictions on you if you want to leverage strongly typed containers - in that those containers cannot contain other containers or mixed types... there isn't a strong technical reason for this, as we've solved it 3 different ways (2 in this thread and 1 in #48).
  2. Using what is proposed in this discussion solves one limitation of STCs but imposes one more - optimization goes out the window in the case of mixed-type containers.
  3. (I can already hear you...) Yes I agree this is not a common case, but purely at an academic level it is a limitation we can design out of the specification quite elegantly with what is proposed in #48
  4. The complexity being proposed is no worse than that of a spec like Smile, Protobuf, Thrift, etc... yes, it is more complex than the original UBJSON, but the space and performance wins can potentially be significant, and I think the one-time cost of implementing the spec can lead to long-term wins as libraries mature.
  5. Every data format that has ever been successful has had an inherent flexibility to it - an ability to grow well beyond its original intent, for better (or worse) -- think of XML, and even JSON plus all the child specifications (like JSON Schema) that have gone along with it... this is a requirement for a format that wants to be successful. I think the flexibility being proposed, while complex, is worth that cost. This flexibility, allowing UBJSON to be anything to anybody, is what gives it 10-year legs. It is a big motivator for why I'm taking years to formalize 1.0: I don't want there to be a 1.1 or 2.0 -- just like JSON, it would freeze in time as JSON's better-looking and sleeker brother :)

ghost commented 10 years ago

@Steve132 Responding to your last post - the STC spec currently allows the [$] type marker to only be one of the Value Types, so no containers-of-containers.

Steve132 commented 10 years ago

UBJSON imposes (unnecessary?) restrictions on you if you want to leverage strongly typed containers - in that those containers cannot contain other containers or mixed types...

I don't agree that this is written explicitly in the spec. I certainly didn't see it that way and I didn't implement it that way in ubj. I just re-called the type-parsing code but saved which type it was, including other containers. It worked fine and had fewer special cases.
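
For what it's worth, here is a self-contained C++ sketch of that approach (not the actual ubj code, which is C): the value parser is simply re-entered for whatever marker follows [$], so an STC whose declared type is another container needs no special case. Only int8 counts and the 'i', 'd' and '[' markers are handled, to keep the sketch short.

#include <cstdint>
#include <istream>
#include <stdexcept>

// Hedged sketch, not ubj: a tiny recursive parser where the marker after [$]
// may be a value type or another container, with no special casing.
static std::int64_t read_count(std::istream& in) {
    if (in.get() != '#' || in.get() != 'i')
        throw std::runtime_error("sketch only handles [#][i][n] counts");
    return static_cast<std::int8_t>(in.get());
}

static void parse_value(std::istream& in, int marker);

static void parse_stc_array(std::istream& in) {    // called after '[' is consumed
    if (in.get() != '$')
        throw std::runtime_error("sketch only handles strongly typed arrays");
    int element_type = in.get();                    // 'i', 'd', or '[' -- no restriction
    std::int64_t count = read_count(in);
    for (std::int64_t i = 0; i < count; ++i)
        parse_value(in, element_type);              // recursion covers nested STCs
}

static void parse_value(std::istream& in, int marker) {
    switch (marker) {
    case 'i': in.ignore(1); break;                  // int8 payload (skipped here)
    case 'd': in.ignore(4); break;                  // float32 payload (skipped here)
    case '[': parse_stc_array(in); break;           // a nested strongly typed array
    default:  throw std::runtime_error("marker not covered by this sketch");
    }
}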

the STC spec currently allows the [$] type marker to only be one of the Value Types, so no containers-of-containers.

Ah, ok. So, all you have to do is remove that restriction. The "containers of containers" use case is then solved. No schema parsing or pre-recursive validating needed.

If you relaxed that restriction (and ubj shows that you can; I didn't even realize the restriction was there when I implemented it, and leaving it unrestricted made for prettier code, so it didn't occur to me), you could choose to encode the "Draft 11" version of your example as

[{][#][i][3][$][[]
    [i][6][array1][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [i][6][array2][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [i][6][array3][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]

in the current Draft 11. This is actually a little bit bigger than the equivalent JSON, but for arrays larger than 3 the STC savings get significant, even in Draft 11, without having to change anything at all. (This somewhat furthers my argument that the example given is something of a pathological case: Draft 11 with STCs starts to get smaller than Draft 11 without STCs at inner arrays of size 4, literally the next size up.)

In summary, my argument is that the use case of "using STCs inside other STCs" is already supported by Draft 11 (if you remove that restriction, which I didn't notice before) and already provides size benefits except for sizes <= 3, so that use case doesn't need to be addressed.

And the use case of "efficient and simple n-dimensionality" is not supported by Draft 11, but could easily be added by handling an @n specifier, followed by n integers, in the container header parsing code -- which is how most n-dimensional array libraries already work, and would be a billion times simpler.
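
To make that concrete, here is a hedged sketch of what reading such an n-D header could look like. The '@' marker, the exact layout assumed below ([@][n] followed by n dimension integers, then [$][type] and a flat payload), and every name in the code are illustrative assumptions only -- none of this is in any draft.

#include <cstdint>
#include <istream>
#include <stdexcept>
#include <vector>

// Sketch only: all integers are restricted to the int8 form ([i][n]) so the
// example stays short; a real proposal would allow any integer type.
struct NdHeader {
    std::vector<std::int64_t> dims;   // e.g. {3, 2} for a 3x2 array
    char element_type;                // e.g. 'd' for float32
    std::int64_t total() const {      // number of scalars in the flat payload
        std::int64_t n = 1;
        for (std::int64_t d : dims) n *= d;
        return n;
    }
};

static std::int8_t read_int8(std::istream& in) {
    if (in.get() != 'i') throw std::runtime_error("sketch expects int8 markers");
    return static_cast<std::int8_t>(in.get());
}

static NdHeader read_nd_header(std::istream& in) {   // called after '[' is consumed
    if (in.get() != '@') throw std::runtime_error("expected hypothetical '@' marker");
    NdHeader h;
    std::int8_t n = read_int8(in);                    // number of dimensions
    for (std::int8_t i = 0; i < n; ++i)
        h.dims.push_back(read_int8(in));              // each dimension in turn
    if (in.get() != '$') throw std::runtime_error("expected '$' and the element type");
    h.element_type = static_cast<char>(in.get());
    return h;                                         // then read h.total() scalars
}

The appeal is that the payload is then a single flat run of h.total() scalars, which maps directly onto how most n-dimensional array libraries store their data.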

ghost commented 10 years ago

In summary, my argument is that the use case of "using STCs inside other STCs" is already supported by Draft 11 (if you remove that restriction, which I didn't notice before) and already provides size benefits except for sizes <= 3, so that use case doesn't need to be addressed.

Heh, just to be clear, it's not officially supported in Draft 11, but I certainly appreciate that you have a proof of concept showing how trivial it would be to add support for it.

Maybe we break the work into two pieces:

Draft 12 (proposed)

  1. Remove restriction of STC containing STC.
  2. Modify definition of STC to be COUNT-TYPE (not TYPE-COUNT) as requested by @edgar-bonet to make the grammar cleaner.

RESULT: No major change to the spec; no major increase in size/performance.

Draft 13 (proposed)

  1. Discuss generalizing STC headers so they can be used anywhere inside a container, leading to potentially huge space savings in select cases (e.g. scientific data) - #48

RESULT: Significant change to parsing/generating of STCs; potentially huge win in data size and parsing performance in special cases.

Thoughts?

Steve132 commented 10 years ago

Regarding your points:

Proposal 48 seems entirely unnecessary to me as a way of solving the intended problem. The "problem" is that STCs can't contain containers. As I said, as an implementer I implemented draft 11 UNAWARE of this restriction and my code gracefully handles it easily. This shows that the restriction is entirely cosmetic. If the purpose of proposal 48 is to solve this problem, the text of proposal 48 should contain the following string only: "Proposal: remove the restriction that the marker following a [$] must be a value type". Solved.

Point 2: I'm not sure I understand how relevant this is; as far as I understand it, neither STCs nor this schema proposal can be used with mixed-type containers. Perhaps I'm confused?

Point 3: It's a limitation we can design out of the spec by simply deleting the offending restriction from the text of the spec. No other changes needed :)

Points 4 and 5 can be used to justify literally any change. If I proposed that we include a variant of the JPEG compressor in order to have significant savings in the use case of sending binary-encoded bitmap data in 2-D arrays, I could easily justify it by saying "Well, in certain cases the space savings are significant, and we expect changes to happen over time as things mature, etc." That wouldn't make my proposal a good one for UBJSON.

Regarding point 5 in particular, I think the whole point is that XML is bad because of how complex it got. All the schemas and XSLT and complicated parsing rules and interactions with UTF-8 meant that in practice nobody had a conformant implementation and nobody actually liked using it anyway. When JSON came out everyone was happy BECAUSE it was so radically simple and threw away all of the things that got tacked on over time to make XML more "flexible". Now, JSON has some of that same cruft, but I don't know anyone that uses those crufty parts.

I was under the impression that the primary reason to use UBJSON over something like Smile was that UBJSON is different from Smile: low complexity, easy to parse, few corner cases or restrictions, no application-specific optimizations or data types. I feel like part of the POINT when discussing UBJSON is that we are going to make sure that UBJSON is NOT Smile and never BECOMES Smile.

People use JSON because it's NOT XML.

Steve132 commented 10 years ago

So, I totally agree with your draft 12 btw.

Draft 13 would then be where we push the issue of how best to handle the 'scientific data' header. My argument is that the best way for scientific data is just the n-D array specification with @n and dimensions, because it leads to super fast, super efficient code that maps well to how scientific data libraries currently work, and (as someone who wants to use UBJSON for scientific data) it covers all the use cases I can foresee.

But discussing whether to do the schema version of the n-D array or my version is best pushed to Draft 13, I agree, and is a separate issue.

ghost commented 10 years ago

The "problem" is that STCs can't contain containers.

It's more than that - as you pointed out, that particular limitation is easy to solve - but the need to change your data model to match UBJSON's limitations is still there... for example, consider an array with the values:

[1, 2, 3, 300, 301, 302, 1.23, 2.34, 3.45]

You cannot model that in UBJSON as a single (optimized) array - you would either need to break it into 3 arrays (problem: changing your data to meet the limitations of a spec) or use a single unoptimized array (problem: no space/performance savings).

If the purpose of proposal 48 is to solve this problem, the text of proposal 48 should contain the following string only: "Proposal: remove the restriction that the marker following a [$] must be a value type". Solved.

Proposal #48 is trying to solve much more than that. One of the biggest differences is the ability to define the STC headers for all of the data upfront, if your data allows it, and avoid repeating them inside the payload.

Point 2: I'm not sure I understand how relevant this is; as far as I understand it, neither STCs nor this schema proposal can be used with mixed-type containers. Perhaps I'm confused?

You are right, STCs and what is being discussed in this issue cannot - that is why #48 was written as a super-solution to all of these limitations. If you get a chance to read it in detail, let me know your thoughts (it reads pretty quickly; it's mostly big examples).

Points 4 and 5 can be used to justify literally any change.

To your larger point ("let's be realistic about what we perceive as wins") I agree; I am perceiving a fairly common scenario (especially in geospatial and scientific data) where allowing a single header definition for containers can lead to a significant performance win.

I didn't think I was being too unreasonable here; any data that consists of large collections of small containers benefits from this change; the smaller the container, the bigger the savings (up to something like a 90% size reduction).

Right? I hope I'm not being obtuse here...

I think the whole point is that XML is bad because of how complex it got.

Uh oh... religious discussion :)

Now, JSON has some of that same cruft, but I don't know anyone that uses those crufty parts.

I am not deaf to your arguments - trust me, I hear them - I just think the sample size of this group is far too small to pretend we know exactly what JHL would do with UBJSON, what Valve is doing with it in their OpenGL debugger, what the Smart Meets social network in France is doing with it, or what CouchDB might do with it as a backing data storage format.

(as examples of why I think the argument of 'I don't know anybody that...' is valueless or just not needed here)

People use JSON because it's NOT XML.

I understand - again, I'm not suggesting this change be made out of spite for the users of UBJSON; I am perceiving a big benefit to it that makes it worth it.

So, I totally agree with your draft 12 btw.

Ok

But discussing whether to do the schema version of the n-D array or my version is best pushed to Draft 13, I agree, and is a separate issue.

Ok - I would like to give others a chance to weigh in before any decisions are made. Thanks for the discussion!

meisme commented 10 years ago

To me (and my code) normal, count-optimised and type-optimised containers are fundamentally different, and I appreciate being able to tell them apart by the first byte. I would not like to see the # marker placed before the $ marker for type-optimised containers if Draft 12 ends up being chosen.

ghost commented 10 years ago

@meisme can you give a little more detail?

meisme commented 10 years ago

Well, as it currently stands I've got more or less

void Parser::_parse_array() {
    // dispatch on the first byte after the '[' marker
    switch (_in.peek()) {
    case '#':
        _parse_array_countoptimized();
        break;
    case '$':
        _parse_array_typeoptimized();
        break;
    default:
        _parse_array_normal();
    }
}

As you can see, it's easy to implement the delegate function. The _parse_array_normal and _parse_array_countoptimized functions are very similar, with countoptimized basically just priming the memory and calling the other. _parse_array_typeoptimized, on the other hand, is very different, featuring performance optimizations. It's very nice to be able to make that distinction right away.

If # were moved first, the logic would have to be as below, which is more ... spaghetti.

void Parser::_parse_array() {
    if (_in.peek() == '#') {
        _parse_array_count();
    } else {
        _parse_array_normal();
    }
}

void Parser::_parse_array_count() {
    int len = _parse_count();
    if (_in.peek() == '$') {
        // only now do we know this is the type-optimized fast path
        _parse_array_typeoptimized(len);
    } else {
        // prime memory
        _parse_array_normal(len);
    }
}

I mean... sure, this works too. But it's uglier.

edgar-bonet commented 10 years ago

I personally do not see the point of count-optimized arrays. You cannot really prime the memory, since you do not know the size of your objects. Why not drop it from the spec?

Steve132 commented 10 years ago

You cannot really prime the memory, since you do not know the size of your objects.

Yes, you can. You can't prime ALL the memory all the way down, of course, but you can prime the size of the array you are using to store the children.

See https://github.com/Steve132/ubj/blob/master/ubjr.c#L358

In UBJ, a count-optimized array of mixed children still travels along the fast path and needs minimal memory allocations, because the mixed type is implemented as a union. Since containers only store pointers, the union is fairly small and has a constant size, so if you know the count in advance you can pre-allocate N instances of the union and then populate them dynamically. This is the same code path as for the fixed-size types, and it uses O(1) memory allocations instead of the O(log(n)) allocations needed in the unknown-length case. It's also simpler.

Especially if none of the mixed elements are containers: in that case you only need a single allocation for the entire array.
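
A rough C++ analogue of that point (ubj itself is C and uses its own union type; the names here are invented for illustration):

#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Each element is a small, fixed-size tagged union, so once the count is
// known the whole container can be reserved with one allocation and then
// populated as values arrive. String or container payloads would still
// allocate their own storage, as noted above.
using Value = std::variant<std::int64_t, double, std::string>;

std::vector<Value> parse_count_optimized(std::size_t count /*, input stream... */) {
    std::vector<Value> out;
    out.reserve(count);                    // one allocation instead of repeated regrowth
    for (std::size_t i = 0; i < count; ++i) {
        // a real parser would read the next marker and value here; placeholder:
        out.emplace_back(std::int64_t{0});
    }
    return out;
}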

ghost commented 10 years ago

@meisme thank you for the detail -- I would appreciate it if you and @edgar-bonet could come to a consensus on the order of # and $ you would like to see, as you are both requesting opposite things for good reasons.

@edgar-bonet count-optimization opens the door to potentially very highly tuned implementations of UBJSON parsing -- for example, parsing same-typed arrays of known size (particularly important in binary cases).