Make container header more structured and consistent in behavior

ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

115 stars 12 forks source link

Make container header more structured and consistent in behavior #51

Open ghost opened 10 years ago

ghost commented 10 years ago

(Inspired by #50)

This would be an effort to make the definitions of Array and Object singular with no optional parameters as well as simplify the definition.

The TYPE and LENGTH values can be Z (NULL) to indicate they are not provided, but the point is that at least the first 2 bytes of every container header would be TYPE and LENGTH information.

Type-Length-Value Definition: ([|{)(#)(TYPE_MARKER)(NUMERIC_INT_VALUE)(...payload...)

([|{) - Marks the beginning of an array or object container.
# - Marks the header of the container (using an explicit marker because another proposal may be forthcoming to allow this marker to repeat within the context of containers)
TYPE_MARKER - Z or any marker that represents a Valid Type
NUMERIC_INT_VALUE - 'Z' or any integer numeric value
Notes
If TYPE_MARKER represents a numeric type, NUMERIC_INT_VALUE must not be null (as we wouldn't know when to stop parsing)
If TYPE and LENGTH are not provided ([#][Z][Z]) then the 3-byte block can be omitted since this is a standard array or object container.
Array Example

[[][#][Z][Z]
  [i][1]
  [S][i][5][hello]
[]]

[[]
  [i][1]
  [S][i][5][hello]
[]]

[[][#][Z][i][3]
  [i][1]
  [C][a]
  [S][i][3][bob]

[[][#][S][Z]
  [i][1][a]
  [i][1][b]
  [i][1][c]
[]]

[[][#][S][i][4]
  [i][3][foo]
  [i][3][bar]
  [i][3][bob]
  [i][3][joe]

Object Example

[{][#][Z][Z]
  [i][4][name][S][i][3][bob]
  [i][2][id][L][123456789]
  [i][5][admin][T]
[}]

[{]
  [i][4][name][S][i][3][bob]
  [i][2][id][L][123456789]
  [i][5][admin][T]
[}]

[{][#][Z][i][3]
  [i][4][name][S][i][3][bob]
  [i][2][id][L][123456789]
  [i][5][admin][T]

[{][#][S][Z]
  [i][4][name][i][3][bob]
  [i][8][password][i][8][pass1234]
  [i][4][role][i][5][admin]
[}]

[{][#][S][i][3]
  [i][4][name][i][3][bob]
  [i][8][password][i][8][pass1234]
  [i][4][role][i][5][admin]

Pro

Makes parsing the container types more consistent as they always have the same shape - you always know peeking at the 1st byte gives you the type and peeking at the second gives you an indication if a size should be parsed or not.
Con
For very small collections (e.g. array of 1-2 elements) the 2 extra bytes per container header will increase the size of the representation.

kxepal commented 10 years ago

Two issues:

Optional closing marker. Better to require it always to insure that container data is over. This will save you from eating data from your neighbourhoods:

(edited for readability by Riyad)

[[][Z][i][2]
  [[][Z][i][3]
    [T]
    [T]
    [[][Z][Z]
      [F]
    []]

Which array causes the issue here: first or second? Happy debugging. Closing marker signals parser that elements of the related containers are over and it's time to check all of them has been received or not - you'll always know the exactly source of the problem.

Container header isn't explicitly defined by the marker. Without own marker this solution closes the doors for #50 feature of having containers with mixed schemas - this problem would be hard to ignore.

For very small collections (e.g. array of 1-2 elements) the 2 extra bytes per container header will increase the size of the representation.

This isn't big problem since we have other "unoptimal" types like a strings which causes: they also has "significant" overhead for short texts.

ghost commented 10 years ago

[QUESTION] I can always get onboard with "safer than sorry", but this is the first time this has come up - we've omitted closing characters from containers since UBJSON's inception; any particular reason this is coming up now? Also to your example specifically, the first array causes the problem because it stated that it had 2 elements but only contains 1 (the 3-element array) -- maybe I am missing your point?
[AGREE] Ahh, you mean using # to mark the header so it can potentially be repeated? If so, I agree, I made the change above.
[DISAGREE] I don't agree that it isn't a problem just because we already have types that introduce the same unoptimization -- just because A is bad doesn't excuse B from being bad too :) -- BUT, I want to see how interested people are in this proposal before we decide to make this change.

kxepal commented 10 years ago

[QUESTION] I can always get onboard with "safer than sorry", but this is the first time this has come up - we've omitted closing characters from containers since UBJSON's inception; any particular reason this is coming up now? Also to your example specifically, the first array causes the problem because it stated that it had 2 elements but only contains 1 (the 3-element array) -- maybe I am missing your point?

It's hard to say for sure, since may be first array is correct and the second one miss the third element. You cannot provide the meaningful and helpful error report for such cases. In contrast, every modern JSON parses are able to show you direct source of the problems with byte precision.

The good format shouldn't be just compact, fast and simple, it also should have to provide ability to inspect and locate the issues caused by malformed data (real world isn't perfect). Here you can point me on XML where it doesn't simply to guess which tag miss closing pair, but I believe we're better, right? (;

[AGREE] Ahh, you mean using # to mark the header so it can potentially be repeated? If so, I agree, I made the change above.

Thanks!

[DISAGREE] I don't agree that it isn't a problem just because we already have types that introduce the same unoptimization -- just because A is bad doesn't excuse B from being bad too :) -- BUT, I want to see how interested people are in this proposal before we decide to make this change.

Agreed (: However, since you'd introduced # marker the problem is solved by itself: [#][Z][Z] has no meaning and MAY be omitted (as per #50) (;

ghost commented 10 years ago

The good format shouldn't be just compact, fast and simple, it also should have to provide ability to inspect and locate the issues caused by malformed data (real world isn't perfect).

I totally agree.

It's hard to say for sure, since may be first array is correct and the second one miss the third element.

Alex, my apologies I am just not seeing this confusion, here is the example below:

1. [[][Z][i][2]
2.   [[][Z][i][3]
3.     [T]
4.     [T]
5.     [[][Z][Z]
6.       [F]
7.    []]

For a parsing keep track of scope, it will:

Line 1, enter array scope 1
Line 2, enter array scope 2
Line 3-5, parse 3 elements
Line 5, enter array scope 3
Line 6, parse 1 element
Line 7, close array scope 3
Line 7, closed array scope 2 (since 3 elements were completely parsed)
Line 8, THROW EXCEPTION, "expected additional element @ byte offset 16"

I know I must be missing your point and absolutely agree with the fundamental of what you are proposing (safer structure, better/accurate error reporting, real data in real world is sloppy and needs this) -- I just want to understand where the gap is before I make the change.

Agreed (: However, since you'd introduced # marker the problem is solved by itself: [#][Z][Z] has no meaning and MAY be omitted (as per #50) (;

Ah! I actually missed this detail on #50, that in the case of #ZZ, the 3-byte block can be omitted. I'll update the proposal above.

Steve132 commented 10 years ago

I don't think this simplifies anything. You still have to read the header dynamically based on its content. You still have to have logic to decide whether or not its a static typed or constant-length or neither. You still have to do dynamic allocation based on what you find. This doesn't seem to make decoder logic simpler, and it has a size cost.

kxepal commented 10 years ago

It's hard to say for sure, since may be first array is correct and the second one miss the third element.

Alex, my apologies I am just not seeing this confusion, here is the example below:

... snip ...

I know I must be missing your point and absolutely agree with the fundamental of what you are proposing (safer structure, better/accurate error reporting, real data in real world is sloppy and needs this) -- I just want to understand where the gap is before I make the change.

Ok, I'll try to make it clean. So here is our case:

> 1. [[][Z][i][2]
> 2.   [[][Z][i][3]
> 3.     [T]
> 4.     [T]
> 5.     [[][Z][Z]
> 6.       [F]
> 7.    []]

Of course, this is how the parser will see this data. And you're right - from the technical side it cannot see it in other way. However, this makes it blind to see the next situation:

> 1. [[][Z][i][2]
> 2.   [[][Z][i][3]
> 3.     [T]
> 4.     [T]
> 5.   [[][Z][Z]
> 6.     [F]
> 7.   []]

Parser will not able to see this issue - it will follow the logic you described and assert with "expected additional element @ byte offset 16". But it will be wrong since the actual issue is @ byte offset 11. With closing markers the parser will never get fooled:

1. [[][Z][i][2]
2.   [[][Z][i][3]
3.     [T]
4.     [T]
5.   []]
6.   [[][Z][Z]
7.     [F]
8.   []]
9. []]

And it will always raise an exception for the right offset.

You probably aware about common XML issue: it's hard to guess which end tag is missed for deeply nested structure: you have to apply some heuristic to understand the data and guess the issue location, but it's hard. Just closing tag or just element counter cannot overcome this issue alone, but together they makes it found with easy.

kxepal commented 10 years ago

@Steve132 To answer on your questions:

I don't think this simplifies anything. You still have to read the header dynamically based on its content.

The very first thing it simplifies is the code - same rules for each, no exceptions. The simple code tends to be clean, consistent and predictable - JIT will be happy to optimize it while smart compiler could predict more operations and save the time. And if implementation is clean, then so format is. Reading header in single shot is seductive, but it's not the relevant part of decoding process.

You still have to have logic to decide whether or not its a static typed or constant-length or neither

On reading the header you set up the decoding context. The state machine which you'll feed with payload and which will emit the result when the object is done. I believe, you're already have the same for handling plain old containers, but as a bit more simple one: just read till the end marker.

You still have to do dynamic allocation based on what you find.

That's roughly impossible to avoid. Different string sizes, different numbers (huge ones also known and used as decimals), different containers. However, for some of these types you can preallocate base structure to handle the rest.

This doesn't seem to make decoder logic simpler, and it has a size cost.

As I said, it standardizes the way how the each part of the format works which leads to the clean implementation whatever language you uses. But indeed, all things has they price and the few bytes overhead is worth it. The primary use case of optimized containers will be to fix the array-of-object problem: it's very common, well know and optimization of it will gain a huge size benefit so you can easily close your eyes on few more bytes which you could save - in comparison with the overall profit they are irrelevant.

breese commented 10 years ago

Another advantage of making the closing marker mandatory is that the parser is not required to carry mutable state around. When the parser encounters the count, it can be used to allocate storage (where feasible) and then be discarded without affecting the subsequent parsing.

Steve132 commented 10 years ago

The very first thing it simplifies is the code - same rules for each, no exceptions. The simple code tends to be clean, consistent and predictable.

As I said, it standardizes the way how the each part of the format works which leads to the clean implementation whatever language you uses

As opposed to the status quo? The status quo is fine and simple and standardized already

Here is the code for the current system.

static inline void priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);
    if (nextchar == '$')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar = priv_ubjr_context_peek(ctx);
    }
    else
    {
        *typout = UBJ_MIXED;
    }

    if (nextchar == '#')
    {
        *sizeout = priv_ubjw_read_integer(ctx);
    }
    else
    {
        *sizeout = 0;
    }
}

This is the implementation as it would be for this proposal:

static inline void priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);
    if (nextchar == '#')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar=priv_ubjr_context_peek(ctx);
        if(nextchar != 'Z')
        {
            *sizeout = priv_ubjr_read_integer(ctx);
        }
        else
        {
            priv_ubjr_context_getc(ctx);
        }
    }
    else
    {
        *typout = UBJ_MIXED;
        *sizeout=0;
    }
}

It's not significantly simpler, nor does it really simplify any corner cases. I don't really see it as 'cleaner' although I guess that's subjective. I don't see how it would be inherently easier for a compiler or JIT runtime to 'optimize'.

However, maybe its possible that you are thinking of some other way to implement the two functions that more clearly shows the difference that you are envisioning. Can you show me with code?

ghost commented 10 years ago

@kxepal Thank you for the clarification - I see your point now, the reason I was missing it is that I always assumed in your example that the internal [[] was meant to be a member of the 2nd array (and it wasn't - you were meaning to show that the second array did NOT have 3 elements) -- so by clarifying your point, you made me realize I was parsing this example incorrectly the entire time... thusly proving your point that closing chars are important :)

@kxepal [QUESTION] You mentioned "The primary use case of optimized containers will be to fix the array-of-object problem: it's very common, well know and optimization of it will gain a huge size benefit" -- what did you mean here? I don't see a huge size benefit using this in an array-of-objects... you'd have one header with a type of [{] -- maybe I'm missing something?

Miosss commented 9 years ago

@thebuzzmedia @kxepal

Talk about STC is split across so many issues, both open and closed, that I do not know where to write my input.

After reading most of the discussions I have my own thoughts. In the beginning it is important to note, that in strongly-typed languages, the JSON objects are the biggest problem (and on the other hand one of the best things for dynamic languages). In C++ they could probably be modeled best as something like this:

std::map<std::string, some_value_type>

What is important to note is that objects, are suitable and most often use, I think, for mixed data, for example:

{
"config": {
"mix": 1,
"title": "Some title",
"best_caption": "Nice trick!",
"last_id": 123986
},
"ip": "127.0.0.1"
}

Even optimized STC construct makes only half good, as each pair's key is still present as a string. I do not see sense in constucting objects like:

{
"data1": 1,
"data2": 2,
...

Of course it is allowed in JSON grammar, but STC idea in general is focused on optimizing common uses (I think) While preserving full logic consistency with JSON. Therefore we should focus on those areas that are used often and make sense for decent, technology aware, users. Construct that seems much better optimizable is array such as:

[1, 2, 3, 4, 5]

And this is probably the case where STC make real sense. Of course following is valid also:

[1, "two", 3, "four, 5.67]

But again, if we do focus on edge-cases than entire STC idea can be thrown away. While we do not want this, lets conclude with some decisions.

We should consider wise use of those containers (unwise usage we still represent as normal mixed containers). Therefore COUNT and TYPE seem as a good idea. I think, that such pairs should be allowed multiple times, more as a SECTION description, than entire container. It would allow for optimization of arrays of consecutive sections of one type. For objects it could be more - inteligent encoder could group key-value pairs by value type and use this section headers (according to JSON, objects are unordered).

Pros:

enables optimized headers for mixed arrays (but still only for "proper" examples)
thus reduces size of a container
allows intelligent encoders to perform values alignment

Cons:

slightly more complicated decoder (has to check whether next element is '$', '#' or type desc), but it does not seem to be much

I do not treat TLV as big parser relief, as @Steve132 does, because I find it not so elegant to cast, lets say, next 100 bytes as array of floats. It complicates the situation in which ubjson is ill formed, is not so portable (endianess issues) etc. For me it is easier in that sense, that decoder parses only next 100 bytes as floats, without reading the type each time.

I have not yet come to conclusion, whether multi dimensional type specification is better or worse in the end... I shall meditate on this more.

And because STC is invention of ubjson only, I think that it is reasonable to put all responsibility for choosing between standard. mixed container and STC, to encoder. More advanced encoder could perhaps analyze entire written object or consecutive values in array, and decide which container style to use. Both are interoperable, so the decision comes mostly to size of created ubjson.

ghost commented 9 years ago

@Miosss I know I am replying to this late - but I agree with your points made.

xcube16 commented 8 years ago

More common to see mixed arrays like this: ["a", 0, "b", 1, "c", 3, "d", 4, "e", 5,...] Kinda rare to see them like this: ["a", "b", "c", 1, 2, 3,...]

In the first case using this optimization would be disastrous! And in the second case we might save a few bytes, but is it really worth it?

ubjson / universal-binary-json

Make container header more structured and consistent in behavior #51

Notes

Array Example

Object Example

Pro

Con