Proposal for a generic, STC like, container-free optimization

adilbaig commented 11 years ago

I am proposing a generic solution here to STC, that is not bound to containers, works across an entire UBJSON document and is fully JSON compatible.

The change is that types are not specifically bound to values, but an indication that all further data is of this type. This basically shifts the parser from being a TLV parser to a T followed by multiple LVs or Vs. Ex:

[i][5]
[i][10]
[i][1000]
[d][1.234]

Becomes :

[i][5]
   [10]
   [1000]
[d][1.234]

Any time we want to "change" the type, we specify a new one and all further types are assumed to be such. Ex:

[i][5]
   [10]
   [1000]
[d][1.234]
   [2.34]
   [3.4]

It also works for string types :

[S][i][5][login]
   [i][7][octocat]
   [i][4][name]
   [i][16][monalisa octocat]

In the worst case scenario (i.e : when values are always different from their previous value) it converts down to valid UBJSON as currently defined. Ofcourse, this optimization applies only to value types with payloads as it offers no advantages to null, noop and bool types.

The proposal also overflows into containers, both arrays and objects. This means that if you start with a TYPE and reach a container it continues with that type, unless the container defines it separately. This makes it slightly easier for driver implementation as you dont need a special if condition to check for containers, just continue with what you have.

//{ name : "octocat", id : 100, followers : 20 }
[{]
    [S][i][4][name][i][7][octocat]
       [i][4][id][i][100]
    [S][i][4][followers][i][20]
[}]

Here is a JSON doc converted :

{"name":"octocat", tags:["java","python","php"]} //48 bytes

// Current UBJSON Draft output
[{]
    [S][i][4][name][S][i][7][octocat]
    [S][i][4][tags]
       [[]
          [S][i][4][java]
          [S][i][6][python]
          [S][i][3][php]
       []]
[}]
// 49 bytes

// Enhanced output
[{]
    [S][i][4][name][i][7][octocat]
       [i][4][tags]
       [[]
          [i][4][java]
          [i][6][python]
          [i][3][php]
       []]
[}]
// 44 bytes

As you can imagine, it doubles up for STC.

[[]
    [i][1][0][1][0][1][0][1][0] ...
[]]

Advantages :

Minimal changes to existing spec.
Not limited to containers.
Value types can be added/updated with no change.
Well suited for large, homogenous, contigous data (no need to introduce any complexity for binary data).
Uncompromised fidelity with JSON.
Easy to parse.
Better compression than before.
Cons :
The ordering of the data can affect the file size, which is a little un-intuitive. This is not a serious problem considering the benefits.

Your thoughts?

breese commented 11 years ago

@adilbaig This will fail for any value that corresponds to a valid UBJSON token. For instance:

[i][5]
   [105]
   [1000]

will be interpreted as

[i][5]
[i][1000]

because 'i' has the ASCII value of 105.

I think that STC is down to two possible solutions. Either the length (or count) must be specified before the data, or there must be an escape mechanism for value that corresponds to tokens (a bit like the reverse solidus in strings.)

adilbaig commented 11 years ago

@breese - Good point, and i agree there's only a couple of ways to solve this. It's tricky to propose a solution that is simple and degrades gracefully to the TLV data types we currently have without introducing additional logic/complexity for STC.

ghost commented 11 years ago

@adilbaig I love the way you are thinking around this solution; the elegance is very attractive.

You mentioned "there are a couple of ways to solve this" -- how would you propose?

adilbaig commented 11 years ago

Here's the simplest one i could think of. I propose introducing TYPE GROUPS, i.e: start and end markers which contain UBJSON elements all of the same type.

The start marker, <, will be followed by a type, then followed by any number of data assumed to be of that type, until the end marker > is received.

[<][i][5]
   [10]
   [1000]
[>]
[<][d][1.234]
   [2.34]
   [3.4]
[>]

And, here's another document :

{"name":"octocat", tags:["java","python","php","perl"]} //55 bytes

// Current UBJSON Draft output
[{]
    [S][i][4][name][S][i][7][octocat]
    [S][i][4][tags]
       [[]
          [S][i][4][java]
          [S][i][6][python]
          [S][i][3][php]
          [S][i][4][perl]
       []]
[}]
// 56 bytes

// Enhanced output
[{]
    [<][S][i][4][name][i][7][octocat]
       [i][4][tags]
       [[]
          [i][4][java]
          [i][6][python]
          [i][3][php]
          [i][4][perl]
       []]
    [>]
[}]
// 52 bytes

Note that chevrons do not have to start and end with container boundaries. They can overflow and change any time, ex :

{ ids : ["1234", 78, 90, 91], ids2 : [901, 1.234, 9.768]}

[{]
    [<][S][i][3][ids]
       [[]
          [i][4][1234]
          [>][<][i] //type change to int
          [78]
          [90]
          [91]
       []]
    [>]
    [S][i][4][ids2] //notice we didn't specify a TYPE GROUP for one string
    [[]
          [i][901] // or one int
          [<][f] //But now, we do
          [1.234]
          [9.768]
          [>] //a start chevron must end explicitly
       []]
[}]

Also notice in the above example, that at ids2 we stopped using chevrons until we came across a group of floats. Type groups are not an all or nothing proposition, they can be used for parts of a document or ignored completely during encoding.

Pros :

Chevrons are an optional optimization.
In case it is not used, any UBJSON document that uses the current TLV style parsing will work.
Parsing is still relatively simple, as there is no concept of scopes in type groups. The current type continues until an end chevron is reached.

Cons :

Encoding optimally requires look ahead to see of the next type is the same as previous and hence it should use a chevron. This is a highly desirable encoding optimization, not a rule.

breese commented 11 years ago

@adilbaig I am sorry to be the nay-sayer once more, but > also has an ASCII value (62). You still need a count or an escape mechanism.

adilbaig commented 11 years ago

@breese - No need to apologize :) . I'm obviously hurrying this.

I like the idea of a count except that it takes away from the streaming abilities of containers. I need to think more on the escape mechanism (my proposal for the type groups was based on this rationale.) Any suggestions?

ghost commented 11 years ago

@adilbaig Just keep brainstorming and see what sticks -- we have been discussing around this idea for over a year, it is a hard nut to crack, these all ended up blending into each other:

Chunked Containers #10
Strongly Typed Containers #13
Type Groups #25

This is something we all keep coming back to in our own way; I think this is a sign that we need/want this to happen and I think the brainstorming is the only way to find it; so love the ideas, keep massaging and keep iterating.

We'll crack this eventually! :)

kxepal commented 11 years ago

Inspired by HTTP chunked transfer logic. I'll use new pair of markers ( and ) to split of Optimized Array variant from regular one. Inline comments are started with #.

[(]           # Starting Opmitized Array
    [i][i][4]    # Signing, that next data is 4 `int8` values
    [1]
    [2]
    [3]
    [4]       # We'd read the last announced value, what's going next?
    [S][i][i][3] # Oh, three strings with `int8` size ahead!
    [3][foo]  # String length + value
    [3][bar]
    [3][baz]
    [C][i][6]    # Time for chars
    [A]
    [B]
    [C]
    [D]
    [E]
    [F]
[)]           # We'd got everything that we can.

If it was regular array, his size was:

2 - for markers
2 * 4 - for first integers
(3 + 3) * 3 - next three strings
2 * 3 - for chars
Total: 2 + (2 * 4) + (3 + 3) * 3 + 2 * 6 = 40 bytes

With new shiny optimized array:

2 - for markers
3 - for first chunk
4 - for int8 values
4 - for second chunk
(1 + 3) * 3 - for string values
3 - for last chunk
6 - for chars
Total: 2 + 3 + 4 + 4 + (1 + 3) * 3 + 3 + 6 = 34 bytes

We'd saved...6 bytes for this small array object. Not much, since we're only saving for repeated markers so it's about +1 byte per numbers/chars and +2 bytes per string/huge values with in array and -2 byte per type fragment. There is also no profit if fragment contains 3 values and the worst case is it contains 1 or 2 values:

[(]           
    [i][i][1]
    [1]
    [C][i][1]    
    [A]
    [i][i][1]
    [2]
    [S][i][i][1] 
    [3][foo]
    [i][i][2]
    [3]
    [4]       
    [S][i][i][1] 
    [3][bar]
    [C][i][1]    
    [B]       
    [S][i][i][1] 
    [3][baz]
    [C][i][4]    
    [C]
    [D]
    [E]
    [F]
[)]           

Total: 54 bytes (sic!)

Best case for Optimized Array is a STC where all values has same UBJSON type. Common use case is when data has low type fragmentation. In real world most array cases tries to prefer some single data type, so STC will rock for most cases, while Optimized Array may help with large sequences of numbers, especially if they are quite sorted and doesn't fits some singe UBJSON type (e.g. range from 0 till 100500: we hit's three int* types there, but for JSON they are all just a numbers).

P.S. Just realized that this is issue #10 done right (: Anyway, thoughts? UPD. I forgot fragment size marker, numbers are recalculated.

ghost commented 11 years ago

@kxepal so basically a combination of sized-chunked-STC-like containers?

Do you see this as a replacement to STC, or a compliment to Array's and we STILL do STC because they are unsized like Arrays?

kxepal commented 11 years ago

@thebuzzmedia

Yes, it looks as hybrid of STC and chunked containers. No, it's not a replacement of STC. STC looks as an special case of such optimized array. In other words: Array with data of mixed types => Optimized Array Array with single typed data => STC

ghost commented 11 years ago

I think it has a bit more parsing complexity around it, but otherwise think it is elegant.

I could technically stick binary (chunks of BYTE) AND non-binary data in this same Optimized Array couldn't I?

kxepal commented 11 years ago

In any way, I don't like to stand for my approach since I feel it tries to optimize some special cases suffering processing speed while it also has one big flaw as data fragmentation. But hope it may help us invent something new and really helpful.

kxepal commented 11 years ago

@thebuzzmedia

I could technically stick binary (chunks of BYTE) AND non-binary data in this same Optimized Array couldn't I?

Technically, you can since it still same array where data is grouped by their types. Few groups with a lot of members gives more profit against regular array, but makes him looks more close to STC.

ghost commented 11 years ago

@adilbaig We are discussing a minor modification to your idea here over here: https://github.com/thebuzzmedia/universal-binary-json/issues/27#issuecomment-18500979

If you have a chance to weight in on your thoughts I'd appreciate it.

ghost commented 10 years ago

@adilbaig - I want to let you know that I still like this original proposal.

I am closing this bug as WONTFIX because all the ideas for optimized containers/binary support/etc. were distilled down into what is now defined here with Draft 10 - http://ubjson.org/type-reference/container-types/#optimized-format

adilbaig commented 10 years ago

@thebuzzmedia - Thx. It's looking good.

ubjson / universal-binary-json