ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Allow strongly typed arrays to contain strongly typed arrays #43

Closed edgar-bonet closed 10 years ago

edgar-bonet commented 10 years ago

Hi!

The current (2014-05-23) version of the specification seems to only allow primitive types (called “value” types) inside strongly typed arrays. As an enhancement, I am requesting that strongly typed arrays be acceptable as elements of strongly typed arrays. This would provide optimization for multidimensional arrays, without resorting to any new syntactic construct.

Example

I have (I really do) a few huge JSON files, each holding, among other things, a thousand arrays that look like this:

[
    [1.23, 4.56],  // this is one datapoint
    [7.89, 0.12],  // another datapoint
    // a few thousand more datapoints...
]

I am looking for a more efficient, JSON-compatible, binary representation of the same data. “Unoptimized” UBJSON yields:

[[]
    [[] [d][1.23] [d][4.56] []]
    [[] [d][7.89] [d][0.12] []]
    // a few thousand more datapoints...
[]]

With this representation, each datapoint costs 8 bytes of data (float32 is enough precision for me), plus 4 bytes of overhead. That's 50% overhead, not so good.

The same array as “optimized” UBJSON is:

[[]
    [[][$][d][#][i][2] [1.23] [4.56]
    [[][$][d][#][i][2] [7.89] [0.12]
    // a few thousand more datapoints...
[]]

Now we have 8 bytes of data + 6 bytes of overhead per datapoint. That's 75% overhead, so the optimization is obviously not good for these small inner arrays.

Per the current proposal, the outer array can also be optimized, which yields the following “recursively optimized” UBJSON:

[[]                 // this is an array
    [$][[]          // of arrays
        [$][d]      // of float32
        [#][i][2]   // inner arrays have length 2
    [#][I][3200]    // outer array has length 3200
    [1.23] [4.56]   // first datapoint
    [7.89] [0.12]   // second datapoint
    // a few thousand more datapoints...

Now we have a really optimized layout with zero overhead.
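
The byte counts quoted for the three layouts can be checked with a few lines of arithmetic. This is only a sketch, assuming one-byte markers, float32 payloads, an int8 count for the inner arrays and an int16 count for the outer array, as in the examples above; the variable names are made up for illustration.

```python
# Byte accounting for the three layouts discussed above.
# Assumptions: one-byte markers, float32 values, int8 inner count, int16 outer count.
FLOAT32 = 4        # bytes per float32 payload
POINT_LEN = 2      # values per datapoint
N_POINTS = 3200    # datapoints in the outer array

data = POINT_LEN * FLOAT32                 # payload bytes per datapoint

# Unoptimized: '[' + ('d' + float32) per value + ']'
unoptimized = 1 + POINT_LEN * (1 + FLOAT32) + 1

# Optimized inner array: '[' '$' 'd' '#' 'i' <count> + raw floats
optimized = 6 + data

# Recursively optimized: the header is written once, then raw floats only
header = len(b"[$[$d#i") + 1 + len(b"#I") + 2
recursive_total = header + N_POINTS * data


def overhead(total, payload):
    """Overhead as a fraction of the payload size."""
    return (total - payload) / payload


print(overhead(unoptimized, data))                      # per datapoint
print(overhead(optimized, data))                        # per datapoint
print(overhead(recursive_total, N_POINTS * data))       # whole array
```

In the recursive layout the only overhead is the one-time header, so the per-datapoint cost tends to zero as the outer array grows.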

And importantly, we are not introducing any new syntax, but only specifying that the “type marker” of a strongly typed array is:

[type of array] = [[][$][type of elements][#][length of array]

In the above example, the type marker of the outer array ("[$[$d#i<2>#I<3200>" for short) would be recursively parsed as:

level 0: [$ ┐                   ┌→ #I<3200> = array of length 3200
level 1:    └→ [$ ┐    ┌→ #i<2> ┘           = arrays of length 2
level 2:          └→ d ┘                    = float32
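
The recursive reading of the type marker could be sketched as follows. This illustrates the proposal as written in this comment, not the ratified spec; `parse_type`, `payload_size` and the `SCALARS` table are hypothetical names, and only a handful of fixed-size numeric types are handled.

```python
import struct

# Fixed-size value types: marker -> (struct format, size in bytes).
# UBJSON numbers are big-endian; this is only a subset, for illustration.
SCALARS = {
    b"i": (">b", 1), b"U": (">B", 1), b"I": (">h", 2),
    b"l": (">i", 4), b"d": (">f", 4), b"D": (">d", 8),
}
COUNT_MARKERS = (b"i", b"U", b"I", b"l")


def parse_type(buf, pos=0):
    """Parse one type marker at buf[pos]; return (type, next_pos).

    A type is either a scalar marker such as b"d" or a tuple
    ("array", element_type, length).
    """
    marker = buf[pos:pos + 1]
    if marker in SCALARS:
        return marker, pos + 1
    if marker != b"[" or buf[pos + 1:pos + 2] != b"$":
        raise ValueError(f"unsupported type marker at offset {pos}")
    # Recurse: the element type may itself be a strongly typed array.
    elem_type, pos = parse_type(buf, pos + 2)
    if buf[pos:pos + 1] != b"#":
        raise ValueError(f"expected '#' at offset {pos}")
    count_marker = buf[pos + 1:pos + 2]
    if count_marker not in COUNT_MARKERS:
        raise ValueError(f"bad count type at offset {pos + 1}")
    fmt, size = SCALARS[count_marker]
    (length,) = struct.unpack_from(fmt, buf, pos + 2)
    return ("array", elem_type, length), pos + 2 + size


def payload_size(t):
    """Number of raw payload bytes that follow a fully typed header."""
    if isinstance(t, bytes):
        return SCALARS[t][1]
    _, elem_type, length = t
    return length * payload_size(elem_type)


# The header from the example: [$[$d#i<2>#I<3200>
header = b"[$[$d#i" + struct.pack(">b", 2) + b"#I" + struct.pack(">h", 3200)
parsed, end = parse_type(header)
print(parsed, end, payload_size(parsed))
```

Because every level has a fixed element type and length, the size of the payload follows from the header alone, which is what allows the body to be raw floats with no markers.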

Regards, and thanks for the good job.

AnyCPU commented 10 years ago

Are there real use cases in production that use the new additions?

ghost commented 10 years ago

@AnyCPU can you clarify what you mean? which new additions specifically? (no one is using anything from this proposal or #48 because they haven't been ratified yet)

kxepal commented 10 years ago

Hey guys, I tried to follow the discussion for the whole day, but I'm really lost about what problems you're discussing and which points of the proposal they relate to. How about someone submits the STC proposal as a PR, so we can discuss it paragraph by paragraph and localize each issue instead of applying it to the whole specification text?

ghost commented 10 years ago

@kxepal It is a long discussion, let me summarize:

  1. Most everyone is on board with the idea of STCs containing other STCs - in fact one of the implementations already supports it by accident :)
  2. There is a mini-proposal to change the ordering of the $ and # markers, but @meisme and @edgar-bonet have opposing justifications and haven't converged on this, so I'm not changing anything.
  3. Taking this proposal and adding your idea of repeatability of a header is what spun off #48, but there is some discussion about that stuck in here as well.

Guys, please summarize anything I missed.

Overall it looks like opening the doors to having STCs contain STCs will most likely happen - and the shape of #48 may morph a little because of @Steve132's recent feedback... see my latest comment there - https://github.com/thebuzzmedia/universal-binary-json/issues/48#issuecomment-49348478

AnyCPU commented 10 years ago

I'm asking where they came from. Before proposing new additions, have they been tested in production?

As I understand it, UBJSON is more of a data storage format than a structural/container format.

With a strongly typed array that has only one type but contains multiple inner arrays, parsing speed is still reduced, because such an array has to be parsed in any case, for example when skipping or seeking some object.

Storage must be as simple as possible.

ghost commented 10 years ago

With a strongly typed array that has only one type but contains multiple inner arrays, parsing speed is still reduced, because such an array has to be parsed in any case, for example when skipping or seeking some object.

Great point, in the DB storage format, you are right, if you are seeking on disk and hit a record that you want to seek past - the current form of UBJSON would need to be augmented with a DB-centric format that would have the complete structure size, in bytes, so it could be skipped over.

I don't think this is a shortcoming of UBJSON necessarily, I think this can be easily augmented with a DB-specific format similar to how systems like CouchDB augment standard JSON payloads with their "_name" params.

In short, what I'm saying is that regardless of what we figure out here, that doesn't necessarily make that problem unsolvable.
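
One way such a DB-specific wrapper might look is a plain length prefix around each UBJSON payload, so a reader can seek past a record without parsing it. This is a purely hypothetical envelope, not part of UBJSON or of any existing database; the 4-byte big-endian prefix and the function names are assumptions made for the sketch.

```python
import io
import struct


def write_record(stream, payload: bytes) -> None:
    """Write a 4-byte big-endian length, then the opaque UBJSON payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def skip_record(stream) -> int:
    """Seek past one record without parsing it; return the payload size."""
    (size,) = struct.unpack(">I", stream.read(4))
    stream.seek(size, io.SEEK_CUR)
    return size


def read_record(stream) -> bytes:
    """Read one record and return its raw UBJSON payload."""
    (size,) = struct.unpack(">I", stream.read(4))
    return stream.read(size)


store = io.BytesIO()
write_record(store, b"[i\x01i\x02]")   # UBJSON for [1, 2]
write_record(store, b"Si\x02hi")       # UBJSON for "hi"

store.seek(0)
skipped = skip_record(store)           # jump over the first record
second = read_record(store)            # land directly on the second one
print(skipped, second)
```

The payload stays ordinary UBJSON; only the storage layer knows about the prefix, which keeps the skipping concern out of the format itself.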

ghost commented 10 years ago

SUMMARY (July 19, 2014)

ghost commented 10 years ago

This discussion thread ended up being very valuable - a lot of good ideas in here and I'll continue to break out sub-ideas into other proposals but tie up this discussion for the sake of moving the spec forward.

ACCEPTED - http://ubjson.org/type-reference/container-types/#optimized-format

Added to the spec the ability to define a container marker as a value for $ - easiest change.

We can pick up the discussion around headers/schemas/etc. in another proposal.

ghost commented 10 years ago

I should have added this comment when closing this issue - addressing @edgar-bonet original example:

[
    [1.23, 4.56],
    [7.89, 0.12]
]

this would now look like:

[[][$][[][#][i][2]
    [[][$][d][#][i][2][1.23][4.56]
    [[][$][d][#][i][2][7.89][0.12]

The change in this bug addressed the purpose/title of the bug (simply remove the limitation of STC not containing STC) but it did not address the proposal that @edgar-bonet had inside his recommendation and that is a way to efficiently define N-dimensional arrays.

There are 2 concerns that fell out from this discussion that still need to be addressed separately:

  1. N-dimensional arrays - I am not sure I want a special/different construct for these or if they can be defined with a more generic format that applies to Arrays in general.
  2. Object and Array "schemas" defined in the header of the container, so that the payload for the container and subsequent nested containers is just raw data (carried into #48 and #50)