Closed edgar-bonet closed 10 years ago
Are there real production use cases that use the new additions?
@AnyCPU can you clarify what you mean? which new additions specifically? (no one is using anything from this proposal or #48 because they haven't been ratified yet)
Hey guys, I tried to follow the discussion for the whole day, but I'm really lost about which problems you're trying to discuss and which points of the proposal they relate to. How about someone submits the STC proposal as a PR, so we can discuss it paragraph by paragraph and localize the issues instead of applying them to the whole specification text?
@kxepal It is a long discussion, let me summarize:
Guys, please summarize anything I missed.
Overall it looks like opening the doors to having STCs contain STCs will most likely happen - and the shape of #48 may morph a little because of @Steve132's recent feedback... see my latest comment there - https://github.com/thebuzzmedia/universal-binary-json/issues/48#issuecomment-49348478
I'm asking where they came from. Before being proposed as new additions, have they been tested in production?
As I understand it, UBJSON is more of a data storage format than a structural/container format.
With a strongly typed array that has only one type but contains multiple inner arrays, parsing will still be slow, because such an array has to be parsed in every case, for example when skipping or seeking past some object.
Storage must be as simple as possible.
With a strongly typed array that has only one type but contains multiple inner arrays, parsing will still be slow, because such an array has to be parsed in every case, for example when skipping or seeking past some object.
Great point, in the DB storage format, you are right, if you are seeking on disk and hit a record that you want to seek past - the current form of UBJSON would need to be augmented with a DB-centric format that would have the complete structure size, in bytes, so it could be skipped over.
I don't think this is a shortcoming of UBJSON necessarily, I think this can be easily augmented with a DB-specific format similar to how systems like CouchDB augment standard JSON payloads with their "_name" params.
In short, what I'm saying is that regardless of what we figure out here, that doesn't necessarily make that problem unsolvable.
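To make the "augment with a DB-specific format" idea concrete, here is a minimal sketch, assuming a hypothetical framing where each record is stored as a 4-byte big-endian length followed by the opaque UBJSON payload. The function names and the framing itself are illustrations only; nothing like this is part of UBJSON or of any existing database.

```python
import io
import struct

def write_record(stream, payload: bytes) -> None:
    # Hypothetical framing: 4-byte big-endian payload size, then the payload.
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def skip_record(stream) -> int:
    # Read only the size prefix, then seek past the payload without parsing it.
    (size,) = struct.unpack(">I", stream.read(4))
    stream.seek(size, io.SEEK_CUR)
    return size

def read_record(stream) -> bytes:
    # Read the size prefix, then return the payload bytes untouched.
    (size,) = struct.unpack(">I", stream.read(4))
    return stream.read(size)

buf = io.BytesIO()
# First record: an optimized array of two float32 values (6-byte header + 8 data bytes).
write_record(buf, b"[$d#i\x02" + struct.pack(">ff", 1.23, 4.56))
# Second record: arbitrary bytes standing in for another UBJSON value.
write_record(buf, b"second")

buf.seek(0)
skipped = skip_record(buf)   # jumps over the first record in one seek
second = read_record(buf)    # lands exactly on the second record
```

The reader never looks inside the first payload, so the cost of skipping does not depend on how deeply the arrays in it are nested.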
This discussion thread ended up being very valuable - a lot of good ideas in here and I'll continue to break out sub-ideas into other proposals but tie up this discussion for the sake of moving the spec forward.
ACCEPTED - http://ubjson.org/type-reference/container-types/#optimized-format
Added to the spec the ability to define a container marker as a value for $ - easiest change.
We can pick up the discussion around headers/schemas/etc. in another proposal.
I should have added this comment when closing this issue - addressing @edgar-bonet's original example:
[
[1.23, 4.56],
[7.89, 0.12]
]
this would now look like:
[[][$][[][#][i][2]
[[][$][d][#][i][2][1.23][4.56]
[[][$][d][#][i][2][7.89][0.12]
The change in this bug addressed the purpose/title of the bug (simply removing the limitation that an STC cannot contain an STC), but it did not address the proposal inside @edgar-bonet's recommendation: a way to efficiently define N-dimensional arrays.
There are 2 concerns that fell out from this discussion that still need to be addressed separately:
Hi!
The current (2014-05-23) version of the specification seems to only allow primitive types (called “value” types) inside strongly typed arrays. As an enhancement, I am requesting that strongly typed arrays be acceptable as elements of strongly typed arrays. This would provide optimization for multidimensional arrays, without resorting to any new syntactic construct.
Example
I have (I really do) a few huge JSON files, each holding, among other things, a thousand arrays that look like this:
I am looking for a more efficient, JSON-compatible, binary representation of the same data. “Unoptimized” UBJSON yields:
With this representation, each datapoint costs 8 bytes of data (float32 is enough precision for me), plus 4 bytes of overhead. That's 50% overhead, not so good.
The same array as “optimized” UBJSON is:
Now we have 8 bytes of data + 6 bytes of overhead per datapoint. That's 75% overhead, so the optimization is obviously not good for these small inner arrays.
Per the current proposal, the outer array can also be optimized, which yields the following “recursively optimized” UBJSON:
Now we have a really optimized layout with zero overhead.
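The byte counts above can be checked with a small sketch. It assumes a datapoint is a pair of float32 values (the sample values 1.23 and 4.56 are placeholders) and takes the count of 3200 datapoints from the type marker quoted later in this post; the "recursively optimized" header follows this proposal, not the accepted spec.

```python
import struct

def float32(value: float) -> bytes:
    # UBJSON numbers are big-endian; a float32 payload is 4 bytes.
    return struct.pack(">f", value)

pair = (1.23, 4.56)                       # one datapoint: two float32 values
data = b"".join(float32(v) for v in pair)  # 8 bytes of actual data

# Unoptimized: '[' + ('d' + float32) for each value + ']'
unoptimized = b"[" + b"".join(b"d" + float32(v) for v in pair) + b"]"

# Optimized: '[' '$' 'd' '#' 'i' <count> followed by the raw float32 values
optimized = b"[$d#i" + bytes([len(pair)]) + data

def overhead(encoded: bytes) -> int:
    # Bytes spent on markers rather than on the float32 payload.
    return len(encoded) - len(data)

# Recursively optimized (per this proposal): a single header for the whole
# outer array, then nothing but raw float32 data.
n_points = 3200
nested_header = (b"[$" + b"[$d#i" + bytes([len(pair)])
                 + b"#I" + struct.pack(">h", n_points))
nested_total = len(nested_header) + n_points * len(data)

print(overhead(unoptimized), overhead(optimized))  # 4 and 6 bytes per datapoint
print(len(nested_header), nested_total)            # 12-byte header, paid once
```

Strictly speaking the nested layout is not free: it pays a 12-byte header once, which amortized over 3200 datapoints is negligible.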
And importantly, we are not introducing any new syntax, but only specifying that the “type marker” of a strongly typed array is:
In the above example, the type marker of the outer array ("[$[$d#i<2>#I<3200>" for short) would be recursively parsed as:
Regards, and thanks for the good job.