ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.
116 stars 12 forks source link

Byte Length Parser Hint #92

Open edwardaskew opened 6 years ago

edwardaskew commented 6 years ago

TL;DR I think you should add a marker type to describe the byte length of arbitrary elements to allow for lazy parsing of complex data. Also as I'm new to ubjson please forgive any misunderstandings in how things currently work.

So first a little background on what led me here:

I found ubjson while looking for ways of encoding large data sets of geometric features. GeoJSON is a nice, clean, simple way of doing this but unfortunately it starts to run into performance issues when parsing very large data sets of complex geometries.

Importantly to this story, GeoJSON is typically made up of an array of Features which have both properties (general key-value pairs used to describe non-geometric aspects of the Feature) and geometric data attached. The geometric part is made up of a (potentially) highly nested mess of arrays which boils down at the leaf nodes to 2 (or 3 for 3d) member arrays of numbers as coordinates. Standard JSON parsers seem to particularly struggle with this structure, I think because they often (over-)optimistically allocate memory for arrays which causes a lot of memory churn when dealing with a lot (it's not terribly unusual for me to deal with 1000+ coordinate geometries) of 2 element arrays.

This is particularly galling for me as often I only want to deal with the geometries of a subset of the original features filtered by their properties so much of this expensively parsed geometry gets discarded anyway. Ubjson strongly typed arrays look to be a brilliant fit for this kind of data which could drastically help the parsing effort but it still leaves me wondering why parse geometries that are not going to be used in the first place.

My suggestion is to add a valueless value (like NOOP currently is) with a length which represents the length in bytes of the next value (including the type byte and, if applicable, that value's length) to allow a parser that so wishes to store a reference to that number of bytes for later on-demand parsing and continue with the rest of the data. My suggestion is to us [!] (0x21) as the type marker to clearly distinguish it from normal "valuable" values although this isn't vital to the proposal.

Benefits

As described, the primary benefit to me would be the ability to write a lazy parser for a ubGeoJSON (GeoUBJSON?) format although the ability to mark out byte lengths of chunks of data could be useful in all sorts of other situations (ubjson B-Trees anyone? ☺️)

Changes required to current parsers

Current parsers would only have to be modified to ignore such a marker and its length, the mark is there as an optimisation hint only and should in no way effect the fully parsed document. Modifying parsers to provided length hinted values lazily on demand is left as an exercise for the reader.

Changes required to current serializers

None, the length hints are entirely optional on serialisation.

Edge cases

Example

[!][u][20]
[[]
    [D][0.1234]
    [D][32.434]
[]]

Represents exactly the same thing as

[[]
    [D][0.1234]
    [D][32.434]
[]]

but without the need to parse to the closing []] to determine the location of the end of the array (clearly that would be trivial in this case but you get the picture hopefully).

Steve132 commented 4 years ago

I don't understand the advantage here. Wouldn't an implementation provide this as a STC or nd-array which would allow the same perfomance benefit?

I don't like encoding things with special meanings for the state machine of the parser.