microsoft / bion

Binary object notation - a standard for representing JSON in a compact, efficient format
MIT License

But why...? #14

Closed: AlgorithmsAreCool closed this issue 4 years ago

AlgorithmsAreCool commented 4 years ago

I don't intend to insult anyone's efforts, but binary JSON formats seem like a very well-trodden path.

And since the README doesn't provide much context, it seems confusing.

If this is coming out of Microsoft Research, is there a paper describing it?

If this is an engineering initiative, then how does it improve on existing formats such as BSON, CBOR, and MSGPACK?

ScottLouvau commented 4 years ago

BION is early work, so there's not much documentation in place yet. There was a format description in the wiki, but it looks like it's gone. I'll have to figure out how to restore it. :/ BION isn't a Microsoft Research project.

BION has a few unique design features. It uses illegal UTF-8 bytes (0xF5 to 0xFF) for all structural tokens. Strings and containers may be length-prefixed or terminated with an illegal UTF-8 byte. Terminators allow writing values when the length isn't known in advance, while the reserved bytes still make finding the end of things extremely fast. This design also means that string searches (whole values, partial values, and propertyName : value) can be run directly against the file bytes without having to parse and interpret the file, and the structural tokens ensure you can parse backwards from a match to identify where in the document it was found.
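To make that concrete, here's a minimal C# sketch of the search idea (the file name and match handling are invented for illustration; this is not the BION API). Because every byte in the 0xF5-0xFF range is illegal in UTF-8, a plain byte-level search for a UTF-8 needle can only match inside string content, never across structural tokens:

```csharp
using System;
using System.IO;
using System.Text;

class RawSearchSketch
{
    static void Main()
    {
        // Hypothetical file name; any BION document would do.
        byte[] fileBytes = File.ReadAllBytes("log.bion");
        byte[] needle = Encoding.UTF8.GetBytes("error");

        ReadOnlySpan<byte> haystack = fileBytes;
        int offset = 0;
        while (true)
        {
            // Plain byte search; no parsing or decoding of the document.
            int index = haystack.Slice(offset).IndexOf(needle);
            if (index < 0) break;
            offset += index;
            Console.WriteLine($"Match at byte offset {offset}");
            // A real reader could now scan backwards over structural tokens
            // (bytes >= 0xF5) to recover the enclosing path of the match.
            offset += needle.Length;
        }
    }
}
```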

BION specifies an optional container index, which describes the overall structure of the document and lets readers select or skip subtrees freely during reading. Readers can partially load large documents, dynamically deciding what to defer or skip based on subtree size, all without a full parse.
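Here's a hypothetical sketch of how such an index could be consumed; the entry layout and names are invented for illustration, since the actual index format isn't documented here:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Invented shape for an index entry: where a container's bytes live in the file.
record ContainerEntry(string Path, long Start, long End);

class IndexedReaderSketch
{
    // Skip any subtree larger than maxBytes; read the rest directly.
    static void LoadSelectively(Stream file, List<ContainerEntry> index, long maxBytes)
    {
        foreach (ContainerEntry entry in index)
        {
            long size = entry.End - entry.Start;
            if (size > maxBytes)
            {
                // Defer this subtree: its bytes are never read or parsed.
                Console.WriteLine($"Deferring '{entry.Path}' ({size:N0} bytes)");
                continue;
            }
            file.Seek(entry.Start, SeekOrigin.Begin);
            byte[] subtree = new byte[size];
            file.ReadExactly(subtree);   // .NET 7+; loop over Read() otherwise
            // ... hand the subtree to a normal parser ...
        }
    }
}
```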

The current active work is on the "BSOA" offshoot of BION. If you have a consistent schema (a fixed set of object types, with properties which always have the same type), you can use BSOA. BSOA writes values in columns as blocks, so it's closer to formats like Apache Arrow. BSOA provides:

  - gigabyte-per-second read and write speeds from a single thread
  - no data conversion to the in-memory representation
  - a familiar model where the schema travels with the data, like JSON
  - the ability to add or omit specific columns or tables from an on-disk log
  - a friendly object model which looks like a normal C# object model but reads and writes the columnar data underneath, with full mutability
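For readers unfamiliar with the struct-of-arrays pattern that last point describes, here's a generic C# sketch (invented types, not the BSOA API) of an object facade over columnar storage:

```csharp
using System.Collections.Generic;

// All rows' values for each property live in one contiguous column,
// so bulk reads and writes touch one array per property.
class LogEventTable
{
    public List<long> Timestamp = new();
    public List<string> Message = new();

    public LogEvent Add(long timestamp, string message)
    {
        Timestamp.Add(timestamp);
        Message.Add(message);
        return new LogEvent(this, Timestamp.Count - 1);
    }
}

// Looks like a normal object, but each property reads and writes
// the shared columns underneath.
readonly struct LogEvent
{
    private readonly LogEventTable _table;
    private readonly int _index;

    public LogEvent(LogEventTable table, int index)
    {
        _table = table;
        _index = index;
    }

    public long Timestamp
    {
        get => _table.Timestamp[_index];
        set => _table.Timestamp[_index] = value;   // mutates the column in place
    }

    public string Message
    {
        get => _table.Message[_index];
        set => _table.Message[_index] = value;
    }
}
```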

AlgorithmsAreCool commented 4 years ago

Thank you for answering, it sounds like a very cool project.

I noticed it under the Microsoft github organization and got curious.

Happy Coding.

ScottLouvau commented 4 years ago

No problem. =) Are you looking for something to load or process large JSON data with?

AlgorithmsAreCool commented 4 years ago

Hmm, I have a lot (many TB) of semi-structured, zipped log data that needs a more efficient storage format. I have a bespoke parser that works pretty well after dozens of hours of work, but would be happy to ditch it for something off the shelf.

My primary concerns for the storage are:

  1. Compact Storage
  2. Fast bulk read perf
  3. Decent bulk write perf

A nice-to-have would be the ability to skip columns, but that is a performance concern, so it's subordinate to point 2 above. I would need to keep the data compressed at rest. Integrated compression would be "neat", but I really just need to be able to read the data in as a stream (as opposed to a mapped file or something).

AlgorithmsAreCool commented 4 years ago

Actually, re-reading your comment, being able to search in place would be a killer feature for my use case, hmm...