ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Add support for a binary type #12

Closed: ghost closed this issue 10 years ago

ghost commented 12 years ago

This has been by far the most requested addition, but my fear of incompatibility has always kept it at bay.

At least I'm filing the request here in case someone has a compelling argument as to why this must go into the spec, compatibility be damned! :)

kxepal commented 12 years ago

You could call me a heretic, but this is a very reasonable request.

Since UBJSON's STRING type holds data in UTF-8 encoding, it is an unfriendly type for binary data - binary can only be stored in it with an additional encoding, e.g. base64. That approach is JSON-friendly, but it inflates the data by about 33% of the original size.

The main problem with base64 is that UBJSON is itself a binary format, so base64-encoding binary data just to embed it inside another binary format is encoding on top of encoding.
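A quick Python sketch of that overhead (the 300-byte payload is arbitrary):

    import base64

    raw = bytes(300)             # 300 payload bytes; the content doesn't matter
    enc = base64.b64encode(raw)  # base64 emits 4 output bytes per 3 input bytes
    print(len(raw), len(enc))    # -> 300 400, i.e. ~33% larger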

Another solution is to store binary data as an array of bytes:

[A]
    [i] [10]
    [i] [123]
    [i] [241]
    ...
[E]

This is a somewhat tricky way, but the container itself adds only 2 bytes of overhead to the stored data. It could be used much like JavaScript's ArrayBuffer type, which serializes to JSON in a similar form.

Since the ARRAY type is unbounded, it may safely be used for chunked reading.
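A minimal Python sketch of this encoding, assuming Draft-9 markers (note that [i] is a signed int8, so bytes above 127 only work if readers agree to reinterpret the stored byte as unsigned):

    def bytes_to_ubjson_array(payload: bytes) -> bytes:
        # [A] ... [E] with one [i] element per payload byte,
        # matching the layout above. Each byte costs 1 marker
        # byte + 1 data byte.
        out = bytearray(b"A")
        for b in payload:
            out += b"i" + bytes([b])
        out += b"E"
        return bytes(out)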

If we decide to add a binary type to UBJSON, we also need to add the chunked strings from issue #10, because binary data is mostly huge buckets of bits. But we couldn't map this type directly to JSON:

Assuming the binary type takes the B marker, which has been free for use since Draft 9:

[B] => base64 encoded string
base64 encoded string => [S]

while arrays can be mapped directly and losslessly.
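A Python sketch of why that round trip is lossy (the function names are mine, not from any spec):

    import base64

    def ubjson_binary_to_json(payload: bytes) -> str:
        # [B] payload -> plain JSON string holding base64 text
        return base64.b64encode(payload).decode("ascii")

    def json_string_to_ubjson(value: str) -> tuple:
        # Coming back, nothing distinguishes this string from any other
        # string, so it must be emitted as [S], not [B].
        return ("S", value)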

UPD: posted too early, sorry

ghost commented 12 years ago

I agree on B for binary, with the same numerical length support as String and High-precision, but the intention was to have it store raw bytes as its payload, much like the raw UTF-8 bytes in a string.

The problem is what that means when moving to JSON and back again... As you pointed out, it must mean base64 encoding the binary payload to JSON and decoding it on the way back into UBJSON.

As you mentioned, I feel like a heretic here as well, since this is one of the big tenets of the spec (totally 100% compatible with JSON).

Maybe the importance of compatibility with JSON is an invented requirement in my head and not actually that important?

AnyCPU commented 12 years ago

Guys, when I hear "binary data" I always imagine it as an array of raw bytes. Every array is an object with data and a counter. So it is like a strongly typed array:

    Arr {
        type  = 'raw byte';
        count = '9999';    // or count = '*';
        data  = { 0x00, 0x01, ..., 0x1199, ... }
    }

kxepal commented 12 years ago

Guys, when I hear "binary data" I always imagine it as an array of raw bytes.

+1, me too, but we could probably make this array of bytes more efficient with typed arrays, so there wouldn't be any reason for new types that break the "JSON" in UBJSON's name.

ghost commented 12 years ago

I might not be understanding the point (please correct me if I am not) -- but the only support for binary I envision is:

    [B][i][37][... raw binary bytes ...]
    [B][L][123891298332][... huge raw byte payload ...]

OR as is being proposed in the other Issue:

    [B][*]
        [B][i][45][... raw bytes ...]
        ... more byte elements ...
    [E]
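A sketch of an encoder for the one-shot form; the markers follow the examples above, while the names and the length-marker selection are my own assumptions:

    import struct

    def write_binary(payload: bytes) -> bytes:
        # [B][i][len][bytes] for short payloads, [B][L][len][bytes]
        # otherwise, mirroring the numeric-length support of STRING.
        # Lengths are big-endian, per UBJSON; only [i] and [L] are
        # handled here for brevity.
        if len(payload) < 128:
            header = b"Bi" + struct.pack(">b", len(payload))
        else:
            header = b"BL" + struct.pack(">q", len(payload))
        return header + payload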

It would have to be a strongly typed element added to the spec (we can't overload the use of ARRAY [[]) because library implementors and callers wouldn't know what they were getting... for the same reason a chunked string is different from an array of string pieces.

Alex, to your point about "so there wouldn't be any reasons for new types that break JSON" -- unfortunately, if we tackle binary in any way, it breaks our compatibility with JSON, but it seems like it might be worth it?

adilbaig commented 12 years ago

unfortunately, if we tackle binary in any way, it breaks our compatibility with JSON

Unfortunately, as with most first-generation formats, JSON is not quite powerful enough. I believe binary data containers will make it into mainstream JavaScript (ArrayBuffer) and JSON will have to adapt to it. Until then, if we want to be perfectly reproducible with JSON (a fine goal), we only have the option of using convention over specification.

I'm going to risk getting shot and say: let's not have binary data blobs. Users of this format have the option of using an array of integers/bytes to represent binary blobs. Yes, it can waste a bit of space, but it transports perfectly. We risk no breakage with JSON, it's not expensive to process, and it handles cases where middleware/proxy systems that consume JSON need to reproduce the data back perfectly.

As to the question of what the binary represents, that's something that should be tackled entirely at the application level (for example, by using metadata key-value pairs in JSON objects).
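One possible shape for such a convention, sketched in Python (the key names are illustrative, not part of any spec):

    import json

    def wrap_blob(payload: bytes) -> str:
        # Tag the int array with application-level metadata so
        # receivers know it represents binary data.
        return json.dumps({"type": "binary", "data": list(payload)})

    def unwrap_blob(text: str) -> bytes:
        obj = json.loads(text)
        assert obj.get("type") == "binary"
        return bytes(obj["data"])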

kxepal commented 12 years ago

Alex, to your point about "so there wouldn't be any reasons for new types that break JSON" -- unfortunately, if we tackle binary in any way, it breaks our compatibility with JSON, but it seems like it might be worth it?

It wouldn't break things if we take typed arrays on board - for JSON they could be transparently transformed into plain arrays; they just wouldn't be as efficient as in UBJSON, and that would be another point in this format's favor.
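In Python terms, the transparent mapping is just:

    payload = b"\x00\x01\xf1"
    as_plain_array = list(payload)    # [0, 1, 241] - a plain JSON array
    restored = bytes(as_plain_array)  # identical bytes on the way back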

I'm going to risk getting shot and say: let's not have binary data blobs. Users of this format have the option of using an array of integers/bytes to represent binary blobs.

Going to agree, since there is a safe alternative. Could you comment on issue #10?

AnyCPU commented 12 years ago

I see no big difference between #10 and #12, because both can be presented in a uniform way. And we don't add a new data type like a blob; it is still the same array, just with an optimized type specifier.

adilbaig commented 12 years ago

still the same array, just with an optimized type specifier.

@AnyCPU: The only issue is that when you convert a byte array to JSON you can represent it as an array of ints (using convention over specification), but when you convert that back, it's an array containing ints, not a blob.

Regarding chunked types, the way it's currently being drafted, adding (or removing) support for binary will be dead simple, which is great. It converts well to JSON because it's just a staging maneuver in the spec, not a new type in JSON. Very cool.

ghost commented 12 years ago
  1. This conversation has primarily moved to Issue 13 (as there are multiple concerns being solved with the proposed addition of a single data structure).
  2. @adilbaig @AnyCPU and @kxepal -- I think I was misunderstanding what you were saying up until now about using "an array of int8s" to represent binary -- I think I am on the same page now and have added my thoughts and concerns to Issue 13.

Awesome conversation around all of this guys, really appreciate it!

-- will leave this bug open until we settle the issue on the other thread.

ghost commented 10 years ago

Added to Draft 10 - documented here: http://ubjson.org/type-reference/container-types/#optimized-format
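For reference, that stores a binary payload as a strongly typed array of uint8, along these lines (five 0xFF bytes shown; see the linked page for the authoritative layout):

    [[][$][U][#][i][5]
        [255][255][255][255][255]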