ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Add support for compact representations of JSON Strings with Base64 content #23

Closed by ghost 11 years ago

ghost commented 11 years ago

A popular pattern for embedding binary objects in JSON is to store their value as a Base64 encoded String according to:

http://tools.ietf.org/html/rfc4648#section-4

What I propose here is an optimized representation for such strings in the 8-bit UBJSON data. (Base64-encoded data consumes 25% fewer 8-bit bytes when decoded.)
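The 25% figure follows from Base64 packing every 3 bytes into 4 characters. A rough sketch of the arithmetic, where the function name `base64_lengths` is made up for illustration:

```python
def base64_lengths(n_bytes: int) -> tuple[int, int]:
    """Return (padded, unpadded) Base64 character counts for n_bytes of data."""
    # With "=" padding the output is always a whole number of 4-char groups.
    padded = 4 * ((n_bytes + 2) // 3)
    # Without padding the output is ceil(4 * n / 3) characters.
    unpadded = (4 * n_bytes + 2) // 3
    return padded, unpadded


for n in (3, 11, 26):
    padded, unpadded = base64_lengths(n)
    print(n, padded, unpadded)
```

For inputs that are not a multiple of 3 bytes the saving is slightly more or less than 25%, depending on whether padding is present.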

Note: The encoding rules below would allow an implementer to take any JSON String with alpha-numeric content which can be interpreted as one of the valid Base64 encodings and record it using the more compact 8bit Base64-decoded representation in the UBJSON data.

There are 3 common formats used for Base64Encoding. I would like to propose the following 3 markers be used for each:

[E] - Base64 [E]ncoding without "=" padding characters
[P] - Base64 encoding with "=" [P]adding characters
[U] - [U]RLsafe Base64 Encoding without "=" padding characters

Encoding:

For JSON Strings containing only characters from the list A-Z, a-z, 0-9, +, and /

[E][numeric-type][length][Base64Decoded-bytes]

For JSON Strings containing only characters from the list A-Z, a-z, 0-9, +, /, and =, where the String length is a multiple of 4 characters and the = character appears only as the last or the last two characters

[P][numeric-type][length][Base64Decoded-bytes]

For JSON Strings containing only characters from the list A-Z, a-z, 0-9, -, and _

[U][numeric-type][length][Base64Decoded-bytes]
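A minimal sketch of how an encoder might pick a marker under these rules. The function name `classify` and the round-trip check are assumptions added for illustration, not part of the proposal. The check guards against strings that merely look like Base64 (for example, an impossible length or stray trailing bits) and would not decode back to the same text.

```python
import base64
import re
from typing import Optional, Tuple

# Character classes from the rules above (digits included, as in RFC 4648).
_STD = re.compile(r"[A-Za-z0-9+/]+")
_STD_PADDED = re.compile(r"[A-Za-z0-9+/]+={1,2}")
_URL = re.compile(r"[A-Za-z0-9\-_]+")


def _pad(s: str) -> str:
    """Add the '=' padding that Python's decoders expect."""
    return s + "=" * (-len(s) % 4)


def classify(s: str) -> Optional[Tuple[str, bytes]]:
    """Return (marker, decoded_bytes) if s can be stored compactly, else None."""
    if _STD.fullmatch(s):
        marker, decode, encode, padded = "E", base64.b64decode, base64.b64encode, False
    elif _STD_PADDED.fullmatch(s) and len(s) % 4 == 0:
        marker, decode, encode, padded = "P", base64.b64decode, base64.b64encode, True
    elif _URL.fullmatch(s):
        marker, decode, encode, padded = (
            "U", base64.urlsafe_b64decode, base64.urlsafe_b64encode, False)
    else:
        return None
    try:
        raw = decode(_pad(s))
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    # Assumed safeguard: only use the compact form if it round-trips exactly.
    back = encode(raw).decode("ascii")
    if not padded:
        back = back.rstrip("=")
    return (marker, raw) if back == s else None


print(classify("SGVsbG8gV29ybGQ"))
print(classify("plain text!"))
```

A string made only of letters and digits satisfies both the [E] and the [U] rule; this sketch simply tries [E] first.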

Decoding:

For marker [E] the JSON String is the Base64 encoding of the Base64Decoded-bytes content without "=" padding characters

For marker [P] the JSON String is the Base64 encoding of the Base64Decoded-bytes content with the "=" padding characters

For marker [U] the JSON String is the Base64url encoding of the Base64Decoded-bytes content without the "=" padding characters
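The decoding side could be sketched as follows, assuming the reader has already parsed the marker and the raw payload bytes. The function name `restore_string` is hypothetical.

```python
import base64


def restore_string(marker: str, raw: bytes) -> str:
    """Rebuild the original JSON String from a proposed [E]/[P]/[U] value."""
    if marker == "E":
        # Standard alphabet, padding stripped.
        return base64.b64encode(raw).decode("ascii").rstrip("=")
    if marker == "P":
        # Standard alphabet, padding kept.
        return base64.b64encode(raw).decode("ascii")
    if marker == "U":
        # URL-safe alphabet ("-" and "_"), padding stripped.
        return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    raise ValueError(f"unknown marker: {marker!r}")


print(restore_string("E", b"Hello World"))
print(restore_string("P", b"Hello World"))
```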

ghost commented 11 years ago

I am confused about the optimization here over just storing a Base64-encoded string as a standard [S]tring element -- because they are UTF-8 encoded, this is effectively the ASCII char set, which is 8bit (1byte per char).

I see this as a specialized container, not as a big storage optimization... am I understanding that correctly or maybe I missed something? It's late on a Friday ;)

ghost commented 11 years ago

This JSON String: "KioqOC1iaXQgQmluYXJ5IENvbnRlbnQqKio"

Currently must be written as this UBJSON: [S][i][35][KioqOC1iaXQgQmluYXJ5IENvbnRlbnQqKio]

But could be written as the following UBJSON using this feature: [E][i][26][***8-bit Binary Content***]

The standard string representation consumes 38 bytes while the compact representation consumes 29 bytes.

Both UBJSON values evaluate back to the original JSON.
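The byte counts can be checked with a short script. This is only a sketch: the [E] marker exists solely in this proposal, and the helper `int8_length` is a made-up name that writes the [i] numeric type followed by a one-byte length, as in the notation above.

```python
import base64

text = "KioqOC1iaXQgQmluYXJ5IENvbnRlbnQqKio"
# Python's decoder needs the "=" padding restored before decoding.
raw = base64.b64decode(text + "=" * (-len(text) % 4))


def int8_length(n: int) -> bytes:
    """Hypothetical helper: the [i] type marker plus a single length byte."""
    if not 0 <= n <= 127:
        raise ValueError("length does not fit in an int8")
    return b"i" + bytes([n])


as_string = b"S" + int8_length(len(text)) + text.encode("ascii")
as_compact = b"E" + int8_length(len(raw)) + raw

print(raw)
print(len(as_string), len(as_compact))
```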

kxepal commented 11 years ago

This looks like a reasonable addition if UBJSON aims to be JSON friendly. The E markers will signal the additional content encoding, helping to transform it back to the original state. However, wouldn't it hit the same problems that came up in issue #11?

ghost commented 11 years ago

@syount A few things. Let's look at your first version of this post:

This JSON String:
"SGVsbG8gV29ybGQ"

Currently must be written as this UBJSON:
[S][i][15][SGVsbG8gV29ybGQ]

But could be written as the following UBJSON using this feature:
[E][i][11][Hello World]

In that case, I don't like two things:

  1. Every library impl needing to carry around a Base64 encoder/decoder functionality in it.
  2. If the client wants to store a standard string encoded as base64 with the additional 25% overhead, I have no problem with that. It is inefficient for sure, but I don't think the libraries/format should decide better than the caller.

Now, to your last post:

This JSON String:
"KioqOC1iaXQgQmluYXJ5IENvbnRlbnQqKio"

Currently must be written as this UBJSON:
[S][i][35][KioqOC1iaXQgQmluYXJ5IENvbnRlbnQqKio]

But could be written as the following UBJSON using this feature:
[E][i][26][***8-bit Binary Content***]

I see the example is about binary content and no longer string content, ok.

When you say 8-bit Binary Content in the last example, what exactly is that content payload? UBJSON doesn't support a binary type (yet), so decoding the Base64 into a binary payload and storing it isn't an option at this time. I think that is why I am confused... maybe because of the discussions around the [B]YTE type and the STC (strongly typed containers) you assumed we could do this now?

If that is the case, is your suggestion, more or less: unwrap inefficient Base64 for binary data and store the raw Binary?

If so, I think this is an interesting idea we should consider further when we get to that point. Just wanted to make sure I understood it.

adilbaig commented 11 years ago

In that case, I don't like two things:

  1. Every library impl needing to carry around a Base64 encoder/decoder functionality in it.
  2. If the client wants to store a standard string encoded as base64 with the additional 25% overhead, I have no problem with that. It is inefficient for sure, but I don't think the libraries/format should decide better than the caller.

I second this opinion. As @kxepal said, it will hit the same problems that were raised in issue #11. As a rule, compression schemes should not be a part of the spec.

@syount - What are your thoughts on #25, a proposal for an efficient way to store any data set (especially binary)?

ghost commented 11 years ago

Closing.

ghost commented 11 years ago

Yes good summary, I'd like to "unwrap inefficient Base64 for binary data and store the raw Binary" in UBJSON.

Unfortunately, by design, gzip/deflate processes data byte by byte. As a consequence, Base64-encoded data (which breaks the original data's 8-bit byte alignment) will almost never compress as well as the original data. The implication here is that any gzipped/deflated UBJSON payloads will be significantly more compact over the wire if they contain the raw Binary rather than the Base64-encoded binary data.
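A small experiment along these lines, using Python's `zlib` (the same deflate algorithm gzip uses). The sample data is synthetic and invented for this sketch, so it shows the effect on one input rather than proving the general claim.

```python
import base64
import zlib

# Made-up, mildly repetitive sample data standing in for a binary payload.
raw = b"".join(
    f"record {i}: value={i * i}\n".encode("ascii") for i in range(2000)
)
encoded = base64.b64encode(raw)

raw_deflated = len(zlib.compress(raw, 9))
b64_deflated = len(zlib.compress(encoded, 9))

print("raw:   ", len(raw), "->", raw_deflated)
print("base64:", len(encoded), "->", b64_deflated)
```

Repeated byte sequences in the raw data land on different offsets within Base64's 3-byte groups, so they turn into different character sequences and deflate finds fewer matches.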

Here's an HTML5 example use-case where this kind of Base64-encoded binary data can be useful: http://simeonvisser.hubpages.com/hub/HTML5-Tutorial-How-To-Use-Canvas-toDataURL

For what it's worth: Free off-the-shelf Base64 encoders/decoders are everywhere. And if one isn't, Base64 encoders/decoders are trivial to implement...

I'll have a look at #11 and #25.

ghost commented 11 years ago

@kxepal I don't think this runs into the same problems raised in #11.

The 3 Base64 encodings outlined above are standard, portable, and unencumbered by IP restrictions: http://tools.ietf.org/html/rfc4648#section-4

While the functionality described does produce a more compact representation for the UBJSON data, it is not a compression scheme per se.

It is an alternate compact encoding for JSON String content in cases when the UBJSON encoder determines that the JSON String type content can be interpreted as Base64 data.

This can be thought of as analogous to the various interchangeable compact encodings available for encoding JSON Numeric type content.

AnyCPU commented 11 years ago

I think that Base64 can be handled at a higher level: the application or a custom protocol. That's a plus, because the transport stays the same and the app level can be changed independently of the transport.

transport == ubjson.