json5 / json5-spec

The JSON5 Data Interchange Format
https://spec.json5.org
MIT License

Recommend an encoding for binary data #35

Open mindplay-dk opened 2 years ago

mindplay-dk commented 2 years ago

Have you thought about some sort of support for embedding binary data? (blobs)

Unicode strings are not generic - not all escape sequences are valid Unicode.

What people typically do, is they encode binary data in base64 format - it's not very efficient or elegant, but probably okay for smaller binary chunks.

I wonder if we can think of something better?

If not, perhaps we could make a recommendation about how binary data should be encoded? Base64 sometimes uses different characters - RFC 4648 defines two Base64 encodings, one of them URL-safe, plus Base32 and Base16 encodings with smaller character sets.

Personally, I like the "URL and Filename safe" variant - in the context of JSON, which will likely be served from URLs a lot of the time, it would be nice if programs could use the same library functions (with the same settings) to reliably decode JSON blobs, query-strings, post-data, etc.
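A minimal Node.js sketch of how the two RFC 4648 Base64 alphabets differ. It assumes a Node.js version whose Buffer accepts the 'base64url' encoding name; the sample bytes are picked only because they land on the two characters that differ between the alphabets.

```javascript
// Three bytes whose 6-bit groups map to indices 62 and 63,
// the only two positions where the alphabets differ.
const bytes = Buffer.from([0xfb, 0xff, 0xfe]);

const standard = bytes.toString('base64');    // alphabet ends in '+' and '/'
const urlSafe = bytes.toString('base64url');  // alphabet ends in '-' and '_'

console.log(standard); // +//+
console.log(urlSafe);  // -__-

// Both strings decode back to the same bytes.
const fromStandard = Buffer.from(standard, 'base64');
const fromUrlSafe = Buffer.from(urlSafe, 'base64url');
console.log(fromStandard.equals(bytes) && fromUrlSafe.equals(bytes)); // true
```

Note that Node.js omits the trailing '=' padding in its 'base64url' output, so a recommendation would probably need to say whether padding is expected.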

What do you think, is this worth touching on in the spec?

Anything that makes JSON and the ecosystem around it more coherent is helpful, in my opinion.

jordanbtucker commented 2 years ago

Definitely. I prefer to use Base64 to encode binary in JSON and JSON5. Other APIs, like Node.js have opted to use arrays of numbers to serialize binary data into JSON. I'm interested in more discussion on the pros and cons of each, and whether there are other viable options as well. For the purpose of interoperability, it may be useful to give a recommendation in the spec based on our collective findings.

mindplay-dk commented 2 years ago

Other APIs, like Node.js have opted to use arrays of numbers to serialize binary data into JSON.

That sounds crazy. 😄

Where did you come across that?

jordanbtucker commented 2 years ago

JSON serialization of Buffer in Node.js:

JSON.stringify(Buffer.from('foobar'), null, 2)
{
  "type": "Buffer",
  "data": [
    102,
    111,
    111,
    98,
    97,
    114
  ]
}

Even stranger is the JSON serialization of TypedArray in JavaScript:

JSON.stringify(Int8Array.from(Buffer.from('foobar')), null, 2)
{
  "0": 102,
  "1": 111,
  "2": 111,
  "3": 98,
  "4": 97,
  "5": 114
}
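For comparison, here is a rough sketch that puts the array-of-numbers form next to a Base64 string and restores the Buffer on parse. The `reviveBuffers` function and the `payload` property are made up for illustration; only `Buffer`, `JSON.stringify`, and `JSON.parse` are real APIs.

```javascript
const original = Buffer.from('foobar');

// Node.js's built-in form: {"type":"Buffer","data":[...]}
const asArray = JSON.stringify({ payload: original });

// Hand-rolled alternative: the same bytes as a Base64 string.
const asBase64 = JSON.stringify({ payload: original.toString('base64') });

console.log(asBase64);                         // {"payload":"Zm9vYmFy"}
console.log(asArray.length > asBase64.length); // true

// Hypothetical reviver: turns the {type: 'Buffer', data: [...]} shape
// back into a Buffer and leaves every other value untouched.
function reviveBuffers(key, value) {
  const isBufferShape =
    value !== null &&
    typeof value === 'object' &&
    value.type === 'Buffer' &&
    Array.isArray(value.data);
  return isBufferShape ? Buffer.from(value.data) : value;
}

const restored = JSON.parse(asArray, reviveBuffers).payload;
console.log(restored.equals(original)); // true
```

The array form is self-describing thanks to the "type" field, which the bare Base64 string is not; the price is a noticeably longer document.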

It's quite interesting that JavaScript has the btoa function, yet no built-in JSON serializations use Base64. This paragraph from Binary strings on MDN (cached since the page has been removed), particularly its reference to "multiple conversions", may shed some light on that.

In the past, [manipulating raw binary data] had to be simulated by treating the raw data as a string and using the charCodeAt() method to read the bytes from the data buffer (i.e., using binary strings). However, this is slow and error-prone, due to the need for multiple conversions (especially if the binary data is not actually byte-format data, but, for example, 32-bit integers or floats).

mindplay-dk commented 2 years ago

It's quite interesting that JavaScript has the btoa function, yet no built-in JSON serializations use Base64

Probably because btoa has some ugly limitations

Things like toDataURL in Canvas presumably must use a different base64 encoder internally.

Also, if you had object properties containing buffers, these would get serialized as strings - which could be misleading. I mean, there's nothing about a string that safely tells you whether that string is base64 or just some other string.

This makes me wonder if we should recommend something that could be identified as binary?

I'm thinking Data URLs.

These are pretty universal by now as well - and it's arguably both safer, more useful, and more human-readable for binary data to be represented as a data URI than a bare base64 string.

Compare this:

{"pic": "R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7"}

With this:

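{"pic": "data:image/gif;base64,R0lGODdhMAAwAPAAAAAAAP///ywAAAAAMAAwAAAC8IyPqcvt3wCcDkiLc7C0qwyGHhSWpjQu5yqmCYsapyuvUUlvONmOZtfzgFzByTB10QgxOR0TqBQejhRNzOfkVJ+5YiUqrXF5Y5lKh/DeuNcP5yLWGsEbtLiOSpa/TPg7JpJHxyendzWTBfX0cxOnKPjgBzi4diinWGdkF8kjdfnycQZXZeYGejmJlZeGl9i2icVqaNVailT6F5iJ90m6mvuTS4OK05M0vDk0Q4XUtwvKOzrcd3iq9uisF81M1OIcR7lEewwcLp7tuNNkM3uNna3F2JQFo97Vriy/Xl4/f1cf5VWzXyym7PHhhx4dbgYKAAA7"}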

The latter ticks a lot of boxes:

✅ It's clear (to machines + humans) that it's base64 encoded. (probably even safe enough to auto-decode in clients.)
✅ It specifies the content-type: you don't have to know or detect it. (probably safer in browser context, too.)
✅ It's web-friendly: you can inject this directly into an <img src="..."> etc.
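The "auto-decode in clients" idea could look roughly like the sketch below. `toDataUrl`, `reviveDataUrls`, and the regular expression are hypothetical helpers, not part of any spec or library, and the six sample bytes are only the GIF signature, not a complete image.

```javascript
// Hypothetical helper: wrap bytes in a Base64 data URL.
function toDataUrl(mediaType, bytes) {
  return `data:${mediaType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Matches "data:<media type>;base64,<payload>" and nothing else.
const BASE64_DATA_URL = /^data:([^,]*);base64,([A-Za-z0-9+/]*={0,2})$/;

// Hypothetical reviver: decode any string that is a Base64 data URL,
// leave every other value (including ordinary strings) untouched.
function reviveDataUrls(key, value) {
  if (typeof value !== 'string') return value;
  const match = BASE64_DATA_URL.exec(value);
  if (match === null) return value;
  return { mediaType: match[1], bytes: Buffer.from(match[2], 'base64') };
}

const json = JSON.stringify({
  name: 'pixel',
  pic: toDataUrl('image/gif', Buffer.from('GIF87a')),
});

const doc = JSON.parse(json, reviveDataUrls);
console.log(doc.name);                        // pixel
console.log(doc.pic.mediaType);               // image/gif
console.log(doc.pic.bytes.toString('latin1')); // GIF87a
```

Because the prefix is unambiguous, the reviver needs no knowledge of which properties hold binary data.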

There is RFC 2397 providing a formal specification that we could refer to - although this looks a little outdated:

The "data:" URL scheme is only useful for short values

According to MDN docs, Opera 11 did have a 64KB limit - but all modern browsers support at least 256MB, so this doesn't seem relevant anymore.

JSON itself has some practical size limitations either way, and you probably wouldn't/shouldn't embed hundreds of MB of data in JSON blobs, using base64, or anything else for that matter. Should the spec specify a size limit? Perhaps suggesting external URLs as an alternative, pointing to larger resources for clients to download after parsing the JSON.
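To put a number on the overhead when thinking about size limits: padded Base64 turns every 3 input bytes into 4 characters. A small sketch, where `base64Length` is an illustrative helper name:

```javascript
// Length in characters of padded Base64 output for n input bytes.
function base64Length(n) {
  return 4 * Math.ceil(n / 3);
}

// 3 MiB of binary data becomes 4 MiB of Base64 text.
console.log(base64Length(3 * 1024 * 1024)); // 4194304

// Cross-check the formula against Node.js's encoder for a few sizes.
const sizes = [0, 1, 2, 3, 1024];
const allMatch = sizes.every(
  (n) => Buffer.alloc(n).toString('base64').length === base64Length(n)
);
console.log(allMatch); // true
```

So the encoded payload is about a third larger than the raw bytes, before counting the data URL prefix or any JSON escaping.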

The only downside I can see is that data URLs may be less well supported on the server side than plain base64. I'm sure every language has at least one package for this by now, though.

Under any circumstances, this would be a recommendation, right? Not a requirement.

jordanbtucker commented 2 years ago

The btoa function is designed to work on binary strings, which are sequences of Unicode code points in the range of U+0000 through U+00FF (i.e. the Latin-1 character set), where each code point represents a single byte. So, I wouldn't call it an ugly limitation but a necessary one. What byte should ✓ (U+2713) represent?

toDataURL, on the other hand, converts the canvas to an image format, like PNG, and then encodes those raw bytes as Base64. So, btoa and toDataURL work on two different types of input.
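The btoa restriction can be seen directly, along with one common workaround: encode the text as UTF-8 first and feed btoa one character per byte. This is only a sketch; `utf8ToBase64` and `base64ToUtf8` are illustrative names, and it assumes a runtime with global `btoa`, `atob`, `TextEncoder`, and `TextDecoder` (browsers and recent Node.js versions).

```javascript
// btoa rejects any character above U+00FF.
let rejected = false;
try {
  btoa('\u2713');
} catch (err) {
  rejected = true;
}
console.log(rejected); // true

// Workaround: UTF-8 encode, then build a binary string with one
// character per byte so that every code unit fits in 0x00-0xFF.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Reverse direction: binary string back to bytes, then UTF-8 decode.
function base64ToUtf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64('\u2713')); // 4pyT
console.log(base64ToUtf8('4pyT') === '\u2713'); // true
```

The answer to "what byte should U+2713 represent" is therefore a choice of text encoding, which is exactly the extra conversion step btoa leaves to the caller.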

Using data: URLs is an interesting thought. It would allow the string to hold more information than just the binary data. But what about when the data does not have a MIME type, for example an AES encryption key? ~I know that technically it would be data:;base64,hJxB...aso=~*, but does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64? Is that really necessary if two JSON5 applications have already negotiated that the key property is encoded as Base64?

And, yes, additions to the spec would be interoperability recommendations, similar to the ones you find in RFC 8259.

* A data URL with an unspecified MIME type implicitly has a MIME type of text/plain;charset=US-ASCII according to RFC 2397.

mindplay-dk commented 2 years ago

does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64?

No, but that's already worth something in itself, I think.

Having data with no content-type is probably less common than having data with a known type. I don't think "some data has no type" is really an argument against having a type for everything else?

And in that marginal case, it's a recommendation - you don't have to follow it if it doesn't make sense for your use-case.

Is that really necessary if two JSON5 applications have already negotiated that the key property is encoded as Base64?

No, but that same argument would work against a date format standard recommendation - if two applications have already negotiated that they're going to use the date property encoded in RFC 3339 format, the date format recommendation isn't useful either. (You seemed to support that idea?)

I think both of these recommendations would be useful - there are oodles of fun and interesting ways to encode both dates and binary data. Often, people will pick the one they know and happen to have close at hand - there's often no compelling reason to pick one format over another, so this would help with that choice.

It would simplify things if projects aligned towards one way of encoding these types - opening up to MIME types via data URLs would provide a safe way to encode and embed a lot of data formats, both binary and text.

That's just one guy's opinion of course. Would love to hear from other contributors. 🙂

tracker1 commented 2 years ago

I like the data URL as well... I would think that strings would be UTF-8 encoded into a Uint8Array first, with type text/utf8, and a Buffer or Uint8Array would be binary... Other typed arrays would be javascript/TYPEarray.

mindplay-dk commented 2 years ago

Come to think of it...

... what about when the data does not have a MIME type, for example an AES encryption key? I know that technically it would be data:;base64,hJxB...aso=, but does that data:;base64, intro, 13 additional characters, really add anything beyond indicating that the data is encoded as Base64?

Actually, I believe this does add something: an extra layer of validation and explicitness.

Some base64 data is indistinguishable from text - that is, your app might expect base64, but somebody put a string in there that just happens to be valid base64, and decodes to some nonsense data, which triggers an obscure error further up the call stack, which could be very difficult to debug.

So yeah, that little 13 character preamble does have the benefit of letting somebody explicitly indicate base64 data.
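A quick sketch of that ambiguity: an ordinary word can pass a Base64 syntax check and decode to bytes nobody intended, whereas a check for the preamble rejects it. `looksLikeBase64` and `isExplicitBase64` are illustrative helpers, not an existing API.

```javascript
// A loose Base64 check: right alphabet, length a multiple of 4.
function looksLikeBase64(s) {
  return s.length % 4 === 0 && /^[A-Za-z0-9+/]*={0,2}$/.test(s);
}

// An ordinary word passes and decodes to meaningless bytes.
console.log(looksLikeBase64('test'));                       // true
console.log(Buffer.from('test', 'base64').toString('hex')); // b5eb2d

// Requiring a "data:...;base64," preamble makes the intent explicit.
function isExplicitBase64(s) {
  const match = /^data:[^,]*;base64,/.exec(s);
  return match !== null && looksLikeBase64(s.slice(match[0].length));
}

console.log(isExplicitBase64('test'));                   // false
console.log(isExplicitBase64('data:;base64,gICAgA==')); // true
```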

Still, this would be a recommendation - you can deviate if it doesn't make sense for a given use case.

jordanbtucker commented 2 years ago

@tracker1 Strings don't necessarily need to be encoded as UTF-8 since JSON5 already has a string type, which is defined as a sequence of Unicode code points. JSON5 documents themselves are recommended to be encoded as UTF-8 however.

If you want to store the original UTF-8 representation of text in a Base64 data URL, and let's say that text is the HTML string <h1>Hello, World!</h1>, then the data URL would look like this:

data:text/html;charset=utf-8;base64,PGgxPkhlbGxvLCBXb3JsZCE8L2gxPg==

Also, specifying Uint8Array and TypedArray is a JavaScript-centric way of looking at things. JSON5 is meant to be used on all platforms, not just within the JavaScript ecosystem.

jordanbtucker commented 2 years ago

@mindplay-dk So, there's a snag with using data:;base64, for generic binary data. According to RFC 2397, a data URL with an unspecified MIME type implicitly has a MIME type of text/plain;charset=US-ASCII.

So, recommending data:;base64, for binary data with an unspecified MIME type actually goes against the data URL spec, and these URLs should be interpreted as plain text.
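In code, the RFC 2397 default shows up as soon as a parser fills in the omitted media type. The following is a sketch of a parser for Base64 data URLs only; `parseBase64DataUrl` is a hypothetical name, and non-Base64 (percent-encoded) payloads are deliberately left out.

```javascript
// Default media type from RFC 2397, used when the URL omits one.
const DEFAULT_MEDIA_TYPE = 'text/plain;charset=US-ASCII';
const BASE64_FLAG = ';base64';

// Hypothetical parser covering only Base64 data URLs.
function parseBase64DataUrl(url) {
  const comma = url.indexOf(',');
  if (!url.startsWith('data:') || comma === -1) {
    throw new Error('not a data URL');
  }
  const header = url.slice('data:'.length, comma);
  if (!header.endsWith(BASE64_FLAG)) {
    throw new Error('this sketch only handles ;base64 payloads');
  }
  const mediaType = header.slice(0, header.length - BASE64_FLAG.length);
  return {
    mediaType: mediaType === '' ? DEFAULT_MEDIA_TYPE : mediaType,
    bytes: Buffer.from(url.slice(comma + 1), 'base64'),
  };
}

const implicit = parseBase64DataUrl('data:;base64,gICAgA==');
console.log(implicit.mediaType); // text/plain;charset=US-ASCII

const explicit = parseBase64DataUrl('data:application/octet-stream;base64,gICAgA==');
console.log(explicit.mediaType); // application/octet-stream
console.log(explicit.bytes.equals(implicit.bytes)); // true
```

The bytes are identical in both cases; only the declared interpretation differs, which is why the empty media type is a poor fit for generic binary data.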

mindplay-dk commented 2 years ago

Right, time has not been good to this ol' standard.

Perhaps it would be helpful to also recommend not using an empty MIME type?

Honestly, it's the first time I've ever seen a data: URI with the MIME type omitted - I didn't even know that was possible.

And now that I know what the default is, it makes sense why nobody uses this. If you were actually including ASCII data, you would probably be better off using a MIME type like text/plain; charset=us-ascii to be explicit about it, since most likely a person would expect UTF-8 for text content today.

(For regular UTF-8 content, of course we can just use plain JSON strings rather than data: URIs anyhow.)

It's sort of a marginal case, I think? Probably a more common use-case will be embedding an image.

And if someone needs to embed an AES encryption key, a MIME type like application/x-aes likely still has value to a person, in terms of making the JSON easier to understand, even if all it indicates to a program is "binary data".

tracker1 commented 2 years ago

binary/octet-string would probably be an appropriate MIME type for general binary-encoded data.

jordanbtucker commented 2 years ago

The official IANA MIME type for arbitrary binary data is application/octet-stream as defined in RFC 2046. So, if you want to encode the bytes 80 80 80 80 (hexadecimal) as arbitrary binary data in Base64 in a data URL, you should use data:application/octet-stream;base64,gICAgA==.
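That URL can be reproduced in Node.js with a couple of lines, reading the four bytes as hexadecimal 0x80 values (the variable names are illustrative):

```javascript
// The four bytes 0x80 0x80 0x80 0x80 as generic binary data.
const keyBytes = Buffer.from([0x80, 0x80, 0x80, 0x80]);

const url = `data:application/octet-stream;base64,${keyBytes.toString('base64')}`;
console.log(url); // data:application/octet-stream;base64,gICAgA==

// Decoding the payload after the comma gives the bytes back.
const payload = url.slice(url.indexOf(',') + 1);
console.log(Buffer.from(payload, 'base64').toString('hex')); // 80808080
```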

I'd also like to point out some prior art regarding interoperability of JSON and JSON5 documents, which is what these recommendations are about. JSON Schema is the de facto standard for data contracts, validation, linting, and code completion of JSON documents, and it works just as well for JSON5. It's interesting that JSON Schema defines a contentEncoding and a contentMediaType property, yet it doesn't define a format for data URLs like it does other string formats like dates, email addresses, etc.

For example, a JSON5 document that represents an image file may look like this:

{
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
}

It could have a JSON Schema (as a JSON5 document) like this:

{
  $schema: 'https://json-schema.org/draft/2020-12/schema',
  $id: 'https://example.com/image.schema.json5',
  title: 'Image File',
  description: 'An image with its filename',
  type: 'object',
  properties: {
    filename: {
      type: 'string',
    },
    content: {
      type: 'string',
      contentEncoding: 'base64',
      contentMediaType: 'image/png',
    },
  },
}

Granted, this forces all images to be PNGs.
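As far as I understand the 2020-12 draft, contentEncoding and contentMediaType are annotations that validators do not have to enforce by default, so an application may end up acting on them itself. A rough sketch of such a check follows; `decodeContent` is a hypothetical helper, not part of any JSON Schema library.

```javascript
// Hypothetical application-level check that honours the schema's
// contentEncoding annotation for a single string property.
function decodeContent(propertySchema, value) {
  if (propertySchema.type !== 'string' || typeof value !== 'string') {
    throw new TypeError('expected a string property');
  }
  if (propertySchema.contentEncoding !== 'base64') {
    throw new Error('this sketch only handles base64');
  }
  const wellFormed =
    value.length % 4 === 0 && /^[A-Za-z0-9+/]*={0,2}$/.test(value);
  if (!wellFormed) {
    throw new Error('not valid Base64');
  }
  return {
    mediaType: propertySchema.contentMediaType,
    bytes: Buffer.from(value, 'base64'),
  };
}

// The "content" property schema from the example above.
const contentSchema = {
  type: 'string',
  contentEncoding: 'base64',
  contentMediaType: 'image/png',
};

const decoded = decodeContent(contentSchema, 'KBMPttgrVnXInj4j1ae+jw==');
console.log(decoded.mediaType);    // image/png
console.log(decoded.bytes.length); // 16
```

The sketch does not verify that the decoded bytes really are a PNG; the media type is taken from the schema on trust.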

However, if you were to use data URLs like this:

{
  filename: 'image-01.png',
  content: 'data:image/png;base64,KBMPttgrVnXInj4j1ae+jw==',
}

then your schema could look like this:

{
  $schema: 'https://json-schema.org/draft/2020-12/schema',
  $id: 'https://example.com/image.schema.json5',
  title: 'Image File',
  description: 'An image with its filename',
  type: 'object',
  properties: {
    filename: {
      type: 'string',
    },
    content: {
      type: 'string',
      format: 'data-url',
    },
  },
}

but then you'd be using a non-standard data-url value for the format field. However, you gain the ability to represent more than just PNG files.

Granted, you aren't forced to use the JSON Schema contentMediaType property. You could just specify the media type in the JSON5 document directly, like this:

{
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
  mediaType: 'image/png',
}

Anyway, the point I'm getting at is that JSON Schema doesn't have native support for data URLs, but it does have native support for Base64 strings and media types.
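The two document shapes discussed here carry the same information, so converting between them is mechanical. A sketch with hypothetical helper names (`toDataUrlForm`, `toSplitForm`), assuming the payload is always Base64:

```javascript
// { filename, content, mediaType } -> { filename, content: 'data:...' }
function toDataUrlForm({ filename, content, mediaType }) {
  return { filename, content: `data:${mediaType};base64,${content}` };
}

// { filename, content: 'data:...' } -> { filename, content, mediaType }
function toSplitForm({ filename, content }) {
  const match = /^data:([^,]+);base64,(.*)$/.exec(content);
  if (match === null) {
    throw new Error('expected a Base64 data URL with a media type');
  }
  return { filename, content: match[2], mediaType: match[1] };
}

const split = {
  filename: 'image-01.png',
  content: 'KBMPttgrVnXInj4j1ae+jw==',
  mediaType: 'image/png',
};

const combined = toDataUrlForm(split);
console.log(combined.content);
// data:image/png;base64,KBMPttgrVnXInj4j1ae+jw==

const roundTripped = toSplitForm(combined);
console.log(JSON.stringify(roundTripped) === JSON.stringify(split)); // true
```

Which form to recommend is then mostly a question of tooling: the split form lines up with JSON Schema's existing keywords, while the data URL form keeps the type attached to the value itself.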