tc39 / proposal-arraybuffer-base64

TC39 proposal for Uint8Array<->base64/hex
https://tc39.github.io/proposal-arraybuffer-base64/
MIT License

Uint8Array to/from base64 and hex

base64 is a common way to represent arbitrary binary data as ASCII. JavaScript has Uint8Arrays to work with binary data, but no built-in mechanism to encode that data as base64, nor to take base64'd data and produce a corresponding Uint8Array. This is a proposal to fix that. It also adds methods for converting between hex strings and Uint8Arrays.

It is currently at stage 3 of the TC39 process: it is ready for implementations. See this issue for current status.

Try it out on the playground.

Spec text is available here, and test262 tests in this PR.

Implementers may be interested in the open-source simdutf library, which provides a fast implementation of a base64 decoder which matches Uint8Array.fromBase64(string) (including handling of whitespace) when it is called without specifying any options. As of this writing it only works on latin1 strings, but a utf16 version may be coming.

Basic API

let arr = new Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]);
console.log(arr.toBase64());
// 'SGVsbG8gV29ybGQ='
console.log(arr.toHex());
// '48656c6c6f20576f726c64'
let string = 'SGVsbG8gV29ybGQ=';
console.log(Uint8Array.fromBase64(string));
// Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100])

string = '48656c6c6f20576f726c64';
console.log(Uint8Array.fromHex(string));
// Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100])

This would add Uint8Array.prototype.toBase64/Uint8Array.prototype.toHex and Uint8Array.fromBase64/Uint8Array.fromHex methods. The latter pair would throw if given a string which is not properly encoded.

Base64 options

Additional options are supplied in an options bag argument:

The hex methods do not take any options.
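A minimal sketch of how the base64 options bag might be used, assuming an engine that already ships the proposal (the feature check makes the helper return null elsewhere). The option names alphabet and lastChunkHandling appear later in this document; omitPadding is taken from the proposal's spec text, and demoOptions is a hypothetical helper name used only for illustration.

```javascript
// Hypothetical helper: exercises the base64 options bag when the engine
// supports the proposal, and returns null otherwise.
function demoOptions() {
  if (typeof Uint8Array.fromBase64 !== 'function') return null;

  const bytes = new Uint8Array([251, 255, 191]);
  const result = {
    standard: bytes.toBase64(),                         // '+/+/'
    urlSafe: bytes.toBase64({ alphabet: 'base64url' }), // '-_-_'
    // omitPadding drops the trailing '=' characters when encoding.
    unpadded: new Uint8Array([102]).toBase64({ omitPadding: true }), // 'Zg'
    strictRejected: false,
  };

  // 'Zh==' has non-zero padding bits: accepted by default,
  // rejected when lastChunkHandling is "strict".
  result.loose = Array.from(Uint8Array.fromBase64('Zh=='));
  try {
    Uint8Array.fromBase64('Zh==', { lastChunkHandling: 'strict' });
  } catch (e) {
    result.strictRejected = e instanceof SyntaxError;
  }
  return result;
}

console.log(demoOptions());
```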

Writing to an existing Uint8Array

The Uint8Array.prototype.setFromBase64 method allows writing to an existing Uint8Array. Like the TextEncoder encodeInto method, it returns a { read, written } pair.

let target = new Uint8Array(8);
let { read, written } = target.setFromBase64('Zm9vYmFy');
assert.deepStrictEqual([...target], [102, 111, 111, 98, 97, 114, 0, 0]);
assert.deepStrictEqual({ read, written }, { read: 8, written: 6 });

This method takes an optional final options bag with the same options as above.

As with encodeInto, there is no explicit support for writing to a specified offset of the target, but you can accomplish that by creating a subarray.
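The subarray approach could look roughly like the following. This is a sketch that assumes an engine with setFromBase64 available (it returns null otherwise); setFromBase64At is a hypothetical helper name, not part of the proposal.

```javascript
// Hypothetical helper: decodes `base64` into `target` starting at `offset`
// by writing through a subarray view, which shares memory with `target`.
function setFromBase64At(target, offset, base64) {
  if (typeof target.setFromBase64 !== 'function') return null; // proposal not available
  return target.subarray(offset).setFromBase64(base64);
}

const buffer = new Uint8Array(8);
const outcome = setFromBase64At(buffer, 2, 'Zm9v'); // 'Zm9v' encodes "foo"
console.log(outcome, [...buffer]);
// With engine support: { read: 4, written: 3 } and
// buffer contents [0, 0, 102, 111, 111, 0, 0, 0]
```

Note that read and written are relative to the input string and to the subarray, not to the original target.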

Uint8Array.prototype.setFromHex is the same except for hex.

Streaming

There is no explicit support for streaming. However, it is relatively straightforward to do efficiently in userland on top of this API, with support for all the same options as the underlying functions.
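As an illustration of the userland approach, here is a minimal sketch of a streaming encoder. It is not the proposal's own streaming code: streamToBase64 and encodeChunk are hypothetical names, it covers encoding only, it does not pass options through, and the btoa fallback is an assumption added so the sketch also runs on engines that do not yet ship toBase64.

```javascript
// Encode one complete Uint8Array to base64. Uses the proposed method when
// the engine provides it; the btoa fallback is only for older engines and
// is suitable for small chunks only (it spreads bytes into arguments).
const encodeChunk = (bytes) =>
  typeof bytes.toBase64 === 'function'
    ? bytes.toBase64()
    : btoa(String.fromCharCode(...bytes));

// Hypothetical helper: yields base64 text for a sequence of byte chunks.
// Only multiples of 3 bytes are encoded eagerly, so padding can appear
// only in the final piece; up to 2 leftover bytes are carried forward.
function* streamToBase64(chunks) {
  let carry = new Uint8Array(0);
  for (const chunk of chunks) {
    const joined = new Uint8Array(carry.length + chunk.length);
    joined.set(carry, 0);
    joined.set(chunk, carry.length);
    const usable = joined.length - (joined.length % 3);
    if (usable > 0) yield encodeChunk(joined.subarray(0, usable));
    carry = joined.slice(usable);
  }
  if (carry.length > 0) yield encodeChunk(carry);
}

const message = new Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]);
const pieces = [message.subarray(0, 4), message.subarray(4, 5), message.subarray(5)];
const streamed = [...streamToBase64(pieces)].join('');
console.log(streamed); // 'SGVsbG8gV29ybGQ='
```

A streaming decoder can be built the same way by buffering characters until a multiple of 4 is available.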

FAQ

What variation exists among base64 implementations in standards, in other languages, and in existing JavaScript libraries?

I have a whole page on that, with tables and footnotes and everything. There is relatively little room for variation, but languages and libraries manage to explore almost all of the room there is.

To summarize, base64 encoders can vary in the following ways:

and decoders can vary in the following ways:

What alphabets are supported?

For base64, you can specify either base64 or base64url for both the encoder and the decoder.

For hex, both lowercase and uppercase characters (including mixed within the same string) will decode successfully. Output is always lowercase.

How are the extra padding bits handled?

If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits, which means you have an extra 4 or 2 bits which don't encode anything.

Per the RFC, decoders MAY reject input strings where the padding bits are non-zero. Here, non-zero padding bits are silently ignored unless lastChunkHandling: "strict" is specified.
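The arithmetic above can be checked with a short sketch. Both function names are hypothetical, and hasZeroPaddingBits assumes the standard alphabet, valid characters, and no whitespace; it illustrates the kind of check a strict decoder performs rather than reproducing the spec's algorithm.

```javascript
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// For an input of `byteLength` bytes, describe the final partial chunk.
function finalChunkInfo(byteLength) {
  const leftover = byteLength % 3; // 0, 1, or 2 bytes in the last chunk
  if (leftover === 0) return { chars: 0, unusedBits: 0 };
  const chars = leftover + 1; // 1 byte -> 2 chars, 2 bytes -> 3 chars
  return { chars, unusedBits: chars * 6 - leftover * 8 };
}

// Are the unused low bits of the last data character all zero?
function hasZeroPaddingBits(base64) {
  const data = base64.replace(/=+$/, ''); // drop '=' padding
  const rem = data.length % 4;
  if (rem === 0) return true; // no partial chunk, so no padding bits
  if (rem === 1) throw new RangeError('not a valid base64 length');
  const unusedBits = rem === 2 ? 4 : 2;
  const last = ALPHABET.indexOf(data[data.length - 1]);
  return (last & ((1 << unusedBits) - 1)) === 0;
}

console.log(finalChunkInfo(1)); // { chars: 2, unusedBits: 4 }
console.log(finalChunkInfo(2)); // { chars: 3, unusedBits: 2 }
console.log(hasZeroPaddingBits('Zg==')); // true
console.log(hasZeroPaddingBits('Zh==')); // false: 'h' sets one of the 4 unused bits
```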

How is whitespace handled?

The encoders do not output whitespace. The hex decoder does not allow it as input. The base64 decoder allows ASCII whitespace anywhere in the string.

How are other characters handled?

The presence of any other characters causes an exception.

Why are these synchronous?

In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous.

Why just these encodings?

While other string encodings exist, none are nearly as commonly used as these two.

See issues #7, #8, and #11.

Why not just use atob and btoa?

Those methods take and return strings, rather than translating between a string and a Uint8Array.
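For comparison, this is roughly what converting between a Uint8Array and base64 looks like today with atob and btoa, going through an intermediate "binary string" in each direction. It is a sketch of a common workaround, and the spread into String.fromCharCode is only suitable for small inputs.

```javascript
const bytes = new Uint8Array([72, 101, 108, 108, 111]); // "Hello"

// Encode: bytes -> binary string (one char per byte) -> base64.
const binaryString = String.fromCharCode(...bytes);
const encoded = btoa(binaryString);

// Decode: base64 -> binary string -> bytes.
const decoded = Uint8Array.from(atob(encoded), (ch) => ch.charCodeAt(0));

console.log(encoded);      // 'SGVsbG8='
console.log([...decoded]); // [72, 101, 108, 108, 111]
```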

Why not TextEncoder?

base64 is not a text encoding format; there are no code points involved. So despite fitting with the type signature of TextEncoder/TextDecoder, base64 encoding and decoding is not a conceptually appropriate thing for those APIs to do.

That's also been the consensus when it's come up previously.

What if I just want to encode a portion of an ArrayBuffer?

Uint8Arrays can be partial views of an underlying buffer, so you can create such a view and invoke .toBase64 on it.
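A short sketch of that pattern follows. The toBase64 wrapper is a hypothetical helper that calls the proposed method when the engine has it; the btoa fallback is an assumption added so the example also runs on engines without the proposal.

```javascript
// Calls the proposed method when available; otherwise falls back to btoa.
const toBase64 = (bytes) =>
  typeof bytes.toBase64 === 'function'
    ? bytes.toBase64()
    : btoa(String.fromCharCode(...bytes));

// "Hello World" as bytes, backed by its own ArrayBuffer.
const whole = new Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]);

// View of 5 bytes starting at offset 6 ("World") over the same ArrayBuffer.
// No bytes are copied.
const view = new Uint8Array(whole.buffer, 6, 5);

console.log(toBase64(view)); // 'V29ybGQ='
```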