mathiasbynens / utf8.js

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.
https://git.io/utf8js
MIT License

The README should probably mention that output only looks like UTF-8, but isn't actual UTF-8 #42

Open gh-andre opened 2 years ago

gh-andre commented 2 years ago

This module encodes a string to look like a UTF-8 string, which may be used for online UTF-8 demos, but as far as bytes are concerned, which is important for hashing, etc, the output is not actually UTF-8.

Take your README example with the copyright character.

// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'

Each \xXX escape sequence in JavaScript produces a standalone code point, so \xA9 is natively represented as UTF-16 in JavaScript (well, UCS-2, really), which can be seen here:

console.log(Buffer.from('\xA9', 'utf16le'))

, which yields a code point U+00A9 in little endian notation:

<Buffer a9 00>

This is how one can generate an actual UTF-8 sequence. Either of these will work (the default encoding is UTF-8):

console.log(Buffer.from('\xA9'))
console.log(Buffer.from('\xA9', 'utf8'))

, and will produce UTF-8 bytes, which are good for hashing and other uses where it matters:

<Buffer c2 a9>
<Buffer c2 a9>

For example, this yields the correct MD5 hash of \xA9 represented as UTF-8, because update() applies the same string-to-UTF-8 conversion that Buffer.from() uses by default:

const crypto = require('crypto');
console.log(crypto.createHash('md5').update('\xA9').digest('hex'))

, which is a541ecda3d4c67f1151cad5075633423. This will not produce the correct hash:

console.log(crypto.createHash('md5').update(utf8.encode('\xA9')).digest('hex'))

, which actually hashes <Buffer c3 82 c2 a9> and yields 1b4c0262ce2f67450c4ecb3026ab1350.
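The mismatch above can be reproduced in Node.js without installing the package. This is only a sketch: `toByteString` and `md5` are hypothetical helpers, and `toByteString` is a stand-in that is assumed to match `utf8.encode()` for well-formed input (one character per UTF-8 byte). Passing 'latin1' as the input encoding to update() hashes those byte values as-is.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for utf8.encode(): one character per UTF-8 byte.
function toByteString(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// Hypothetical helper: MD5 hex digest of a string in the given input encoding.
function md5(data, inputEncoding) {
  return crypto.createHash('md5').update(data, inputEncoding).digest('hex');
}

const byteString = toByteString('\xA9'); // '\xC2\xA9'

// Strings default to UTF-8, so the byte string gets encoded a second time.
const doubleEncoded = Buffer.from(byteString, 'utf8'); // bytes c3 82 c2 a9

// Treating each character as one byte recovers the real UTF-8 bytes.
const actualBytes = Buffer.from(byteString, 'latin1'); // bytes c2 a9

console.log(doubleEncoded, actualBytes);
console.log(md5('\xA9') === md5(byteString, 'latin1')); // same bytes are hashed
console.log(md5('\xA9') === md5(byteString));           // different bytes are hashed
```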

This fooled even Microsoft, who referenced utf8 in their docs, which only works because their input is always ASCII, which makes utf8.encode() a no-op.

https://docs.microsoft.com/en-us/rest/api/eventhub/generate-sas-token#nodejs

mathiasbynens commented 2 years ago

Do you want to propose a patch?

gh-andre commented 2 years ago

I came across this module because it was used in my project in the way Microsoft describes in their docs, which didn't work for non-ASCII characters. The fix was simple in my case - just use Buffer.from(), which natively produces UTF-8, but it took me a bit to realize what was going on, and I thought a note in the README clarifying that generated strings are intended for display purposes would save time for other people.

I wouldn't be the best person to propose a description for this, though, because I'm not familiar with project history and intent. If you think it's clear enough what encode/decode produce, please, go ahead and close this issue. Sorry about the issue, in this case.

charmander commented 2 years ago

This module encodes a string to look like a UTF-8 string, which may be used for online UTF-8 demos, but as far as bytes are concerned, which is important for hashing, etc, the output is not actually UTF-8.

The output is UTF-8 represented as a string with one byte per character, which can be easy to misuse – as you’ve seen – but is very much a thing. It’s the input format escape and btoa expect, for example. If you ever have it in Node.js for some reason, it’s the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').
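To illustrate the "one byte per character" format, here is a rough sketch for Node.js. It assumes the long-standing `unescape(encodeURIComponent(...))` idiom as a way to build such a byte string without typed arrays; it is not taken from utf8.js itself, and the variable names are only for illustration.

```javascript
const text = 'あ'; // U+3042, UTF-8 bytes E3 81 82

// Percent-encode as UTF-8, then turn each %XX back into a single character.
const byteString = unescape(encodeURIComponent(text)); // '\xE3\x81\x82'

// Every character is a byte value, so 'latin1' (alias 'binary') maps it to real bytes.
const bytes = Buffer.from(byteString, 'latin1');
console.log(bytes);

// Base64 of those bytes; btoa() takes the byte string directly where it exists.
const base64 = bytes.toString('base64');
console.log(base64);
if (typeof btoa === 'function') {
  console.log(btoa(byteString) === base64);
}

// The reverse direction: byte string back to the original text.
const roundTrip = decodeURIComponent(escape(byteString));
console.log(roundTrip === text);
```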

As seen in the readme:

utf8.js has been tested in at least Chrome 27-39, Firefox 3-34, Safari 4-8, Opera 10-28, IE 6-11, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.11, PhantomJS 1.9.0, and Rhino 1.7RC4.

this package supports environments that don’t even have typed arrays.

In Node.js and modern browsers, UTF-8 encoding directly to bytes is built in as Buffer and TextEncoder.

gh-andre commented 2 years ago

@charmander

If you ever have it in Node.js for some reason, it’s the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').

Note that binary is a synonym for latin1 (ISO-8859-1), so what happens in Buffer.from(utf8.encode('あ'), 'latin1') is that the encode call yields the JS string "\u00e3\u0081\u0082", which Buffer.from() then encodes in Latin-1 as E3 81 82. This matches the sequence generated by Buffer.from('あ', 'utf8'). I can see how it works out for people where Buffer-like functionality is not available.

In new code one can also use new TextEncoder().encode('あ'), which yields a Uint8Array with UTF-8 code unit values.
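A quick sketch comparing the two built-in routes, assuming a Node.js version where TextEncoder and TextDecoder are available as globals; the variable names are only for illustration.

```javascript
const text = 'あ';

// TextEncoder always produces UTF-8 and returns a Uint8Array.
const viaTextEncoder = new TextEncoder().encode(text);

// Buffer.from() with 'utf8' (the default) produces the same bytes as a Buffer.
const viaBuffer = Buffer.from(text, 'utf8');

console.log(viaTextEncoder, viaBuffer);

// Buffer.compare() returns 0 when both hold identical bytes.
const sameBytes = Buffer.compare(viaBuffer, viaTextEncoder) === 0;
console.log(sameBytes);

// Decoding the bytes gives back the original string.
const decoded = new TextDecoder('utf-8').decode(viaTextEncoder);
console.log(decoded === text);
```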