Open gh-andre opened 2 years ago
Do you want to propose a patch?
I came across this module because it was used in my project in the way Microsoft describes in their docs, which didn't work for non-ASCII characters. The fix was simple in my case (just use Buffer.from(), which natively produces UTF-8), but it took me a bit to realize what was going on, and I thought a note in the README clarifying that the generated strings are intended for display purposes would save time for other people.
I wouldn't be the best person to propose a description for this, though, because I'm not familiar with the project's history and intent. If you think it's clear enough what encode/decode produce, please go ahead and close this issue. Sorry about the issue, in that case.
This module encodes a string to look like a UTF-8 string, which may be used for online UTF-8 demos, but as far as bytes are concerned, which is important for hashing, etc, the output is not actually UTF-8.
The output is UTF-8 represented as a string with one byte per character, which can be easy to misuse, as you've seen, but is very much a thing. It's the input format escape and btoa expect, for example. If you ever have it in Node.js for some reason, it's the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').
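A minimal sketch of that round trip, with the output of utf8.encode simulated by its literal result (the utf8 package itself is not loaded here):

```javascript
// For text = '©' (U+00A9), utf8.encode(text) would return the two-character
// string '\xC2\xA9', one character per UTF-8 byte (simulated here).
const encoded = '\xC2\xA9';

// The 'binary' encoding maps each character back to a single byte:
const bytes = Buffer.from(encoded, 'binary');
console.log(bytes); // <Buffer c2 a9>

// ...which are exactly the UTF-8 bytes of the original string:
console.log(bytes.equals(Buffer.from('©', 'utf8'))); // true
```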
As seen in the readme:
utf8.js has been tested in at least Chrome 27-39, Firefox 3-34, Safari 4-8, Opera 10-28, IE 6-11, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.11, PhantomJS 1.9.0, and Rhino 1.7RC4.
This package supports environments that don't even have typed arrays. In Node.js and modern browsers, UTF-8 encoding directly to bytes is built in as Buffer and TextEncoder.
@charmander
If you ever have it in Node.js for some reason, it’s the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').
Note that binary is a synonym for latin1 (ISO-8859-1), so what happens in Buffer.from(utf8.encode('あ'), 'latin1') is that the encode call yields the JS string "\u00e3\u0081\u0082", which is then encoded in Latin-1 as E3 81 82 by Buffer.from(). This matches the sequence generated by Buffer.from('あ', 'utf8'). I can see how it works out for people where Buffer-like functionality is not available.
In new code one can also use new TextEncoder().encode('あ'), which yields a Uint8Array with UTF-8 code unit values.
This module encodes a string to look like a UTF-8 string, which may be used for online UTF-8 demos, but as far as bytes are concerned, which is important for hashing, etc, the output is not actually UTF-8.
Take your README example with the copyright character. Each \xXX escape sequence in JavaScript produces a standalone code point, so \xA9 is the single code point U+00A9. JavaScript stores it natively as UTF-16 (well, UCS-2, really), which in little-endian notation comes out as the bytes A9 00.
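This can be seen with a quick sketch:

```javascript
// '\xA9' is the single code point U+00A9 ('©'), not a raw byte:
console.log('\xA9' === '\u00A9'); // true
console.log('\xA9' === '©'); // true

// JS strings are UTF-16 internally; serializing as UTF-16LE shows the
// little-endian code unit A9 00:
console.log(Buffer.from('\xA9', 'utf16le')); // <Buffer a9 00>
```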
This is how one can generate an actual UTF-8 sequence: pass the string to Buffer.from(), either with no encoding argument (the default encoding is UTF-8) or with 'utf8' explicitly. Either way it will produce UTF-8 bytes, which are good for hashing and other uses where the actual bytes matter.
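The stripped examples were presumably along these lines (both calls produce the same UTF-8 bytes):

```javascript
// Default string encoding for Buffer.from() is UTF-8:
console.log(Buffer.from('\xA9')); // <Buffer c2 a9>

// Naming the encoding explicitly gives the same result:
console.log(Buffer.from('\xA9', 'utf8')); // <Buffer c2 a9>
```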
For example, hashing the string '\xA9' directly yields the correct MD5 hash of \xA9 represented as UTF-8, a541ecda3d4c67f1151cad5075633423, because update() applies the same UTF-8 transformation Buffer.from() uses. Hashing utf8.encode('\xA9') will not produce the correct hash: update() encodes the already-encoded string a second time, so what actually gets hashed is <Buffer c3 82 c2 a9>, which yields 1b4c0262ce2f67450c4ecb3026ab1350.
This fooled even Microsoft, who referenced utf8 in their docs; it only works because their input is always ASCII, which makes utf8.encode() a no-op. https://docs.microsoft.com/en-us/rest/api/eventhub/generate-sas-token#nodejs