mathiasbynens / utf8.js

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.
https://git.io/utf8js
MIT License

Deal in scalar values #3

Closed mathiasbynens closed 9 years ago

mathiasbynens commented 10 years ago

Should we disallow lone surrogates as per https://github.com/whatwg/encoding/commit/4abe74d1400c5ab8913c5f229b59b237ae5aac51? cc @jwerle

jwerle commented 10 years ago

Is the goal to be whatwg/encoding compliant?

mathiasbynens commented 10 years ago

Ok, there is now a separate WTF-8 encoding specified (thanks to @SimonSapin) for UTF-8 with added support for lone-surrogate byte sequences. JS library: https://github.com/mathiasbynens/wtf-8

So let’s make utf8.js deal with actual UTF-8 as per the encoding standard.
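
A rough sketch of what that difference looks like in practice, assuming the utf8 (utf8.js) and wtf-8 npm packages and their encode() APIs; the exact error text is approximate:

// Sketch only: assumes the utf8 (utf8.js) and wtf-8 npm packages.
var utf8 = require('utf8');
var wtf8 = require('wtf-8');

var lone = 'a\uDE7Ab'; // contains a lone (unpaired) trailing surrogate

// utf8.js follows the Encoding Standard: lone surrogates are not scalar
// values, so encoding throws.
try {
  utf8.encode(lone);
} catch (error) {
  console.log(error.message); // roughly: Lone surrogate U+DE7A is not a scalar value
}

// WTF-8 encodes the lone surrogate as a three-byte sequence instead.
console.log(wtf8.encode(lone).length); // 5: 'a' + 3 surrogate bytes + 'b'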

jwerle commented 10 years ago

awesome!

jwerle commented 9 years ago

nice!

cblair commented 8 years ago

I had a question about this enhancement: I'm getting the 'Error: Lone surrogate XXX is not a scalar value' error after some tests that feed in random strings. The string is not valid as is, but I think it should still be encodable into a valid string. The error itself is correct, but does utf8.js have the ability to encode lone surrogate code points when they're not in a pair?

The https://simonsapin.github.io/wtf-8/ page states that WTF-8 (great name) 'encodes surrogate code points if they are not in a pair'. I don't think this was added to utf8.js, but hopefully I'm missing something.

E.g., the string decoded below has byteArray values 237, 185, 186, 242, 135, 162, 159, 226, 188, 154, 237, 185, 186, 226, 188, 154 at utf8.js:70 (the string itself contains lone surrogates, so it doesn't render cleanly here). At byteIndex == 3, codePoint == 56954 (0xDE7A), which throws the error.

For reproducibility, the original string (value) is "7bm68oein+K8mu25uuK8mg==", base64-encoded. I hit this error with the following JS:

utf8.decode(atob(value))

Thanks so much.
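
A slightly expanded version of that reproduction, tying the base64 value back to the byte values listed above (a sketch, assuming the utf8 npm package and an environment that provides atob(); the exact error message may vary by version):

// Sketch of the reproduction above.
var utf8 = require('utf8');

var value = '7bm68oein+K8mu25uuK8mg==';
var byteString = atob(value);

// The code units of the decoded byte string match the byteArray values
// quoted above: 237, 185, 186, 242, 135, 162, 159, 226, 188, 154, ...
console.log(byteString.split('').map(function (c) { return c.charCodeAt(0); }));

// The first three bytes (0xED 0xB9 0xBA) would decode to the lone
// surrogate U+DE7A, so utf8.decode() throws.
try {
  utf8.decode(byteString);
} catch (error) {
  console.log(error.message); // roughly: Lone surrogate U+DE7A is not a scalar value
}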

SimonSapin commented 8 years ago

https://simonsapin.github.io/wtf-8/#motivation has some background on what surrogate code points are and how they came to be.

The Unicode standard defines byte sequences of the form <ED, A0...BF, 80...BF> (which would otherwise represent the surrogate code points U+D800...U+DFFF) to be ill-formed in UTF-8. utf8.js deliberately rejects them.
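
For the specific bytes from the report above, the arithmetic is roughly as follows (plain UTF-8 bit manipulation, not utf8.js internals):

// How the first three bytes from the report map to a surrogate code point.
var bytes = [0xED, 0xB9, 0xBA];
var codePoint = ((bytes[0] & 0x0F) << 12) |
                ((bytes[1] & 0x3F) << 6) |
                (bytes[2] & 0x3F);
console.log(codePoint.toString(16).toUpperCase()); // 'DE7A'

// U+DE7A falls inside the surrogate range U+D800...U+DFFF, so the byte
// sequence is ill-formed UTF-8 and utf8.js rejects it.
console.log(codePoint >= 0xD800 && codePoint <= 0xDFFF); // true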

Similarly, JavaScript strings are arbitrary sequences of 16-bit code units and are not necessarily well-formed UTF-16.

WTF-8 is designed to be able to encode any sequence of 16-bit code units (such as a JS string) in a way that is compatible with UTF-8, but it is not UTF-8. You probably shouldn’t be using WTF-8.

cblair commented 8 years ago

Thanks Simon, that makes sense. Our trouble is that we're using utf8.js in some of our functional test code, and we want to feed these illegal code points into our production code. This exception stops us from doing that.

The short-term solution for us is to pin to version 2.0.0. But maybe there'll be an allowed case for doing decodes like this in the future. Maybe I'll propose a PR at some point. :)

SimonSapin commented 8 years ago

Out of curiosity: why do you need this?

cblair commented 8 years ago

We have a String class in some internal code that's driven by our own specs and internal requirements. So we have to allow utf8 decodes of bad input in our test code, so that we can verify our String code catches it. The utf8 code is being too full-featured; our internal code has to be the one implementing that feature!

I admit it's kind of a fringe use case. But it's kind of nice to be able to say specifically, 'Decode this, despite some bad stuff I'm putting in.' For our use, we just want the bytes and the right amount of them, right or wrong.
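
For that narrower goal ("we just want the bytes and the right amount of them"), one possible sketch sidesteps utf8.decode() entirely and reads the raw byte values straight out of the atob() output (assuming an environment that provides atob()):

// Sketch: recover the raw bytes without any UTF-8 validation.
var value = '7bm68oein+K8mu25uuK8mg==';
var byteString = atob(value);

var bytes = new Uint8Array(byteString.length);
for (var i = 0; i < byteString.length; i++) {
  bytes[i] = byteString.charCodeAt(i);
}

console.log(bytes.length); // 16 -- the right number of bytes, valid UTF-8 or not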