mathiasbynens / utf8.js

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.
https://git.io/utf8js
MIT License

Codepoint arrays and binary strings #7

Open devongovett opened 9 years ago

devongovett commented 9 years ago

What would you think about a PR to replace binary strings with arrays of bytes, or Buffers/typed arrays? e.g. accept arrays as input to the decoder, and produce them from the encoder.

Also, it would be nice to be able to pass arrays of code points to the encoder and receive an array of code points from the decoder instead of strings, perhaps as an option? Sometimes I need to do additional processing at the code point level, and it is probably a waste of time to decode the UTF-8 into a UCS-2 string and then walk that string again just to get the code points.
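A hypothetical shape for the code-point side of such an API (the function name is illustrative, not an actual utf8.js export) might be:

```javascript
// Hypothetical sketch: encode an array of code points directly to an
// array of bytes, skipping the intermediate UCS-2/UTF-16 string.
// Not part of utf8.js today.
function encodeCodePoints(codePoints) {
  const bytes = [];
  for (const cp of codePoints) {
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(
        0xe0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

// U+00E9 ('é') encodes as 0xC3 0xA9; U+1F4A9 ('💩') takes four bytes.
encodeCodePoints([0xe9]);    // [0xc3, 0xa9]
encodeCodePoints([0x1f4a9]); // [0xf0, 0x9f, 0x92, 0xa9]
```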

Thoughts? I'm happy to write PRs for this, just wanted to get your opinion first.

mathiasbynens commented 9 years ago

Sounds good, but I should rewrite this project first based on the exact algorithm in the Encoding Standard (see open issues).

devongovett commented 9 years ago

Hmm, looks like there is an implementation of that in the polyfill here. The algorithm as specified looks like it would be fairly slow, though. It might be better to write something different that still conforms to the spec, as the spec itself suggests, rather than using its algorithm directly.

Have you seen this? A port to JS might be worthwhile. It's small, fast, and correct.

What are the current differences between this library and the standard, in terms of behavior?

mathiasbynens commented 9 years ago

What are the current differences between this library and the standard, in terms of behavior?

The only difference is https://github.com/mathiasbynens/utf8.js/issues/3.

mathiasbynens commented 9 years ago

#3 is now fixed, so go ahead, @devongovett!

One thing that would be nice is backward compatibility with older browsers. Obviously IE6 won’t support typed arrays but it would be nice if utf8.js could fall back to byte strings (as currently used) gracefully. Thoughts?
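One way to degrade gracefully (a sketch, not utf8.js code) is to feature-detect typed arrays once and pick the output container accordingly:

```javascript
// Sketch of graceful degradation: use Uint8Array where available
// (IE10+), and fall back to plain arrays elsewhere (e.g. IE6-9).
var hasTypedArrays = typeof Uint8Array !== 'undefined';

function makeByteContainer(length) {
  return hasTypedArrays ? new Uint8Array(length) : new Array(length);
}

var buf = makeByteContainer(4);
buf[0] = 0xf0; // both container kinds support indexed writes
```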

devongovett commented 9 years ago

How about just using normal JS arrays if typed arrays aren't available? Or we could skip typed arrays entirely. The encoder doesn't know how big to make the buffer ahead of time anyway (unless we go through the string twice: once to measure the output length, and once to fill the buffer after allocating it), so the easiest implementation would use a normal resizable JS array internally and convert it to a typed array at the end. I'm not sure how much of a performance benefit returning typed arrays would have then. We could just always return a plain JS array, and if the consumer of the library wants a typed array, they can easily convert it themselves. What do you think?
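The pattern being described — grow a plain array during encoding, leave conversion to the caller — can be sketched with a toy ASCII-only encoder (`bytesForAscii` is illustrative, not a utf8.js function):

```javascript
// Sketch: encode into a resizable plain array (the output length isn't
// known up front without a second pass), and return the plain array.
function bytesForAscii(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    bytes.push(str.charCodeAt(i) & 0x7f); // toy encoder: ASCII input only
  }
  return bytes; // plain array; the caller converts if desired
}

var plain = bytesForAscii('hi');   // [0x68, 0x69]
var typed = new Uint8Array(plain); // one-line conversion on the consumer side
```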

mathiasbynens commented 9 years ago

Sounds good to me.

MicahZoltu commented 7 years ago

What is the status of this? I have a byte array I received off the wire and I would like to be able to just pass it directly to this function without having to make a copy that turns each byte into an escaped hex value in a string.
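For context, the copy in question is the conversion a caller currently has to do by hand: widen each byte into one character of a "binary string" before a byte-string decoder like `utf8.decode` will accept it. A sketch of that workaround:

```javascript
// The extra copy being complained about: bytes received off the wire
// must first become a binary string (one character per byte, char
// codes 0-255) before being handed to a byte-string decoder.
function bytesToBinaryString(bytes) {
  var chunks = [];
  for (var i = 0; i < bytes.length; i++) {
    chunks.push(String.fromCharCode(bytes[i]));
  }
  return chunks.join('');
}

// [0xc3, 0xa9] is the UTF-8 encoding of 'é'; the intermediate
// binary string is '\u00c3\u00a9'.
var byteString = bytesToBinaryString([0xc3, 0xa9]);
// utf8.decode(byteString) would then yield 'é'.
```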

samal-rasmussen commented 7 years ago

Alright, I need this, so I took a stab at implementing it: https://github.com/mathiasbynens/utf8.js/pull/28

wmertens commented 7 years ago

I am wondering if this could be an efficient way to store binary data as UTF-8 strings, in contexts where UTF-8 is allowed but raw binary is not.

So, given a bunch of binary data: convert it to a valid UTF-8 string, escaping any invalid sequences, and add padding plus a padcount at the end. If the binary data happens to already be valid UTF-8, it would be stored with only 1 byte of overhead (the padcount), and if the binary data is FEFFFEFF... I suppose it would escape every byte :)

Sort of idle musing, I suppose that any space savings are dwarfed by the CPU overhead.
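For what it's worth, in modern engines you can cheaply test whether a byte blob is already valid UTF-8 (and so would take only the 1-byte padcount overhead in the scheme above) using `TextDecoder` in fatal mode. A sketch, not part of utf8.js:

```javascript
// Check whether a byte sequence already forms valid UTF-8: a fatal
// TextDecoder throws on any invalid sequence instead of emitting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

isValidUtf8([0x68, 0x69]); // true: the bytes of "hi"
isValidUtf8([0xfe, 0xff]); // false: 0xFE/0xFF never appear in UTF-8
```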