nodeca / pako

high speed zlib port to javascript, works in browser & node.js
http://nodeca.github.io/pako/
MIT License

BOM is not removed when converting inflated output to string #62

Closed · yuretz closed this 9 years ago

yuretz commented 9 years ago

It looks like pako ignores the BOM and leaves it in place when converting the inflated buffer to a string. I think it would be better if the BOM were either removed automatically or, at least, there were an option to do this, because it's a source of hard-to-catch bugs where things look different from what they really are. Please run the following code snippet to see what I mean.

var pako = require('pako');

// deflate UTF-8 encoded "hello" prefixed with a BOM
var buf = pako.gzip([0xEF, 0xBB, 0xBF, 0x68, 0x65, 0x6C, 0x6C, 0x6F]);

// inflate it again, converting the result to a string
var str = pako.inflate(new Uint8Array(buf), {to: 'string'});

console.log('The result string looks like this: "' + str + '"');
console.log('The first character is: 0x' + str.charCodeAt(0).toString(16) +
  ' and not 0x' + 'h'.charCodeAt(0).toString(16) + ' ("h") as you would expect');
// Produced output:
//
// The result string looks like this: "hello"
// The first character is: 0xfeff and not 0x68 ("h") as you would expect

puzrin commented 9 years ago

It's outside pako's responsibility to tamper with text content. It operates on binary data. Everything else should be done externally.
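
(A minimal sketch of that external handling, built on the snippet from the issue above: pako hands the decoded text back as-is, so a BOM survives as U+FEFF and can be stripped by the caller.)

var pako = require('pako');

// gzip UTF-8 "hello" prefixed with a BOM, as in the original snippet
var buf = pako.gzip([0xEF, 0xBB, 0xBF, 0x68, 0x65, 0x6C, 0x6C, 0x6F]);
var str = pako.inflate(new Uint8Array(buf), {to: 'string'});

// strip a leading U+FEFF externally, after pako is done
if (str.charCodeAt(0) === 0xFEFF) {
  str = str.slice(1);
}
console.log(str.charCodeAt(0).toString(16)); // 68 ("h")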

yuretz commented 9 years ago

Thanks for your reply. Since pako already provides automatic conversion from UTF-8 to JS strings, that could in some sense qualify as "cheating with the text content" too. Besides, I believe people rarely expect weird, invisible characters at the start of their ordinary JavaScript strings. And it's entirely up to the encoder whether to include a BOM, so this can bite you in rather unexpected ways.

Well, it's of course totally up to you to decide whether it's a bug or not, but would it be possible to include a small note about BOM decoding in the documentation, so that people don't fall into the same trap I did?

puzrin commented 9 years ago

IMHO that's like a notice saying "don't dry your cat in the microwave oven". I don't like adding such things.

yuretz commented 9 years ago

:) Well, I don't agree these are on the same level, but OK.

yuretz commented 9 years ago

I was thinking a bit more about this issue, and I still think it would be an advantage for users of your library to provide a little more detail on how the UTF-8 to UTF-16 conversion is done when using the {to: 'string'} option, both in general and in particular regarding how a byte order mark at the beginning is treated. If you check the relevant RFCs 3629 and 2781, you may notice that the BOM character may be stripped, and in some cases stripping is suggested, since it's not considered part of the object being encoded/decoded.

IMHO, regardless of what you decide to do (or not do) about it, wouldn't it be nice to simply let the users of your library know what happens to their UTF-8 encoded data during this conversion (especially since you use a custom-written conversion routine rather than a standard, well-specified language facility or a widely used library)? Or do you expect them to guess how you do it?

Do you seriously think that writing a couple of sentences in your documentation about this small but important detail of your library would make it look bad or silly?
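
(For reference: a standard, well-specified facility for this conversion does exist today in the form of TextDecoder, which strips a leading UTF-8 BOM by default; it was not yet widely available when this thread was written. A minimal sketch, assuming an environment that provides TextDecoder:)

var pako = require('pako');

var buf = pako.gzip([0xEF, 0xBB, 0xBF, 0x68, 0x65, 0x6C, 0x6C, 0x6F]);

// inflate to raw bytes instead of using {to: 'string'} ...
var bytes = pako.ungzip(new Uint8Array(buf));

// ... and decode with TextDecoder, which drops a leading UTF-8 BOM
// unless {ignoreBOM: true} is passed
var str = new TextDecoder('utf-8').decode(bytes);
console.log(str.charCodeAt(0) === 'h'.charCodeAt(0)); // true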

puzrin commented 9 years ago

Do you seriously think that writing a couple of sentences in your documentation about this small but important detail of your library would make it look bad or silly?

Yes. UTF-8 <-> UTF-16 conversion is standard, and it looks silly to me to explain such things in the readme. Also, it's obvious that deflate <-> inflate <-> deflate <-> inflate should produce the same result, and BOM stripping would not allow that.
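
(To illustrate the round-trip point with a small sketch: binary inflate preserves every byte, BOM included, so deflating the result reproduces equivalent input; stripping the BOM during string conversion would break that symmetry.)

var pako = require('pako');

// "hello" prefixed with a UTF-8 BOM: 8 bytes in total
var original = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x65, 0x6C, 0x6C, 0x6F]);

// binary round trip keeps all 8 bytes
var restored = pako.inflate(pako.deflate(original));
console.log(restored.length); // 8

// if {to: 'string'} silently dropped the BOM, re-deflating the string
// would encode only the 5 bytes of "hello", and the round trip
// would no longer reproduce the original data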