Closed yuretz closed 9 years ago
It's out of pako responsibility to cheat with text content. It operates with binary data. Everything else should be done externally.
Thanks for your reply. Since pako already provides automatic conversion from UTF-8 to js strings, in some way it could be qualified as "cheating with the text content". Besides that, I believe people rarely expect getting some weird, invisible characters in the start of their normal javascript strings. And it's completely up to the encoder whether to include BOM or not, so this thing might bite you in a bit unexpected way.
Well, it's of course totally up to you to decide if it's a bug or not, but would it be possible to maybe include a small note about BOM decoding in the documentation, so that people don't fall in the same trap as I did?
IMHO that's like a notice "don't dry your cat in microwave owen". I don't like to add such things.
:) Well I don't agree these are on the same level, but OK.
I was thinking a bit more about this issue, and I still think that it's still would be an advantage for the users of your library to provide a little bit more details on how utf-8 to utf-16 conversion is done when using {to: 'string'}
option, both in general, and in particular regarding how byte order mark in the beginning is treated. If you check out relevant RFCs 3629 and 2781 you may notice that BOM character may be and in some case is suggested to be stripped since it's not considered to be a part of the object being encoded/decoded.
IMHO, regardless of what you decide to do (or not to do) about it, wouldn't it be nice to just let the users of your library know about what happens to their utf-8 encoded data when you do this conversion (especially when you use your custom-written conversion routine, and not any standard well-specified language facility, or widely used library)? Or do you expect them to guess how you do it?
Do you seriously think that writing couple of sentences in your documentation regarding this small but important detail of you library will make it look bad or silly?
Do you seriously think that writing couple of sentences in your documentation regarding this small but important detail of you library will make it look bad or silly?
Yes. utf8 <-> utf16 conversion is standard, and it looks silly for me to exaplain such things in readme. Also, it's obvious that deflate <-> inflate <-> deflate <-> inflate should produce the same result, and BOM stripping will now allow that.
It looks like pako is ignoring BOM and leaves it where it was when converting the inflated buffer to string. I think it would be better if BOM was either removed automatically, or, at least there was an option to do this, because it's the source of those hard-to-catch kind of bugs, when things sometimes look differently from what they are in reality. Please, run the following code snippet to see what I mean.