Encoding issues with Russian, Arabic and Hebrew

ghost commented 9 years ago

Hello,

I'm using msgpack-js-browser: https://github.com/creationix/msgpack-js-browser

I added some unit tests. Here is the full list of test values:

        var tests = [
          "$", "¢", "€", // Currency signs
          "임창진", "おはよう", "您好！", // Korean/Japanese/Chinese (Simplified)
          "مرحبا!" , "صبح به خیر", "گڈ مارننگ", // Arabic/Persian/Urdu
          "À toute à l'heure chérie, je t'aime !", // French
          "Здравствуйте!", "добрае Раніца", "добро утро", "Қайырлы Таң", "Добро утро", "Өглөөний мэнд", "Добро јутро", "субҳ Ба Хайр", // Russian/Belarusian/Bulgarian/Kazakh/Macedonian/Mongolian/Serbian/Tajik
          "குட் மார்னிங்", // Tamil
          "గుడ్ మార్నింగ్", // Telugu
          "สวัสดี", // Thai
          "সুপ্রভাত", "गुड मॉर्निंग", "អរុណសួស្តី", "မင်္ဂလာနံနက်ခင်းပါ", "සුභ උදෑසනක්", // Bengali/Hindi/Khmer/Myanmar/Sinhala
          "सुप्रभात", // Nepali
          "ດີຕອນເຊົ້າ", // Lao
          "ਸ਼ੁਭ ਸਵੇਰ", // Punjabi
          "സുപ്രഭാതം", // Malayalam
          "शुभप्रभात", // Marathi
          "დილა", // Georgian
          "Ụtụtụ Ọma", // Igbo
          "ಗುಡ್ ಮಾರ್ನಿಂಗ್", // Kannada
          "καλημέρα", // Greek
          "ગુડ મોર્નિંગ", // Gujarati
          "בוקר טוב", "אַ גוטנ מאָרגן", // Hebrew/Yiddish
          "\/#@',.:)(„”’[å»ÛÁØ]–…∞", // Custom characters
          "\xF0\x9F\x98\x81", "\xF0\x9F\x98\x84", "😁", // Emoji
          true, false, null, undefined,
          0, 1, -1, 2, -2, 4, -4, 6, -6,
          0x10, -0x10, 0x20, -0x20, 0x40, -0x40,
          0x80, -0x80, 0x100, -0x100, 0x200, -0x200,
          0x1000, -0x1000, 0x10000, -0x10000,
          0x20000, -0x20000, 0x40000,-0x40000,
          10, 100, 1000, 10000, 100000, 1000000,
          -10, -100, -1000, -10000, -100000, -1000000,
          'hello', 'world', Buffer('Hello'), Buffer('World'),
          [1,2,3], [], {name: 'Tim', age: 29}, {},
          {a: 1, b: 2, c: [1, 2, 3]}, [[],[]]
        ];
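A list like this is usually checked by round-tripping each value through the codec and comparing the result with the input. The following is a minimal sketch of such a check for the string cases only. The helper name `roundTripFailures` is hypothetical, and the standard `TextEncoder`/`TextDecoder` pair is used as a stand-in codec; it is not the msgpack-js-browser API, whose encode/decode functions would be passed in instead.

```javascript
// Stand-in codec (an assumption for this sketch): plain UTF-8 via the
// standard TextEncoder/TextDecoder. Pass the library's own functions instead.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8", { fatal: true });

// Hypothetical helper: returns every input whose round trip changed it.
function roundTripFailures(strings, encode, decode) {
  const failures = [];
  for (const input of strings) {
    const output = decode(encode(input));
    if (output !== input) {
      failures.push({ input, output });
    }
  }
  return failures;
}

const samples = ["$", "€", "Здравствуйте!", "مرحبا!", "בוקר טוב", "😁"];
const failures = roundTripFailures(
  samples,
  (s) => encoder.encode(s),
  (bytes) => decoder.decode(bytes)
);
console.log(failures.length === 0 ? "all round trips OK" : failures);
```

Reporting the decoded output next to the input makes it easier to see which code point ranges are affected.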

[Screenshot of the test results, 2015-02-04]

As you can see, everything is working perfectly except:

Arabic/Persian/Urdu
Russian/Belarusian/Bulgarian/Kazakh/Macedonian/Mongolian/Serbian/Tajik
Hebrew/Yiddish

My question is: is this a problem with msgpack-js-browser or with the msgpack implementation? And if it is msgpack, will support for these languages be added in the future?

Thanks.

redboltz commented 9 years ago

Hi @pwnsdx, msgpack-c is the C and C++ implementation of msgpack. Are you using msgpack-c?

ghost commented 9 years ago

Hello redboltz,

Thanks for answering. No, I don't use msgpack-c. https://github.com/creationix/msgpack-js-browser looks like it has been dead for 2 years. Assuming that msgpack-js-browser respects the standard, I'm posting here to find out whether msgpack-c has the same problem.

redboltz commented 9 years ago

I'm not sure about msgpack-js-browser. If the ArrayBuffer's hex dump does not match the expected value, you can run the following tests:

Prepare the expected ArrayBuffer, then pack and unpack it. If the result differs from the original, the library itself probably has a problem.
Prepare the Arabic string, then set a breakpoint just before packing. If the buffer to be packed differs from the expected ArrayBuffer, the string encoding logic probably has a problem.
Prepare the expected ArrayBuffer, then decode it as a string. If the string differs from the expected one, the string decoding logic or the view probably has a problem.

Again, I don't know which of these parts belongs to the msgpack-js-browser library.
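For the tests above, the "expected" bytes have to come from a source that does not depend on the library under test. A minimal sketch, assuming the standard `TextEncoder` is available as that reference; `hexDump` and `sameBytes` are hypothetical helper names:

```javascript
// Hypothetical helper: format bytes as space-separated two-digit hex.
function hexDump(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// Hypothetical helper: byte-wise comparison of two Uint8Array values.
function sameBytes(a, b) {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

// The Arabic word from the test list, written with escapes to avoid
// bidirectional rendering issues in the source file.
const arabic = "\u0645\u0631\u062d\u0628\u0627";
const expected = new TextEncoder().encode(arabic);

console.log(hexDump(expected)); // d9 85 d8 b1 d8 ad d8 a8 d8 a7

// Decoding the reference bytes with a known-good decoder gives the baseline
// that the library's own string decoder can be compared against.
const decoded = new TextDecoder("utf-8").decode(expected);
console.log(decoded === arabic); // true
```

Comparing `hexDump(expected)` with a dump of the bytes the library writes shows whether the encoder or the decoder is at fault.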

Note:

msgpack-c supports the following msgpack format: https://github.com/msgpack/msgpack/blob/master/spec.md

msgpack-c maps std::string to str and std::vector<char> to bin. str is expected to be a UTF-8 string, but msgpack-c does not apply any special treatment to it. Keeping a str or std::string valid UTF-8 is the client's responsibility.
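To make the str format concrete: per the msgpack spec, a str value is a header carrying the length in bytes (not characters), followed by the raw UTF-8 bytes. The sketch below builds such a value by hand. `packStr` is a hypothetical name, not a function of msgpack-c or msgpack-js-browser, and `TextEncoder` is assumed as the UTF-8 source.

```javascript
// Hypothetical helper: encode one string as a msgpack str value.
function packStr(text) {
  const body = new TextEncoder().encode(text);
  const n = body.length; // length in bytes, not in characters

  let header;
  if (n < 32) {
    header = [0xa0 | n]; // fixstr: 101XXXXX
  } else if (n < 0x100) {
    header = [0xd9, n]; // str 8
  } else if (n < 0x10000) {
    header = [0xda, n >>> 8, n & 0xff]; // str 16, big-endian length
  } else {
    // str 32, big-endian length
    header = [0xdb, n >>> 24, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
  }

  const out = new Uint8Array(header.length + n);
  out.set(header, 0);
  out.set(body, header.length);
  return out;
}

// One Cyrillic character takes two bytes, so the header is 0xa2, not 0xa1.
console.log(Array.from(packStr("\u042f"), (b) => b.toString(16)));
```

Because the header counts bytes, a codec whose byte count and byte writer disagree for some script would produce a corrupt str value for that script.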

ghost commented 9 years ago

After doing what you said, it looks like the problem is here:

    Compressor.utf8 = {

        write: function(view, offset, string)
        {
            var byteLength = view.byteLength;

            for(var i = 0, l = string.length; i < l; i++) {

                var codePoint = string.charCodeAt(i);

                // One byte of UTF-8
                if (codePoint < 0x80) {
                    view.setUint8(offset++, codePoint >>> 0 & 0x7f | 0x00);
                    continue;
                }

                // Two bytes of UTF-8
                if (codePoint < 0x800) {
                    view.setUint8(offset++, codePoint >>> 6 & 0x1f | 0xc0);
                    view.setUint8(offset++, codePoint >>> 0 & 0x3f | 0x80);
                    continue;
                }

                // Three bytes of UTF-8
                if (codePoint < 0x10000) {
                    view.setUint8(offset++, codePoint >>> 12 & 0x0f | 0xe0);
                    view.setUint8(offset++, codePoint >>> 6  & 0x3f | 0x80);
                    view.setUint8(offset++, codePoint >>> 0  & 0x3f | 0x80);
                    continue;
                }

                // Four bytes of UTF-8
                if (codePoint < 0x110000) {
                    view.setUint8(offset++, codePoint >>> 18 & 0x07 | 0xf0);
                    view.setUint8(offset++, codePoint >>> 12 & 0x3f | 0x80);
                    view.setUint8(offset++, codePoint >>> 6  & 0x3f | 0x80);
                    view.setUint8(offset++, codePoint >>> 0  & 0x3f | 0x80);
                    continue;
                }

                console.error('Bad codepoint ' + codePoint);
            }
        },

        read: function(view, offset, length)
        {
            var string = '';

            for(var i = offset, end = offset + length; i < end; i++) {

                var byte = view.getUint8(i);

                // One byte character
                if ((byte & 0x80) === 0x00) {
                    string += String.fromCharCode(byte);
                    continue;
                }

                // Two byte character
                if ((byte & 0xe0) === 0xc0) {
                    string += String.fromCharCode(
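                        // Lead byte is 110xxxxx: five payload bits, so the mask is 0x1f.
                        // A 0x0f mask drops a bit and garbles U+0400..U+07FF.
                        ((byte & 0x1f) << 6) |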
                        (view.getUint8(++i) & 0x3f)
                    );
                    continue;
                }

                // Three byte character
                if ((byte & 0xf0) === 0xe0) {
                    string += String.fromCharCode(
                        ((byte & 0x0f) << 12) |
                        ((view.getUint8(++i) & 0x3f) << 6) |
                        ((view.getUint8(++i) & 0x3f) << 0)
                    );
                    continue;
                }

                // Four byte character
                if ((byte & 0xf8) === 0xf0) {
                    // fromCodePoint: four-byte sequences decode to values above 0xFFFF
                    string += String.fromCodePoint(
                        ((byte & 0x07) << 18) |
                        ((view.getUint8(++i) & 0x3f) << 12) |
                        ((view.getUint8(++i) & 0x3f) << 6) |
                        ((view.getUint8(++i) & 0x3f) << 0)
                    );
                    continue;
                }

                console.error('Invalid byte ' + byte.toString(16));
            }

            return string;
        },

        count: function(string)
        {
            var count = 0;

            for(var i = 0, l = string.length; i < l; i++) {

                var codePoint = string.charCodeAt(i);

                if (codePoint < 0x80) {
                    count += 1;
                    continue;
                }

                if (codePoint < 0x800) {
                    count += 2;
                    continue;
                }

                if (codePoint < 0x10000) {
                    count += 3;
                    continue;
                }

                if (codePoint < 0x110000) {
                    count += 4;
                    continue;
                }

                console.error('Bad codepoint ' + codePoint);
            }

            return count;
        }
    };

I will try to figure out why it is not working as expected. Thanks for the tips.
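One thing worth checking in a decoder of this shape is the mask applied to the lead byte of a two-byte sequence. That byte has the form 110xxxxx, so it carries five payload bits and needs the mask 0x1f. A narrower mask of 0x0f loses the top bit, which only matters for code points from U+0400 to U+07FF. That range covers Cyrillic, Hebrew and Arabic, while accented Latin and Greek letters sit below it, which would match the failures reported above. The sketch isolates that one step; `decodeTwoByte` is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: decode a two-byte UTF-8 sequence with a given lead mask.
function decodeTwoByte(lead, continuation, leadMask) {
  return String.fromCharCode(((lead & leadMask) << 6) | (continuation & 0x3f));
}

// Format a one-character string as U+XXXX for printing.
function describe(ch) {
  return "U+" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Cyrillic "З" (U+0417) is encoded as 0xd0 0x97.
console.log(describe(decodeTwoByte(0xd0, 0x97, 0x1f))); // U+0417
console.log(describe(decodeTwoByte(0xd0, 0x97, 0x0f))); // U+0017

// French "é" (U+00E9) is encoded as 0xc3 0xa9 and survives either mask.
console.log(describe(decodeTwoByte(0xc3, 0xa9, 0x1f))); // U+00E9
console.log(describe(decodeTwoByte(0xc3, 0xa9, 0x0f))); // U+00E9
```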

ghost commented 9 years ago

I switched to BinaryPack. https://github.com/binaryjs/js-binarypack https://github.com/binaryjs/node-binarypack

Everything works fine now. Thanks for your help.

msgpack / msgpack-c

Encoding issues with Russian, Arabic and Hebrew #213