protobufjs / bytebuffer.js

A fast and complete ByteBuffer implementation using either ArrayBuffers in the browser or Buffers under node.js.
http://dcode.io
Apache License 2.0

readUTF8StringBytes doesn't decode characters larger than 16 bits correctly #19

Closed. pwerry closed this issue 10 years ago.

pwerry commented 10 years ago

Simple to reproduce: write an emoji whose code point is larger than 16 bits (such as a smiley face) into a ByteBuffer using writeVString(), then try to read it back out using readVString(). The character comes back corrupted; it seems String.fromCharCode() isn't doing the right thing here. Note that if you (in node) call toBuffer().toString() after writing the emoji to the ByteBuffer, the contents are correct internally, so writing works as expected.

The workaround in node is to replace readVString() with an implementation that reads the size prefix and converts a slice of the buffer to a native Buffer.
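
Roughly, the workaround looks like this (just a sketch; the exact ByteBuffer helpers used here, readVarint32(), slice() and toBuffer(), may differ between versions):

// Node-only sketch: read the varint length prefix, then let Buffer do
// the UTF-8 decoding instead of decodeUTF8Char().
function readVStringNative(bb) {
    var len = bb.readVarint32();                   // size prefix written by writeVString()
    var buf = bb.slice(bb.offset, bb.offset + len) // view over the string bytes
                .toBuffer();                       // convert the slice to a native Buffer
    bb.offset += len;                              // advance past the string bytes
    return buf.toString('utf8');                   // Buffer decodes astral characters correctly
}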

cawilson commented 10 years ago

I've noticed as well that UTF-8 encoding is not handled correctly, and I've since modified the library locally to fix the issue. I just haven't had time to post what I did until now, so here goes:

To start with, the ECMAScript 6 specification defines two new String methods for correctly converting a Unicode character to its code point and a code point back to a String: String.prototype.codePointAt() and String.fromCodePoint(). At the time of writing, only Firefox (Aurora) 29 implements these APIs natively.

For everyone else, there are shims available; I used shims for those two methods in my solution.

These are required because the existing String.prototype.charCodeAt() and String.fromCharCode() do not handle all Unicode characters properly. JavaScript stores characters larger than 0xFFFF as two UTF-16 code units (a surrogate pair). These must be interpreted as a pair, and that's where charCodeAt() and fromCharCode() fall short: they only look at each individual code unit.

You can see this by creating a JavaScript string like this:

var testString = "\ud834\udd1e";

If you print this string you will get a single character, the Musical Symbol G Clef, but the length of the string is actually 2. The actual value of the character is U+1D11E, but it must be stored as a surrogate pair, hence 0xD834 and 0xDD1E.
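
You can see the difference in a console (using the shims where there is no native support):

var testString = "\ud834\udd1e";          // Musical Symbol G Clef, U+1D11E
testString.length;                        // 2 (two UTF-16 code units)
testString.charCodeAt(0).toString(16);    // "d834", only the leading surrogate
testString.codePointAt(0).toString(16);   // "1d11e", the full code point
String.fromCharCode(0x1D11E);             // wrong: the value is truncated to 16 bits
String.fromCodePoint(0x1D11E);            // "\ud834\udd1e", the correct surrogate pair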

To solve this, I made the following changes to properly support all valid Unicode code points. They rely on the String.fromCodePoint and String.prototype.codePointAt implementations mentioned above.

ByteBuffer.prototype.writeUTF8String():

// Existing:
ByteBuffer.prototype.writeUTF8String = function(str, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var start = offset;
    var encLen = ByteBuffer.calculateUTF8String(str); // See [1]
    this.ensureCapacity(offset+encLen);
    for (var i=0, j=str.length; i<j; ++i) {
        // [1] Does not throw since JS strings are already UTF8 encoded
        offset += ByteBuffer.encodeUTF8Char(str.charCodeAt(i), this, offset);
    }
    if (advance) {
        this.offset = offset;
        return this;
    } else {
        return offset-start;
    }
};

// Revised:
ByteBuffer.prototype.writeUTF8String = function (str, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var start = offset;
    var encLen = ByteBuffer.calculateUTF8String(str);
    var charCode;
    this.ensureCapacity(offset+encLen);
    for (var i = 0, j = str.length; i < j; ++i) {
        // Use the native API to read the current character's char
        // code and then test whether it's a leading/trailing
        // UTF-16 surrogate. We try the native API first because
        // reading the code point through the shim is slow.
        charCode = str.charCodeAt(i);
        if (charCode >= 0xD800 && charCode <= 0xDFFF) {
            // If this is a possible leading/trailing surrogate,
            // use the Shim to read the code point (if possible).
            charCode = str.codePointAt(i);
            // If the code point is within the valid range, skip the
            // next character, as JavaScript strings contain both the
            // leading and trailing surrogates and we don't need to
            // write the leading and trailing surrogates as we have
            // the full code point to write.
            if (charCode > 0xFFFF) {
                i++;
            }
        }
        offset += ByteBuffer.encodeUTF8Char(charCode, this, offset);
    }
    if (advance) {
        this.offset = offset;
        return this;
    } else {
        return offset-start;
    }
};

ByteBuffer.prototype.readUTF8String():

// Existing:
ByteBuffer.prototype.readUTF8String = function(chars, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var dec, result = "", start = offset;
    for (var i=0; i<chars; ++i) {
        dec = ByteBuffer.decodeUTF8Char(this, offset);
        offset += dec["length"];
        result += String.fromCharCode(dec["char"]);
    }
    if (advance) {
        this.offset = offset;
        return result;
    } else {
        return {
            "string": result,
            "length": offset-start
        }
    }
};

// Revised:
ByteBuffer.prototype.readUTF8String = function(chars, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var dec, charCode, result = "", start = offset;
    for (var i=0; i<chars; ++i) {
        dec = ByteBuffer.decodeUTF8Char(this, offset);
        offset += dec["length"];
        // Use the native API for characters less than or
        // equal to 0xFFFF as the native API is faster than
        // the Shim.
        charCode = dec["char"];
        if (charCode <= 0xFFFF) {
            result += String.fromCharCode(charCode);
        } else {
            result += String.fromCodePoint(charCode);
        }
    }
    if (advance) {
        this.offset = offset;
        return result;
    } else {
        return {
            "string": result,
            "length": offset-start
        }
    }
};

ByteBuffer.prototype.readUTF8StringBytes():

// Existing:
ByteBuffer.prototype.readUTF8StringBytes = function(length, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var dec, result = "", start = offset;
    length = offset + length; // Limit
    while (offset < length) {
        dec = ByteBuffer.decodeUTF8Char(this, offset);
        offset += dec["length"];
        result += String.fromCharCode(dec["char"]);
    }
    if (offset != length) {
        throw(new Error("Actual string length differs from the specified: "+((offset>length ? "+" : "")+offset-length)+" bytes"));
    }
    if (advance) {
        this.offset = offset;
        return result;
    } else {
        return {
            "string": result,
            "length": offset-start
        }
    }
};

// Revised:
ByteBuffer.prototype.readUTF8StringBytes = function(length, offset) {
    var advance = typeof offset === 'undefined';
    offset = typeof offset !== 'undefined' ? offset : this.offset;
    var dec, charCode, result = "", start = offset;
    length = offset + length; // Limit
    while (offset < length) {
        dec = ByteBuffer.decodeUTF8Char(this, offset);
        offset += dec["length"];
        // Use the native API for characters less than or
        // equal to 0xFFFF as the native API is faster than
        // the Shim.
        charCode = dec["char"];
        if (charCode <= 0xFFFF) {
            result += String.fromCharCode(charCode);
        } else {
            result += String.fromCodePoint(charCode);
        }
    }
    if (offset != length) {
        throw(new Error("Actual string length differs from the specified: "+((offset>length ? "+" : "")+offset-length)+" bytes"));
    }
    if (advance) {
        this.offset = offset;
        return result;
    } else {
        return {
            "string": result,
            "length": offset-start
        }
    }
};

Additionally, I wanted to know whether the string was actually valid UTF-8. UTF-8 strings should not contain leading or trailing UTF-16 surrogates that are not part of a pair; these are the code points from 0xD800 to 0xDFFF inclusive. I therefore throw an exception so that I can handle the situation appropriately (this could mean fixing the string, or encoding as UTF-16 and allowing the invalid code points).
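
For example (illustrative only), a string containing a lone surrogate is now rejected rather than silently encoded:

var broken = "\ud834";                  // leading surrogate with no trailing pair
// codePointAt(0) still returns 0xD834 because there is no valid pair, so the
// revised calculateUTF8String() below throws instead of emitting an invalid
// 3-byte sequence.
ByteBuffer.calculateUTF8String(broken); // throws Error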

ByteBuffer.calculateUTF8String():

// Existing:
ByteBuffer.calculateUTF8String = function(str) {
    str = ""+str;
    var bytes = 0;
    for (var i=0, k=str.length; i<k; ++i) {
        // Does not throw since JS strings are already UTF8 encoded
        bytes += ByteBuffer.calculateUTF8Char(str.charCodeAt(i));
    }
    return bytes;
};

// Revised:
ByteBuffer.calculateUTF8String = function(str) {
    str = ""+str;
    var bytes = 0, charCode;
    for (var i = 0, k = str.length; i < k; ++i) {
        // Use the native API to read the current character code
        // and then test whether it's a leading/trailing
        // UTF-16 surrogate. We try the native API first because
        // reading the code point through the shim is slow.
        charCode = str.charCodeAt(i);
        if (charCode >= 0xD800 && charCode <= 0xDFFF) {
            // If this is a possible leading/trailing surrogate,
            // use the Shim to read the code point (if possible).
            charCode = str.codePointAt(i);
            // If the code point is within the valid range, skip the
            // next character, as JavaScript strings contain both the
            // leading and trailing surrogates.
            if (charCode >= 0x10000) {
                i++;
            } else if (charCode >= 0xD800 && charCode <= 0xDFFF) {
                // If we failed to decode the code point, this is an invalid
                // UTF-8 character and therefore an invalid string.
                throw (new Error("Cannot calculate length of UTF8 character: charCode (" + charCode + ") is invalid without its surrogate pair"));
            }
        }
        bytes += ByteBuffer.calculateUTF8Char(charCode);
    }
    return bytes;
};

Lastly, according to the Wikipedia page on UTF-8 encoding (http://en.wikipedia.org/wiki/UTF-8):

The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

Therefore I revised the encode and decode functions to only handle characters up to 0x10FFFF:

ByteBuffer.decodeUTF8Char():

// Existing:
ByteBuffer.decodeUTF8Char = function(src, offset) {
    var a = src.readUint8(offset), b, c, d, e, f, start = offset, charCode;
    // ref: http://en.wikipedia.org/wiki/UTF-8#Description
    // It's quite huge but should be pretty fast.
    if ((a&0x80)==0) {
        charCode = a;
        offset += 1;
    } else if ((a&0xE0)==0xC0) {
        b = src.readUint8(offset+1);
        charCode = ((a&0x1F)<<6) | (b&0x3F);
        offset += 2;
    } else if ((a&0xF0)==0xE0) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        charCode = ((a&0x0F)<<12) | ((b&0x3F)<<6) | (c&0x3F);
        offset += 3;
    } else if ((a&0xF8)==0xF0) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        d = src.readUint8(offset+3);
        charCode = ((a&0x07)<<18) | ((b&0x3F)<<12) | ((c&0x3F)<<6) | (d&0x3F);
        offset += 4;
    } else if ((a&0xFC)==0xF8) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        d = src.readUint8(offset+3);
        e = src.readUint8(offset+4);
        charCode = ((a&0x03)<<24) | ((b&0x3F)<<18) | ((c&0x3F)<<12) | ((d&0x3F)<<6) | (e&0x3F);
        offset += 5;
    } else if ((a&0xFE)==0xFC) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        d = src.readUint8(offset+3);
        e = src.readUint8(offset+4);
        f = src.readUint8(offset+5);
        charCode = ((a&0x01)<<30) | ((b&0x3F)<<24) | ((c&0x3F)<<18) | ((d&0x3F)<<12) | ((e&0x3F)<<6) | (f&0x3F);
        offset += 6;
    } else {
        throw(new Error("Cannot decode UTF8 character at offset "+offset+": charCode (0x"+a.toString(16)+") is invalid"));
    }
    return {
        "char": charCode ,
        "length": offset-start
    };
};

// Revised:
ByteBuffer.decodeUTF8Char = function(src, offset) {
    var a = src.readUint8(offset), b, c, d, start = offset, charCode;
    // ref: http://en.wikipedia.org/wiki/UTF-8#Description
    // It's quite huge but should be pretty fast.
    if ((a&0x80)==0) {
        charCode = a;
        offset += 1;
    } else if ((a&0xE0)==0xC0) {
        b = src.readUint8(offset+1);
        charCode = ((a&0x1F)<<6) | (b&0x3F);
        offset += 2;
    } else if ((a&0xF0)==0xE0) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        charCode = ((a&0x0F)<<12) | ((b&0x3F)<<6) | (c&0x3F);
        offset += 3;
    } else if ((a&0xF8)==0xF0) {
        b = src.readUint8(offset+1);
        c = src.readUint8(offset+2);
        d = src.readUint8(offset+3);
        charCode = ((a&0x07)<<18) | ((b&0x3F)<<12) | ((c&0x3F)<<6) | (d&0x3F);
        offset += 4;
    } else {
        throw(new Error("Cannot decode UTF8 character at offset "+offset+": charCode (0x"+a.toString(16)+") is invalid"));
    }
    return {
        "char": charCode ,
        "length": offset-start
    };
};

ByteBuffer.encodeUTF8Char():

// Existing:
ByteBuffer.encodeUTF8Char = function(charCode, dst, offset) {
    var start = offset;
    // ref: http://en.wikipedia.org/wiki/UTF-8#Description
    // It's quite huge but should be pretty fast.
    if (charCode < 0) {
        throw(new Error("Cannot encode UTF8 character: charCode ("+charCode+") is negative"));
    }
    if (charCode < 0x80) {
        dst.writeUint8(charCode&0x7F, offset);
        offset += 1;
    } else if (charCode < 0x800) {
        dst.writeUint8(((charCode>>6)&0x1F)|0xC0, offset)
            .writeUint8((charCode&0x3F)|0x80, offset+1);
        offset += 2;
    } else if (charCode < 0x10000) {
        dst.writeUint8(((charCode>>12)&0x0F)|0xE0, offset)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+1)
            .writeUint8((charCode&0x3F)|0x80, offset+2);
        offset += 3;
    } else if (charCode < 0x200000) {
        dst.writeUint8(((charCode>>18)&0x07)|0xF0, offset)
            .writeUint8(((charCode>>12)&0x3F)|0x80, offset+1)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+2)
            .writeUint8((charCode&0x3F)|0x80, offset+3);
        offset += 4;
    } else if (charCode < 0x4000000) {
        dst.writeUint8(((charCode>>24)&0x03)|0xF8, offset)
            .writeUint8(((charCode>>18)&0x3F)|0x80, offset+1)
            .writeUint8(((charCode>>12)&0x3F)|0x80, offset+2)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+3)
            .writeUint8((charCode&0x3F)|0x80, offset+4);
        offset += 5;
    } else if (charCode < 0x80000000) {
        dst.writeUint8(((charCode>>30)&0x01)|0xFC, offset)
            .writeUint8(((charCode>>24)&0x3F)|0x80, offset+1)
            .writeUint8(((charCode>>18)&0x3F)|0x80, offset+2)
            .writeUint8(((charCode>>12)&0x3F)|0x80, offset+3)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+4)
            .writeUint8((charCode&0x3F)|0x80, offset+5);
        offset += 6;
    } else {
        throw(new Error("Cannot encode UTF8 character: charCode (0x"+charCode.toString(16)+") is too large (>= 0x80000000)"));
    }
    return offset-start;
};

// Revised:
ByteBuffer.encodeUTF8Char = function(charCode, dst, offset) {
    var start = offset;
    // ref: http://en.wikipedia.org/wiki/UTF-8#Description
    // It's quite huge but should be pretty fast.
    if (charCode < 0) {
        throw(new Error("Cannot encode UTF8 character: charCode ("+charCode+") is negative"));
    }
    if (charCode < 0x80) {
        dst.writeUint8(charCode&0x7F, offset);
        offset += 1;
    } else if (charCode < 0x800) {
        dst.writeUint8(((charCode>>6)&0x1F)|0xC0, offset)
            .writeUint8((charCode&0x3F)|0x80, offset+1);
        offset += 2;
    } else if (charCode < 0x10000) {
        dst.writeUint8(((charCode>>12)&0x0F)|0xE0, offset)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+1)
            .writeUint8((charCode&0x3F)|0x80, offset+2);
        offset += 3;
    } else if (charCode < 0x110000) {
        dst.writeUint8(((charCode>>18)&0x07)|0xF0, offset)
            .writeUint8(((charCode>>12)&0x3F)|0x80, offset+1)
            .writeUint8(((charCode>>6)&0x3F)|0x80, offset+2)
            .writeUint8((charCode&0x3F)|0x80, offset+3);
        offset += 4;
    } else {
        throw(new Error("Cannot encode UTF8 character: charCode (0x"+charCode.toString(16)+") is too large (>= 0x110000)"));
    }
    return offset-start;
};

ByteBuffer.calculateUTF8Char():

// Existing:
ByteBuffer.calculateUTF8Char = function(charCode) {
    if (charCode < 0) {
        throw(new Error("Cannot calculate length of UTF8 character: charCode ("+charCode+") is negative"));
    }
    if (charCode < 0x80) {
        return 1;
    } else if (charCode < 0x800) {
        return 2;
    } else if (charCode < 0x10000) {
        return 3;
    } else if (charCode < 0x200000) {
        return 4;
    } else if (charCode < 0x4000000) {
        return 5;
    } else if (charCode < 0x80000000) {
        return 6;
    } else {
        throw(new Error("Cannot calculate length of UTF8 character: charCode (0x"+charCode.toString(16)+") is too large (>= 0x80000000)"));
    }
};

// Revised:
ByteBuffer.calculateUTF8Char = function(charCode) {
    if (charCode < 0) {
        throw(new Error("Cannot calculate length of UTF8 character: charCode ("+charCode+") is negative"));
    }
    if (charCode < 0x80) {
        return 1;
    } else if (charCode < 0x800) {
        return 2;
    } else if (charCode < 0x10000) {
        return 3;
    } else if (charCode < 0x110000) {
        return 4;
    } else {
        throw(new Error("Cannot calculate length of UTF8 character: charCode (0x"+charCode.toString(16)+") is too large (>= 0x110000)"));
    }
};
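
As a quick sanity check of the whole chain (just a sketch, assuming the patched methods above and the code point shims are installed), an astral character should now round-trip without corruption:

var bb = new ByteBuffer();
bb.writeVString("\ud83d\ude00");              // U+1F600, encoded as 4 UTF-8 bytes
bb.flip();                                    // prepare the buffer for reading
var out = bb.readVString();
console.log(out === "\ud83d\ude00");          // true with the revised encode/decode
console.log(out.codePointAt(0).toString(16)); // "1f600"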

Hopefully that all makes sense and helps you with your problem as well.

dcodeIO commented 10 years ago

I've just committed ByteBuffer.js 3, which also takes this issue into account. I would be very happy if you could find some time to test it.

adambom commented 10 years ago

I'm seeing a problem with the cloud emoji (among others)

It gets parsed to this: "☁️". Notice that this is a string of length 2. The first character corresponds to the emoji we want, e.g. "☁️".charCodeAt(0).toString(16) is "fe0f". The second corresponds to a nonsense character.

cawilson commented 10 years ago

In this case, the code point length of 2 is correct. The cloud emoji is represented by Unicode U+2601 (the character you took to be nonsense). This is then followed by a variation selector, U+FE0F in this case. According to the page you linked and the Unicode website (http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html) you can also supply the variation selector FE0E to get a text-style cloud (if supported). I've noticed you can omit the variation selector and the browser will still display the cloud, but you should probably supply it anyway.

You have to keep in mind that the length of a JavaScript string is really the number of UTF-16 code units, not the length of the visual (printable) string. I can see how this leads to confusion. Hopefully this will be easier to deal with in ECMAScript 6, which is supposed to support Unicode characters properly.
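
You can check this in a console:

var cloud = "\u2601\ufe0f";              // U+2601 CLOUD followed by U+FE0F VARIATION SELECTOR-16
cloud.length;                            // 2 code units, one visible glyph
cloud.charCodeAt(0).toString(16);        // "2601", the cloud itself
cloud.charCodeAt(1).toString(16);        // "fe0f", the emoji-style variation selector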

cawilson commented 10 years ago

I've performed some tests and I'm pleased to say that version 3 of the library has fixed this issue for me :+1: