twitchdev / issues

Issue tracker for third party developers.
Apache License 2.0

Emote index sometimes out of bounds for given message #104

Open RAnders00 opened 4 years ago

RAnders00 commented 4 years ago

Brief description: My service sometimes receives messages whose emotes= tag specifies indices that run past the end of the message string.

Examples:

More examples of messages that fail to parse correctly:

How to reproduce: Join a lot of channels and check each message for this condition (emote end index >= message length).
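The check described in the reproduction step can be sketched roughly as follows. This is a minimal sketch, not code from any of the linked projects: `parseEmotesTag` and `findOutOfBoundsEmotes` are hypothetical names, the tag layout (`id:start-end,start-end/id:...` with inclusive end indices) is assumed rather than taken from this thread, and the message length is counted in Unicode code points, which is also an assumption.

```javascript
// Assumed tag layout (not shown in this thread):
//   "<id>:<start>-<end>,<start>-<end>/<id>:<start>-<end>"
// with inclusive end indices.
function parseEmotesTag(tagValue) {
    if (!tagValue) return [];
    return tagValue.split("/").flatMap((entry) => {
        const [id, ranges] = entry.split(":");
        return ranges.split(",").map((range) => {
            const [start, end] = range.split("-").map(Number);
            return { id, start, end };
        });
    });
}

// Returns every emote range whose end index lies past the last character.
function findOutOfBoundsEmotes(message, tagValue) {
    // Assumption: length is measured in code points, not UTF-16 units or bytes.
    const length = [...message].length;
    return parseEmotesTag(tagValue).filter((emote) => emote.end >= length);
}
```

For example, `findOutOfBoundsEmotes("Då Kappa", "25:4-8")` returns one range, because the message is only 8 code points long and the last valid index is 7. As the report notes, a check like this only catches ranges that overrun the end of the message.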

Expected behavior: The emote indices are correct.

Screenshots: I believe I have even seen an effect of this in normal chat, without using the Chat API. The emotes are misplaced over the text: the space next to them is consumed and one letter is left standing there.

Additional context or questions: I suspect this has to do with the accents/Unicode characters people used in these messages. I have yet to see a malformed emotes tag without accents somewhere in the message. Also note that I can really only detect messages where the emote tag overruns the end of the message. There are probably many more malformed messages floating around in which the misaligned emote is not at the end of the message.

Also, https://github.com/robotty/dank-twitch-irc/issues/22 - Linked issue.

BarryCarlyon commented 4 years ago

https://discuss.dev.twitch.tv/t/wrong-format-text-on-request-api/25519 might be related?

pajlada commented 4 years ago

Ran into the same issue in https://github.com/gempir/go-twitch-irc/issues/140 - we didn't manage to capture the specific message that caused it, but it seems to have happened twice in the same message (so the emote indices were off by +2).

RAnders00 commented 4 years ago

Had to work around this issue today again: https://github.com/robotty/twitch-irc-rs/commit/6195c31d74d00b3e4e33e534c141779b7bbe3c57

lleadbet commented 3 years ago

Filed internally as MES-6178.

Xemdo commented 1 year ago

Explained further internally in SUBS-12389. Some background for those curious: the positions given are based on UTF-8 bytes and not on runes. This means e would have a length of 1 because it is the single UTF-8 byte 0x65, while è would have a length of 2 because its UTF-8 encoding is 0xC3 0xA8. Basically, you'll only ever see this issue occur when non-ASCII characters (for example, accented letters) are used.

Marking this as related to documentation, as this behavior needs to be documented rather than fixed. Fixing it would break pretty much every chat integration.
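To make the byte-versus-rune distinction concrete, here is a small sketch. The helper names `lengths` and `utf8OffsetOf` are hypothetical; `TextEncoder` is the standard global available in browsers and Node.js. It only illustrates how the three ways of counting differ, and makes no claim about which one the server actually uses.

```javascript
// Hypothetical helpers for illustration only.
const encoder = new TextEncoder();

// Length of `text` in the three units that commonly get confused.
function lengths(text) {
    return {
        utf16Units: text.length,                // what JavaScript string indexing uses
        codePoints: [...text].length,           // "runes"
        utf8Bytes: encoder.encode(text).length, // what a byte-based position would count
    };
}

// UTF-8 byte offset at which the code point with index `codePointIndex` starts.
function utf8OffsetOf(text, codePointIndex) {
    const prefix = [...text].slice(0, codePointIndex).join("");
    return encoder.encode(prefix).length;
}

console.log(lengths("e"));      // { utf16Units: 1, codePoints: 1, utf8Bytes: 1 }
console.log(lengths("\u00e8")); // { utf16Units: 1, codePoints: 1, utf8Bytes: 2 } (è)
console.log(utf8OffsetOf("pens\u00e9 que", 6)); // 7: "q" is code point 6 but starts at byte 7
```

Under a byte-based scheme, every non-ASCII character before an emote would push its reported position further to the right than its code point index.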

skyboy commented 10 months ago

> Explained further internally in SUBS-12389. Some background for those curious: the positions given are based on UTF-8 bytes and not on runes. This means e would have a length of 1 because it is the single UTF-8 byte 0x65, while è would have a length of 2 because its UTF-8 encoding is 0xC3 0xA8. Basically, you'll only ever see this issue occur when non-ASCII characters (for example, accented letters) are used.
>
> Marking this as related to documentation, as this behavior needs to be documented rather than fixed. Fixing it would break pretty much every chat integration.

First: no chat integration handles this error in a way that would break if it were fixed; every snippet of code I've seen simply avoids a crash from indexing an array out of bounds. You can see the general way people deal with it at this line: https://github.com/gempir/go-twitch-irc/commit/8310a10fee4f16800ac9fc941bef20aeb05820a5#diff-f2573c8adbe08fce0a949dfbdb9f8653ed0e80b1ee5d9562e885e9a03357c302R178

There, they label the expected output of their test cases (which, to note, would not fail if you fixed this), yet none of the emote codes listed in the test cases are correct, and no effort is made to adjust the emote indices to be correct.
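The kind of workaround described here, avoiding the crash without correcting the indices, might look roughly like this. It is a hypothetical sketch and not the code from the linked commit; `sliceEmoteClamped` is an invented name, and indexing by code point is an assumption.

```javascript
// Hypothetical helper: clamps the range so slicing can never go out of bounds.
// It does NOT fix the indices; it only avoids the crash.
function sliceEmoteClamped(message, start, end) {
    const chars = [...message]; // index by code point
    const from = Math.max(0, Math.min(start, chars.length));
    const to = Math.max(from, Math.min(end + 1, chars.length)); // `end` is inclusive
    return chars.slice(from, to).join("");
}

// The reported range 40-44 overruns this 44-character message by one.
console.log(sliceEmoteClamped("Då kan du begära skadestånd och förtal Kappa", 40, 44));
// prints "appa"
```

The call no longer fails, but the extracted text is "appa" rather than "Kappa", so a fix on the server side would not break code written this way.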


After extensive testing, this description of the error is either woefully incomplete or entirely misguided. What actually appears to be happening is that several AutoMod actions offset the returned indices by 1, likely by including one extra character in their message fragment. This is entirely unrelated to UTF-8 encoding at our end, since that would generate offsets larger than 1. What's worse, some actions do not do this, and one I found actually breaks the emotes that occur after it, apparently by deleting its message fragment, thereby shifting all indices to the left by the UTF-8 length of that fragment, yet still allowing it to be emitted as a response. This last one does not induce any off-by-one errors, however.

I have some test cases I've accumulated from projects' test cases to avoid regressing, and a snippet of incomplete ES6 code that demonstrates what I described by correcting the indices:


console.log(
[
    [ // this first example demonstrates clearly that UTF-8 encoding is not to blame, because nearly every character here is 2 bytes. the emote is off by: 1
        // message text
        "Я не такой красивый. Не урод, но до тебя далеко LUL",

        { // message flags
            s: 24,
            e: 28,
            t: 'A.3'
        },
        // an emote
        { s: 49, e: 51 }
    ],
    [
        "Då kan du begära skadestånd och förtal Kappa",

        { // flags
            s: 17,
            e: 26,
            t: 'S.6'
        },
        { s: 40, e: 44 }
    ],
    [
        "pensé que no habría directo que bueno que si staryuukiLove staryuukiLove staryuukiLove staryuukiBits",

        { // flags
            s: 0,
            e: 4,
            t: 'S.6'
        },
        { s: 46, e: 58 }
    ],
    [ // this is a malformed message i found in one project's test case; that it works proves the point about UTF-8 not being the primary problem
        "╔ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne╔",

        { // flags
            s: 22,
            e: 27,
            t: 'S.5'
        },
        { s: 30, e: 39 }
    ],
    [ // correctly formed of above
        "\u0001ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne\u0001",

        { // flags
            s: 22,
            e: 27,
            t: 'S.5'
        },
        { s: 30, e: 39 }
    ]
]. // we run a reducer to accumulate all of the results into one string
    reduce(
        (a, v) => a + (
            // the actual meat
            (msg, flg, emt) => {
                // spread the flag's message fragment into Unicode codepoints/codepairs and get the charcode (breaks on surrogate pairs, but no example messages have them)
                // the end index of JS functions is exclusive, so must be offset by 1
                let fMsg = ([...msg.slice(flg.s, flg.e + 1)].map(v=>v.charCodeAt(0))),
                    // default offset length: 1
                    oLen = 1;

                // this one actually does something strange
                if ( flg.t == 'S.5') {
                    // run the message through a reducer that will count (down) the UTF-8 encoding length of the flag's message fragment
                    oLen = fMsg.reduce(
                        (acc, val) => (
                            acc - (
                                val <= (1 <<  7) - 1 ? 1 : // encoded as:                               0xxx xxxx --  7 bits
                                val <= (1 << 11) - 1 ? 2 : // encoded as:                     110x xxxx 10xx xxxx -- 11 bits
                                val <= (1 << 16) - 1 ? 3 : // encoded as:           1110 xxxx 10xx xxxx 10xx xxxx -- 16 bits
                                /*     (1 << 21) - 1*/ 4   // encoded as: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx -- 21 bits
                            )
                        ),
                        // initial value of `acc`
                        0 
                    );
                }

                return ( ( (fsg, usg) => {
                    const ePair = emt.s + '-' + emt.e, uPadding = " ".repeat(emt.s - 1), oPadding = " ".repeat(emt.s - oLen - 1);
                    return `Emote indices: ${ePair}
Offset Length: ${oLen}

UTF-8.Enc Message: "${usg}" (len: ${usg.length})
UTF-8.Enc Emote:   "${uPadding}^${usg.slice(emt.s, emt.e + 1)}^" ;

UTF-8ish Message:  "${fsg}" (len: ${fsg.length})
UTF-8ish Emote:    "${uPadding}^${fsg.slice(emt.s, emt.e + 1)}^" ;

Original Message:  "${msg}" (len: ${msg.length})
Original Emote:    "${uPadding}^${msg.slice(emt.s, emt.e + 1)}^" ;

Original Message:  "${msg}" (len: ${msg.length})
Orig.Offset Emote: "${oPadding}^${msg.slice(emt.s - oLen, emt.e - oLen + 1)}^"\n\n\n`
                })(
                    // encode the flag fragment with arbitrary easy-to-pick-out replacement characters indicating byte length
                    msg.slice(0, flg.s).
                        // these values are the same as the accumulator for S.5 above, just very messy.
                        // it had been like 6 hours of characterizing the issue, and this part isn't descriptive of the issue; merely demonstrative.
                        concat(fMsg.map(v=>v>127?v>0x7ff?v>0xffff?'\xa1\xa1\xa1\xa1':'\xbf\xbf\xbf':'\xb7\xb7':String.fromCharCode(v)).join('')).
                        concat(msg.slice(flg.e + 1, msg.length)),
                    // do it to the whole message
                    ([...msg].map(v=>v.charCodeAt(0))).map(v=>v>127?v>0x7ff?v>0xffff?'\xa1\xa1\xa1\xa1':'\xbf\xbf\xbf':'\xb7\xb7':String.fromCharCode(v)).join('')
                ) )
            }
        )(...v),
        ""
    )
)

With an expected output of:

Emote indices: 49-51
Offset Length: 1

UTF-8.Enc Message: "·· ···· ·········· ················. ···· ········, ···· ···· ········ ············ LUL" (len: 87)
UTF-8.Enc Emote:   "                                                ^·, ^" ;

UTF-8ish Message:  "Я не такой красивый. Не ········, но до тебя далеко LUL" (len: 55)
UTF-8ish Emote:    "                                                ^ко ^" ;

Original Message:  "Я не такой красивый. Не урод, но до тебя далеко LUL" (len: 51)
Original Emote:    "                                                ^UL^" ;

Original Message:  "Я не такой красивый. Не урод, но до тебя далеко LUL" (len: 51)
Orig.Offset Emote: "                                               ^LUL^"

Emote indices: 40-44
Offset Length: 1

UTF-8.Enc Message: "D·· kan du beg··ra skadest··nd och f··rtal Kappa" (len: 48)
UTF-8.Enc Emote:   "                                       ^al Ka^" ;

UTF-8ish Message:  "Då kan du begära skadest··nd och förtal Kappa" (len: 45)
UTF-8ish Emote:    "                                       ^Kappa^" ;

Original Message:  "Då kan du begära skadestånd och förtal Kappa" (len: 44)
Original Emote:    "                                       ^appa^" ;

Original Message:  "Då kan du begära skadestånd och förtal Kappa" (len: 44)
Orig.Offset Emote: "                                      ^Kappa^"

Emote indices: 46-58
Offset Length: 1

UTF-8.Enc Message: "pens·· que no habr··a directo que bueno que si staryuukiLove staryuukiLove staryuukiLove staryuukiBits" (len: 102)
UTF-8.Enc Emote:   "                                             ^ staryuukiLov^" ;

UTF-8ish Message:  "pens·· que no habría directo que bueno que si staryuukiLove staryuukiLove staryuukiLove staryuukiBits" (len: 101)
UTF-8ish Emote:    "                                             ^staryuukiLove^" ;

Original Message:  "pensé que no habría directo que bueno que si staryuukiLove staryuukiLove staryuukiLove staryuukiBits" (len: 100)
Original Emote:    "                                             ^taryuukiLove ^" ;

Original Message:  "pensé que no habría directo que bueno que si staryuukiLove staryuukiLove staryuukiLove staryuukiBits" (len: 100)
Orig.Offset Emote: "                                            ^staryuukiLove^"

Emote indices: 30-39
Offset Length: -7

UTF-8.Enc Message: "¿¿¿ACTION A LOJA AINDA N··O EST·· PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne¿¿¿" (len: 94)
UTF-8.Enc Emote:   "                             ^T·· PRONTA^" ;

UTF-8ish Message:  "╔ACTION A LOJA AINDA N··O ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne╔" (len: 89)
UTF-8ish Emote:    "                             ^ PRONTA Bi^" ;

Original Message:  "╔ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne╔" (len: 88)
Original Emote:    "                             ^PRONTA Bib^" ;

Original Message:  "╔ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne╔" (len: 88)
Orig.Offset Emote: "                                    ^BibleThump^"

Emote indices: 30-39
Offset Length: -7

UTF-8.Enc Message: "�ACTION A LOJA AINDA N··O EST·· PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne�" (len: 90)
UTF-8.Enc Emote:   "                             ^· PRONTA B^" ;

UTF-8ish Message:  "�ACTION A LOJA AINDA N··O ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne�" (len: 89)
UTF-8ish Emote:    "                             ^ PRONTA Bi^" ;

Original Message:  "�ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne�" (len: 88)
Original Emote:    "                             ^PRONTA Bib^" ;

Original Message:  "�ACTION A LOJA AINDA NÃO ESTÁ PRONTA BibleThump , AGUARDE... NOVIDADES EM BREVE FortOne�" (len: 88)
Orig.Offset Emote: "                                    ^BibleThump^"

The first example message that ruled out so many options was found here: https://github.com/robotty/twitch-irc-rs/commit/6195c31d74d00b3e4e33e534c141779b7bbe3c57#diff-c921cf3d3eae3188afa513e79fe5d0473840bc687eac22f8abe31a9a397d7426R392
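Stripped of its diagnostic output, the correction that the snippet above demonstrates reduces to something like the following. This is a condensed sketch under the same assumptions as the demonstration: `correctedEmoteText` and `utf8Length` are hypothetical names, the rule is derived only from the test cases in this comment, it handles a single flag per message, and like the original it indexes by UTF-16 unit and so breaks on surrogate pairs.

```javascript
// UTF-8 byte length of a string (TextEncoder is a standard global).
const utf8Length = (text) => new TextEncoder().encode(text).length;

// `flag` is { s, e, t } and `emote` is { s, e }, both with inclusive end
// indices, matching the shape of the test cases above.
function correctedEmoteText(message, flag, emote) {
    // 'S.5' is treated as in the demonstration: the reported indices are
    // moved right by the UTF-8 length of the flagged fragment. Every other
    // flag type is assumed to push the reported indices one too far right.
    const shift = flag.t === "S.5"
        ? -utf8Length(message.slice(flag.s, flag.e + 1))
        : 1;
    return message.slice(emote.s - shift, emote.e - shift + 1);
}

console.log(correctedEmoteText(
    "Я не такой красивый. Не урод, но до тебя далеко LUL",
    { s: 24, e: 28, t: "A.3" },
    { s: 49, e: 51 }
)); // prints "LUL"
```

Whether these two rules hold beyond the listed messages is untested here; the sketch only restates what the demonstration computes.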