Crate panics when IRC Tag contain non ascii characters

olback commented 4 years ago

Here's the panic message:

thread 'main' panicked at 'byte index 96 is not a char boundary; it is inside '甲' (bytes 94..97) of `@badge-info=;badges=;client-nonce=5a0d93af08a7aec23672b859e0d1aa88;color=#00FF7F;display-name=甲森;emotes=;flags=;id=746b7296-ce51-4198-9002-5021ba9080d9;mod=0;room-id=71092938;subscriber=0;tmi-sent-ts=1604198943834;turbo=0;user-id=72760962;user-type= :g`[...]', /home/olback/.cargo/registry/src/github.com-1ecc6299db9ec823/twitchchat-0.14.5/src/irc/tag_indices.rs:91:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Version: 0.14.5

cargo --version -v:

cargo 1.47.0 (f3c7e066a 2020-08-28)
release: 1.47.0
commit-hash: f3c7e066ad66e05439cf8eab165a2de580b41aaf
commit-date: 2020-08-28

rustc --version -v:

rustc 1.47.0 (18bf6b4f0 2020-10-07)
binary: rustc
commit-hash: 18bf6b4f01a6feaf7259ba7cdae58031af1b7b39
commit-date: 2020-10-07
host: x86_64-unknown-linux-gnu
release: 1.47.0
LLVM version: 11.0

Side note: Consider adding issue templates to make it easier to report bugs/make feature request :+1:

museun commented 4 years ago

I see, was this from Twitch or were your parsing custom messages?

I should look into issue templates, all I really need is the input data (or sample), the problem and the crate version. Its in the tags which I last added string escaping for. I'll see where its getting the indices wrong. I also assume it was the display-name tag.

olback commented 4 years ago

Yes, this was from Twitch. I don't really know how a non ascii character even gets into the display-name as it has to be the same as your username, just different capitalization.

When titling the issue I assumed that all tags are parsed the same way, the issue is with display-name though.

Edit: Managed to get the user-id from one of the crashing messages, here is a partial response from the /helix/users endpoint with user in question:

{
  "data": [
    {
      "id": "86293428",
      "login": "yuebing233",
      "display_name": "月饼",
      ...
    }
  ]
}

museun commented 4 years ago

display-name can be localized (e.g. non-ascii scripts). Its a newish feature (perhaps a year now) that allows non-western users to provide a native name.

Looking at the code, I don't know if I can fix this in 0.14.x. There is a fatal flaw here: https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tag_indices.rs#L39-L40

I would need to calculate the byte offset of all further 'chars' which is something I don't really want to do.

I kind of want to get rid of the the whole super-cheap indices approach. Currently, all of the messages, each, use a single allocation and then provide their 'sub-strings' as indices. These indices refer back to the single &str/String. But getting rid of this and just moving to a naive 'struct of indv. &str/String would probably break the semver.

This'd allow me to use utf-8 aware splitting without having to really be considerate of boundaries -- let the std library provide that. I have quite a bit already pushed for the 0.15.x branch (https://github.com/museun/twitchchat/pull/226). I can take this into consideration, but I'm still looking at providing one last 0.14.x release. I'm going to think about a way of not breaking the semver when changing all of the internal memory representation of the types.

museun commented 4 years ago

A bit more thinking, I can just change the tags representation -- it already has to allocate a boxed slice: https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tag_indices.rs#L8-L10

I can just make this a Box<[(Cow<'a, str>, Cow<'a, str>)]> internally and it wouldn't change the public API. This would ensure the tag and its index (now removed) are always using the same character (code point/scalar) boundaries.

I would basically just remove the indices and make https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tags.rs#L9-L13 simpler.

I don't think I expose most of this to the user so it shouldn't be breaking.

olback commented 4 years ago

Looks great.

display-name can be localized (e.g. non-ascii scripts). Its a newish feature (perhaps a year now) that allows non-western users to provide a native name.

Ah, neat.

museun commented 4 years ago

I found a workaround, its not ideal but its transparent for the most part.

I've published 0.14.6 which fixes this problem.

olback commented 4 years ago

Thank you!

museun / twitchchat

Crate panics when IRC Tag contain non ascii characters #228