Closed olback closed 4 years ago
I see. Was this from Twitch, or were you parsing custom messages?
I should look into issue templates. All I really need is the input data (or a sample), the problem, and the crate version. It's in the tags, which is where I most recently added string escaping. I'll see where it's getting the indices wrong. I also assume it was the display-name tag.
Yes, this was from Twitch. I don't really know how a non-ASCII character even gets into the display-name, as it has to be the same as your username, just with different capitalization.
When titling the issue I assumed that all tags are parsed the same way; the issue is with display-name, though.
Edit: Managed to get the user-id from one of the crashing messages. Here is a partial response from the /helix/users endpoint for the user in question:
{
"data": [
{
"id": "86293428",
"login": "yuebing233",
"display_name": "月饼",
...
}
]
}
display-name can be localized (e.g. non-ASCII scripts). It's a newish feature (perhaps a year old now) that allows non-western users to provide a native name.
Looking at the code, I don't know if I can fix this in 0.14.x. There is a fatal flaw here: https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tag_indices.rs#L39-L40
I would need to calculate the byte offset of all further 'chars' which is something I don't really want to do.
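To illustrate the mismatch, here is a minimal sketch, not the crate's actual code. It assumes the flaw is that positions counted in chars end up being used as byte indices; the helper `byte_offset` and the sample tag string are made up for the example.

```rust
/// Byte offset of the `n`-th char in `s`, or `s.len()` if `n` is past the end.
/// (Hypothetical helper, not part of twitchchat.)
fn byte_offset(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map_or(s.len(), |(i, _)| i)
}

fn main() {
    let tags = "display-name=月饼;user-id=86293428";

    // The same ';' sits at different positions depending on what is counted.
    let char_pos = tags.chars().position(|c| c == ';').unwrap();
    let byte_pos = tags.find(';').unwrap();
    println!("char position: {char_pos}, byte position: {byte_pos}");

    // Using char positions as byte indices lands inside a multi-byte char.
    // `get` returns None here; `&tags[13..char_pos]` would panic instead.
    assert!(tags.get(13..char_pos).is_none());

    // Converting each char position to a byte offset gives a valid slice.
    let start = byte_offset(tags, 13);
    let end = byte_offset(tags, char_pos);
    println!("display-name = {}", &tags[start..end]);
}
```

Every index after the first multi-byte character is shifted, which is why all later offsets would need recalculating.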
I kind of want to get rid of the whole super-cheap indices approach. Currently, each message uses a single allocation and exposes its 'sub-strings' as indices that refer back to that single &str/String. But getting rid of this and just moving to a naive 'struct of individual &str/String' would probably break semver.
This would allow me to use UTF-8-aware splitting without having to be careful about boundaries myself; the std library would provide that. I have quite a bit already pushed for the 0.15.x branch (https://github.com/museun/twitchchat/pull/226). I can take this into consideration, but I'm still looking at providing one last 0.14.x release. I'm going to think about a way of not breaking semver when changing the internal memory representation of the types.
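For comparison, a rough sketch of the two layouts being discussed. All type and function names here (`IndexedTag`, `NaiveTag`, `parse_indexed`, `parse_naive`) are hypothetical and only illustrate the idea; they are not the crate's types.

```rust
use std::ops::Range;

/// Index-based layout: one owned buffer, fields stored as byte ranges into it.
struct IndexedTag {
    raw: String,
    key: Range<usize>,
    value: Range<usize>,
}

impl IndexedTag {
    fn key(&self) -> &str {
        &self.raw[self.key.clone()]
    }
    fn value(&self) -> &str {
        &self.raw[self.value.clone()]
    }
}

/// Naive layout: each field is its own slice, and std picks the boundaries.
#[derive(Debug, PartialEq)]
struct NaiveTag<'a> {
    key: &'a str,
    value: &'a str,
}

fn parse_indexed(raw: &str) -> Option<IndexedTag> {
    // `find` returns a byte offset, so the ranges stay on char boundaries.
    let eq = raw.find('=')?;
    Some(IndexedTag {
        raw: raw.to_string(),
        key: 0..eq,
        value: eq + 1..raw.len(),
    })
}

fn parse_naive(raw: &str) -> Option<NaiveTag<'_>> {
    // `split_once` hands back valid sub-slices; no index bookkeeping needed.
    let (key, value) = raw.split_once('=')?;
    Some(NaiveTag { key, value })
}

fn main() {
    let indexed = parse_indexed("display-name=月饼").unwrap();
    println!("{} -> {}", indexed.key(), indexed.value());

    let naive = parse_naive("display-name=月饼").unwrap();
    println!("{} -> {}", naive.key, naive.value);
}
```

The index-based version is only correct as long as every stored range is a byte range on a char boundary; the naive version gets that guarantee from the standard library.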
A bit more thinking, I can just change the tags representation -- it already has to allocate a boxed slice: https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tag_indices.rs#L8-L10
I can just make this a Box<[(Cow<'a, str>, Cow<'a, str>)]> internally and it wouldn't change the public API. With the indices removed, each tag key and value is a real string slice, so it always falls on character (code point/scalar) boundaries.
I would basically just remove the indices and make https://github.com/museun/twitchchat/blob/1463334dcbbf64fddc8a7f3d72a2358d5295eb8c/src/irc/tags.rs#L9-L13 simpler.
I don't think I expose most of this to the user so it shouldn't be breaking.
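A minimal sketch of what such a representation could look like, assuming standard IRCv3 tag escaping. The `Tags` type, its methods, and `unescape` are stand-ins written for this example, not the crate's actual internals.

```rust
use std::borrow::Cow;

/// Undo IRCv3 tag-value escaping, borrowing when nothing needs to change.
fn unescape(value: &str) -> Cow<'_, str> {
    if !value.contains('\\') {
        return Cow::Borrowed(value);
    }
    let mut out = String::with_capacity(value.len());
    let mut chars = value.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some(':') => out.push(';'),
            Some('s') => out.push(' '),
            Some('\\') => out.push('\\'),
            Some('r') => out.push('\r'),
            Some('n') => out.push('\n'),
            Some(other) => out.push(other), // unknown escape: drop the backslash
            None => {}                      // trailing backslash is dropped
        }
    }
    Cow::Owned(out)
}

/// Hypothetical stand-in for the internal tag storage: no indices, just pairs.
struct Tags<'a> {
    inner: Box<[(Cow<'a, str>, Cow<'a, str>)]>,
}

impl<'a> Tags<'a> {
    fn parse(input: &'a str) -> Self {
        let inner = input
            .trim_start_matches('@')
            .split(';')
            .filter(|pair| !pair.is_empty())
            .map(|pair| {
                // A tag without '=' is treated as having an empty value.
                let (k, v) = pair.split_once('=').unwrap_or((pair, ""));
                (Cow::Borrowed(k), unescape(v))
            })
            .collect::<Vec<_>>()
            .into_boxed_slice();
        Self { inner }
    }

    fn get(&self, key: &str) -> Option<&str> {
        self.inner
            .iter()
            .find(|(k, _)| &**k == key)
            .map(|(_, v)| &**v)
    }
}

fn main() {
    let tags = Tags::parse("@display-name=月饼;user-id=86293428;msg=hello\\sworld");
    println!("{:?}", tags.get("display-name"));
    println!("{:?}", tags.get("msg"));
}
```

Only values that actually contain escapes allocate; everything else stays borrowed from the input, so the cost over the index approach should be small.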
Looks great.
display-name can be localized (e.g. non-ASCII scripts). It's a newish feature (perhaps a year old now) that allows non-western users to provide a native name.
Ah, neat.
I found a workaround. It's not ideal, but it's transparent for the most part.
I've published 0.14.6, which fixes this problem.
Thank you!
Here's the panic message:
Version: 0.14.5
cargo --version -v:
rustc --version -v:
Side note: Consider adding issue templates to make it easier to report bugs/make feature requests :+1: