Open hiredman opened 4 years ago
For context: Java and JavaScript both represent strings as UTF-16 (or UCS-2, or whatever), which uses surrogate pairs (two 16-bit code units) to encode characters outside the Basic Multilingual Plane (BMP).
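For example (a standalone snippet of my own, not project code), a supplementary-plane emoji reports a length of 2 in Java because it occupies two UTF-16 code units:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "😀"; // U+1F600, outside the BMP
        // String.length() counts UTF-16 code units, not code points.
        System.out.println(s.length());                      // 2 (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```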
The client socket.io code JSON-encodes the data being sent to the server. JSON is transmitted as UTF-8, which encodes a character outside the BMP as a single four-byte sequence (one code point).
So the client counts UTF-16 code units, JSON-encodes as UTF-8, and then the server tries to figure out how many bytes that character count covers by counting characters. The problem is that you need to walk UTF-8 sequences (the JSON encoding) but count them as UTF-16 code units (how JS and Java count them), so each four-byte UTF-8 sequence has to count as two.
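A minimal sketch of that idea (my own code, assuming valid UTF-8 input; not the project's scanner): walk the UTF-8 bytes by lead-byte width, but count in UTF-16 code units, treating each four-byte sequence as two units:

```java
import java.nio.charset.StandardCharsets;

public class Utf16LengthOverUtf8 {
    // Walk UTF-8 encoded bytes, but report length in UTF-16 code units:
    // each 4-byte UTF-8 sequence (a supplementary character) counts as 2.
    static int utf16Length(byte[] utf8) {
        int units = 0;
        int i = 0;
        while (i < utf8.length) {
            int b = utf8[i] & 0xFF;
            int seqLen;
            if (b < 0x80)      seqLen = 1; // ASCII
            else if (b < 0xE0) seqLen = 2; // lead byte 110xxxxx
            else if (b < 0xF0) seqLen = 3; // lead byte 1110xxxx
            else               seqLen = 4; // lead byte 11110xxx
            units += (seqLen == 4) ? 2 : 1; // supplementary chars need a surrogate pair
            i += seqLen;
        }
        return units;
    }

    public static void main(String[] args) {
        String s = "a😀b";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf16Length(utf8)); // 4, matching s.length()
    }
}
```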
We noticed this problem because some messages containing newer emojis (outside the BMP) would cause errors when the packet was being decoded. In our code I have used reflection to monkey-patch in our own character counter and have not encountered any more errors.
I would submit a patch, but on a fresh checkout I can't get the tests for the project to pass.
I refactored the code a bit. The tests are ignored now, since they no longer match the currently supported version of the protocol.
This still seems to be an issue. The base64 encoding of the UTF-8 bytes of the string we use to test this at work is:
fvCdmIjhuIbwnZai8J2Vr9mk4bie1I3QnceP8J2ZhcaY1LjispjwnZmJ4KemzqHwnZekyYzwnZOiyJrQpvCdkrHRoPCdk6fGs8ik0afhlq/Eh/Cdl7Hhu4XwnZGT8J2ZnOGCufCdnrLwnZGX8J2SjMS84bmDxYnQvvCdno7wnZKS4bWy6pyx8J2ZqeG7q/Cdl4/FtfCdkpnwnZKaxboxMjM0NTY3ODkwIUAjJCVeJiooKS1fPStbe119OzonLDwuPmZvbyDwn6Sq8J+YjdCh0YrQtdGI0Ywg0LbQtSDQtdGJ0ZEg0Y3RgtC40YUg0LzRj9Cz0LrQuNGFINGE0YDQsNC90YbRg9C30YHQutC40YUg0LHRg9C70L7QuiDQtNCwINCy0YvQv9C10Lkg0YfQsNGOICAg44Kk44Ot44OP44OL44Ob44OY44OIIOODgeODquODjOODq+ODsiDjg6/jgqvjg6jjgr/jg6zjgr0g44OE44ON44OK44Op44Og4Zqg4ZuH4Zq74Zur4ZuS4Zum4Zqm4Zur4Zqg4Zqx4Zqp4Zqg4Zqi4Zqx4Zur4Zqg4ZuB4Zqx4Zqq4Zur4Zq34ZuW4Zq74Zq54Zum4Zua4Zqz4Zqi4ZuXIOCur+CuvuCuruCuseCuv+CuqOCvjeCupCDgrq7gr4rgrrTgrr/grpXgrrPgrr/grrLgr4cg4K6k4K6u4K6/4K604K+N4K6u4K+K4K604K6/IOCuquCvi+CusuCvjSDgrofgrqngrr/grqTgrr7grrXgrqTgr4Eg4K6O4K6Z4K+N4K6V4K+B4K6u4K+NIOCuleCuvuCuo+Cvi+CuruCvjSxlbmQg4LKs4LK+IOCyh+CysuCzjeCysuCyvyDgsrjgsoLgsq3gsrXgsr/gsrjgs4Eg4LKH4LKC4LKm4LOG4LKo4LON4LKoIOCyueCzg+CypuCyr+CypuCysuCyvyDvu7/gpJXgpL7gpJrgpIIg4KS24KSV4KWN4KSo4KWL4KSu4KWN4KSv4KSk4KWN4KSk4KWB4KSu4KWNIOClpCDgpKjgpYvgpKrgpLngpL/gpKjgpLjgpY3gpKTgpL8g4KSu4KS+4KSu4KWNIOClpSAu2YXZhiDZhduMINiq2YjYp9mG2YUg2KjYr9mI2YbZkCDYp9it2LPYp9izINiv2LHYryDYtNmK2LTZhyDYqNiu2YjYsdmF
We currently go in with reflection and replace the instance of UTF8CharsScanner with our own subclass, which implements the counting as a sort of abstract machine. https://github.com/worldsingles/netty-socketio/commit/422ddf267481442e143927e2e6bbc9f62e532cf7 is a Java port of the algorithm we are using (our production code is written in Clojure).
We've been doing this for around two years now, and it seems to work.
This is a list of issues that might be/have been the result of the miscounting:
Hi, I believe this line incorrectly counts a 4-byte character as a single character instead of two surrogate characters: https://github.com/mrniko/netty-socketio/blob/bed2a3b05d68a5ac84eba32c59d410c0f88dc0d7/src/main/java/com/corundumstudio/socketio/protocol/UTF8CharsScanner.java#L93
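A hedged sketch of the fix (I haven't verified it against the scanner's internals): wherever the lead byte signals a four-byte UTF-8 sequence, the character count should advance by two rather than one. The surrogate-pair width is exactly what `Character.charCount` reports for a code point:

```java
public class CharCountDemo {
    public static void main(String[] args) {
        // Character.charCount: UTF-16 code units needed for a code point.
        System.out.println(Character.charCount('A'));     // 1 (BMP character)
        System.out.println(Character.charCount(0x1F600)); // 2 (supplementary: surrogate pair)
    }
}
```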