Open crusse54 opened 2 years ago
Interesting, invalid UTF-8 characters are not handled properly and are replaced with an ASCII "?" instead of with U+FFFD. I will look into a different API that handles that correctly, or a lower-level API that gives me more control over how invalid chars are processed.
Alternatively, I might also find a way to directly operate on the strings without the mapping. Let me know if you also have any ideas.
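For reference, one lower-level option in the JDK is `CharsetEncoder`, which lets the caller choose the replacement bytes. The following is only a sketch of that idea, not the fix that was adopted here; the class and method names (`ReplacementEncoding`, `encodeWithReplacement`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplacementEncoding {
    // UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER).
    private static final byte[] REPLACEMENT = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Hypothetical helper: encodes to UTF-8, writing U+FFFD (3 bytes)
    // for unpaired surrogates instead of the default single-byte '?'.
    static byte[] encodeWithReplacement(String input) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(REPLACEMENT);
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(input));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected, since both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "a\uD800b"; // lone high surrogate between two ASCII chars
        System.out.println(input.getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println(encodeWithReplacement(input).length);           // 5
    }
}
```

With this approach the byte array length agrees with a mapping that counts an unpaired surrogate as 3 bytes.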
We also encountered this bug. When is a fix expected?
A fix will be released in the coming days.
`final byte[] utf8bytes = input.getBytes(StandardCharsets.UTF_8)` encodes the unknown character as a single-byte ASCII question mark. When `Util.utf8ByteIndexesMapping(input, bytesLength);` creates the `byteIndexes` array of size `bytesLength` (the size of `utf8bytes`), the array is 2 elements smaller than it should be, since the mapping treats the unknown character as 3 bytes. The set of if statements then references the original string and decides that the unknown character is 3 bytes, filling 3 array slots with the character index. Eventually `Arrays.fill` goes out of bounds and an exception is thrown.
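The size mismatch can be reproduced in isolation. This is a minimal sketch, not the library's code: `naiveUtf8Length` is a hypothetical stand-in for the per-character byte counting described above, and it assumes that counting treats an unpaired surrogate as 3 bytes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthMismatch {
    // Hypothetical helper: counts UTF-8 bytes by inspecting the chars of the
    // original string, treating an unpaired surrogate as a 3-byte character.
    static int naiveUtf8Length(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                total += 1;
            } else if (c < 0x800) {
                total += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                total += 4; // valid surrogate pair: one 4-byte code point
                i++;        // skip the low surrogate
            } else {
                total += 3; // BMP character, or an unpaired surrogate
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String input = "a\uD800b"; // lone high surrogate
        int encoded = input.getBytes(StandardCharsets.UTF_8).length;
        int counted = naiveUtf8Length(input);
        // getBytes writes '?' (1 byte) for the lone surrogate, so the
        // array sized from it is 2 slots short of what the counting expects.
        System.out.println(encoded + " vs " + counted); // 3 vs 5
    }
}
```

Any array sized from `getBytes(...).length` but filled according to the second count will overrun by 2 slots per unpaired surrogate.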