Scanner.scan utf8ByteIndexesMapping Array Out Of Bounds

crusse54 commented 2 years ago

final byte[] utf8bytes = input.getBytes(StandardCharsets.UTF_8) encodes the unknown character as a single-byte ASCII question mark. When Util.utf8ByteIndexesMapping(input, bytesLength) then creates the byteIndexes array of size bytesLength (the length of utf8bytes), the array is 2 elements smaller than the mapping needs, because the mapping treats the unknown character as 3 bytes. The chain of if statements inspects the original string, decides that the unknown character is 3 bytes long, and fills 3 array slots with the character index. Eventually Arrays.fill runs past the end of the array and an exception is thrown.
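A minimal sketch of the size mismatch, assuming the "unknown character" is an unpaired surrogate (a case where String.getBytes substitutes "?"). The class LoneSurrogateDemo and the method naiveUtf8Length are hypothetical stand-ins for the per-character size check described above, not the library's actual code.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    // Hypothetical per-char size check: anything at or above 0x800 that is
    // not part of a valid surrogate pair is counted as 3 bytes.
    static int naiveUtf8Length(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                total += 1;
            } else if (c < 0x800) {
                total += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                total += 4; // valid surrogate pair, one 4-byte code point
                i++;
            } else {
                total += 3; // a lone surrogate also lands here
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String input = "a\uD800b"; // lone high surrogate between two ASCII chars
        byte[] utf8bytes = input.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8bytes.length);       // 3: the surrogate became '?'
        System.out.println(naiveUtf8Length(input)); // 5: the surrogate counted as 3 bytes
    }
}
```

An array sized from utf8bytes.length (3) cannot hold the 5 slots that the per-character logic tries to fill, which matches the "2 smaller than it should be" observation.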

gliwka commented 11 months ago

Interesting. Characters that cannot be encoded as valid UTF-8 are not handled properly: they are replaced with an ASCII "?" instead of U+FFFD. I will look into a different API that handles this correctly, or a lower-level API that gives me more control over how invalid characters are processed.
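One possible lower-level route, sketched here as an assumption rather than the fix that was actually shipped: java.nio.charset.CharsetEncoder lets the caller choose the replacement bytes, so malformed input can be encoded as U+FFFD (EF BF BD, 3 bytes) instead of "?". The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplacementUtf8 {
    // UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER
    private static final byte[] FFFD = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Hypothetical helper: encode to UTF-8, replacing malformed input
    // (e.g. unpaired surrogates) with U+FFFD instead of '?'.
    static byte[] encode(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(FFFD);
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(s));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but encode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("a\uD800b");
        System.out.println(bytes.length); // 5: 'a', EF BF BD, 'b'
    }
}
```

With this encoding the byte length agrees with a mapping that counts a lone surrogate as 3 bytes. Switching both actions to CodingErrorAction.REPORT would instead reject such input with an exception.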

gliwka commented 11 months ago

Alternatively, I might also find a way to directly operate on the strings without the mapping. Let me know if you also have any ideas.

yenuka78 commented 9 months ago

We also encountered this bug. When is a fix expected?

gliwka commented 9 months ago

A fix will be released in the coming days.

gliwka / hyperscan-java

Scanner.scan utf8ByteIndexesMapping Array Out Of Bounds #170