getTagPosition not correct with multi-byte unicode

oblac / jodd-lagarto

Java HTML parsers suite.

https://lagarto.jodd.org

BSD 2-Clause "Simplified" License


getTagPosition not correct with multi-byte unicode #21

Closed RXminuS closed 2 years ago

RXminuS commented 2 years ago

I'm still digging into exactly where this happens, but it seems that multi-byte characters aren't correctly translated into byte positions in the source. For instance, if you insert something like 👨‍👨‍👦‍👦 (4 bytes) into your HTML, you'll see that all the tag positions following this text are offset by the number of bytes missed for that single character (so 3 bytes off from that point on).
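To make this kind of mismatch concrete, here is a minimal Java sketch of how the same tag can have three different "positions" depending on the unit used. It does not call Lagarto at all: `OffsetDemo` and `charIndexToUtf8ByteOffset` are hypothetical names invented for illustration, and it assumes the reported position is a UTF-16 `char` index, which this thread does not confirm. A single 4-byte emoji (U+1F466) is used instead of the full family sequence to keep the arithmetic simple.

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {

    // Hypothetical helper (not part of Lagarto): converts a UTF-16 char index
    // into a UTF-8 byte offset by encoding everything before that index.
    // Assumes charIndex does not fall in the middle of a surrogate pair.
    static int charIndexToUtf8ByteOffset(String source, int charIndex) {
        return source.substring(0, charIndex)
                     .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1F466 is one code point, two UTF-16 chars, four UTF-8 bytes.
        String html = "<p>\uD83D\uDC66</p><b>x</b>";

        int charIndex = html.indexOf("<b>");                 // UTF-16 units
        int codePointIndex = html.codePointCount(0, charIndex);
        int byteOffset = charIndexToUtf8ByteOffset(html, charIndex);

        System.out.println("char index:       " + charIndex);      // 9
        System.out.println("code point index: " + codePointIndex); // 8
        System.out.println("UTF-8 byte offset: " + byteOffset);    // 11
    }
}
```

If a position counted in one of these units is used to index data measured in another, every tag after the emoji appears shifted by a constant amount, which matches the symptom described above.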

igr commented 2 years ago

Hey @RXminuS! Is this still a valid issue?