jruby / jcodings

Java-based codings helper classes for Joni and JRuby
MIT License

Implement approximate length and other length routines for proper broken character processing #26

Open lopex opened 5 years ago

lopex commented 5 years ago

MRI has several character length routines with different semantics, and they are used quite inconsistently. See the wiki: https://github.com/jruby/jruby/wiki/Encodings-in-JRuby.

For now we only have two semantics:

There are several issues:
https://github.com/jruby/jcodings/issues/25
https://github.com/jruby/joni/issues/38
https://github.com/jruby/joni/issues/17
https://github.com/jruby/joni/issues/46

All of those relate to the semantics where length returns 1 for an invalid character, so that scans can keep advancing while consuming arrays (where we currently have -1 and fall into infinite loops or ArrayIndexOutOfBoundsException).
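To make the failure mode concrete, here is a minimal sketch of the two length semantics on UTF-8 input. It is not the jcodings API: the class and method names (`LengthSemanticsSketch`, `strictLength`, `lenientLength`, `countCharsNaive`, `countCharsLenient`) are hypothetical, and the validation is deliberately simplified (it does not reject overlong or surrogate forms). The `maxSteps` guard exists only so the non-terminating scan can be observed safely.

```java
/** Hypothetical illustration of the two length semantics; not the jcodings API. */
final class LengthSemanticsSketch {

    /**
     * Strict semantics: byte length of the UTF-8 character at p,
     * or -1 if the sequence is invalid or truncated.
     * Simplified: overlong and surrogate forms are not rejected.
     */
    static int strictLength(byte[] bytes, int p, int end) {
        int b = bytes[p] & 0xff;
        int len;
        if (b < 0x80) return 1;
        else if (b >= 0xc2 && b <= 0xdf) len = 2;
        else if (b >= 0xe0 && b <= 0xef) len = 3;
        else if (b >= 0xf0 && b <= 0xf4) len = 4;
        else return -1;                      // stray continuation or illegal lead byte
        if (p + len > end) return -1;        // truncated sequence
        for (int i = 1; i < len; i++) {
            if ((bytes[p + i] & 0xc0) != 0x80) return -1; // not a continuation byte
        }
        return len;
    }

    /** Lenient semantics: an invalid byte counts as a one-byte character. */
    static int lenientLength(byte[] bytes, int p, int end) {
        int n = strictLength(bytes, p, end);
        return n > 0 ? n : 1;
    }

    /** Scan that trusts the strict length; -1 moves p backwards. */
    static int countCharsNaive(byte[] bytes, int maxSteps) {
        int p = 0, count = 0, steps = 0;
        while (p < bytes.length) {
            if (++steps > maxSteps) {
                throw new IllegalStateException("scan did not terminate");
            }
            p += strictLength(bytes, p, bytes.length);
            count++;
        }
        return count;
    }

    /** Scan with the lenient length; p advances by at least 1 per step. */
    static int countCharsLenient(byte[] bytes) {
        int p = 0, count = 0;
        while (p < bytes.length) {
            p += lenientLength(bytes, p, bytes.length);
            count++;
        }
        return count;
    }
}
```

With the input `{'a', 0xff, 'b'}` the naive scan oscillates between offsets 0 and 1 until the guard trips, and with `{0xff}` it steps to offset -1 and throws `ArrayIndexOutOfBoundsException`. The lenient scan reports 3 and 1 characters for the same inputs.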

Presto mitigated some of that by using our NonStrictUtf8Encoding here: https://github.com/prestodb/presto/issues/8711

Ultimately, we need to decide whether to litter our code with more costly validating length routines (which would be wasteful for already validated Strings), or to try a less wasteful approach by expanding on https://github.com/jruby/jcodings/tree/unsafe-encoding
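One way to picture the cost side of that trade-off is a length routine that looks only at the lead byte and is clamped to the bytes remaining. The sketch below is an assumption for illustration, not what the unsafe-encoding branch does and not MRI's definition of approximate length; `LengthCostSketch`, `leadByteLength` and `clampedLength` are hypothetical names.

```java
/** Hypothetical illustration of a cheap, non-validating length; not the jcodings API. */
final class LengthCostSketch {

    /**
     * Cheap UTF-8 length: inspects the lead byte only, no continuation checks.
     * Only meaningful when the input is already known to be valid.
     */
    static int leadByteLength(byte[] bytes, int p) {
        int b = bytes[p] & 0xff;
        if (b < 0x80) return 1;
        if (b < 0xe0) return 2;
        if (b < 0xf0) return 3;
        return 4;
    }

    /**
     * Same lookup, clamped to the remaining bytes so a scan can never
     * step past end, even on a truncated trailing sequence.
     */
    static int clampedLength(byte[] bytes, int p, int end) {
        return Math.min(leadByteLength(bytes, p), end - p);
    }
}
```

For a truncated three-byte sequence such as `{0xe2, 0x82}`, `leadByteLength` still reports 3, while `clampedLength` reports 2, so the scan stops exactly at the end of the array. The result is always at least 1 whenever `p < end`, so the scan also always advances, at the price of one comparison per character rather than a full validation pass.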