jruby / jcodings

Java-based codings helper classes for Joni and JRuby
MIT License

Unable to find org.jcodings.specific.BaseUTF8Encoding.mbcCaseFold #25

Closed ahorek closed 5 years ago

ahorek commented 6 years ago

Hi @lopex, the recent build of the mail gem (https://github.com/mikel/mail) started to fail: https://travis-ci.org/mikel/mail/jobs/435704866

I'm not sure whether the problem is in joni or jcodings. If you have time, please take a look. Thanks.

Failure/Error: Unable to find org.jcodings.specific.BaseUTF8Encoding.mbcCaseFold(BaseUTF8Encoding.java to read failed line

     Java::JavaLang::ArrayIndexOutOfBoundsException:
       -2
     # org.jcodings.specific.BaseUTF8Encoding.mbcCaseFold(BaseUTF8Encoding.java:152)
     # org.jcodings.specific.UTF8Encoding.mbcCaseFold(UTF8Encoding.java:22)
     # org.joni.Search.lowerCaseMatch(Search.java:42)
     # org.joni.Search.access$000(Search.java:27)
     # org.joni.Search$11.search(Search.java:439)
     # org.joni.Matcher.forwardSearchRange(Matcher.java:137)
     # org.joni.Matcher.searchCommon(Matcher.java:425)
     # org.joni.Matcher.search(Matcher.java:301)
     # org.jruby.RubyRegexp.matcherSearch(RubyRegexp.java:231)
     # org.jruby.RubyRegexp.search(RubyRegexp.java:1306)
     # org.jruby.RubyRegexp.matchPos(RubyRegexp.java:1195)
     # org.jruby.RubyRegexp.op_match(RubyRegexp.java:1113)
     # org.jruby.RubyString.op_match(RubyString.java:1656)
     # org.jruby.RubyString$INVOKER$i$1$0$op_match.call(RubyString$INVOKER$i$1$0$op_match.gen)
     # org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
     # home.travis.build.mikel.mail.lib.mail.encodings.invokeOther6:=~(/home/travis/build/mikel/mail/lib/mail/encodings.rb:125)
     # home.travis.build.mikel.mail.lib.mail.encodings.RUBY$method$value_decode$0(/home/travis/build/mikel/mail/lib/mail/encodings.rb:125)
...

lopex commented 6 years ago

reduced case:

"\u{1F48C}" =~ /\=\?/i

lopex commented 6 years ago

This is related to https://github.com/jruby/joni/issues/17. Onigmo appears to compare the first two bytes of "\u{1F48C}" against the "=?" stored in the regexp's exact info field (used by the fast skip algorithms). For that comparison it uses the mbclen(enc, p, end) function, a.k.a. onigenc_mbclen_approximate, which never returns negative values and so acts as a safeguard against broken characters.
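To illustrate the safeguard, here is a minimal sketch of a "precise" length routine that reports truncated characters with negative codes, next to an "approximate" one that always returns a positive length. The class name, method names, and return-code convention are assumptions made for this example; they are not the jcodings or Onigmo API, and the UTF-8 validation is deliberately simplified.

```java
final class ApproxLengthSketch {
    // Hypothetical return-code convention for this sketch only:
    //   n > 0   complete character of n bytes
    //   -1      invalid byte sequence
    //   -1 - k  truncated character, k more bytes needed
    static int preciseLength(byte[] bytes, int p, int end) {
        if (p >= end) return -1;
        int lead = bytes[p] & 0xff;
        int expected;
        if (lead < 0x80) expected = 1;
        else if (lead >= 0xC2 && lead <= 0xDF) expected = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) expected = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) expected = 4;
        else return -1; // stray continuation byte or illegal lead byte

        int available = Math.min(expected, end - p);
        // Simplified check: every trailing byte must look like 10xxxxxx.
        for (int i = 1; i < available; i++) {
            if ((bytes[p + i] & 0xC0) != 0x80) return -1;
        }
        return available == expected ? expected : -1 - (expected - available);
    }

    // Never returns a value <= 0, so callers can safely use the result
    // as a step size or an array offset.
    static int approximateLength(byte[] bytes, int p, int end) {
        int len = preciseLength(bytes, p, end);
        if (len > 0) return len;                      // complete character
        if (len < -1) return (end - p) + (-1 - len);  // available + missing bytes
        return 1;                                     // invalid: skip one byte
    }
}
```

With the UTF-8 bytes of "\u{1F48C}" (F0 9F 92 8C) and end limited to the first two bytes, preciseLength in this sketch returns -3 while approximateLength returns 4. A caller that uses the precise result directly as an index would go out of bounds, which is the kind of failure the stack trace shows.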

lopex commented 6 years ago

The issue was introduced with https://github.com/jruby/joni/commit/012bb20e520eb607ab2c7d6e271cdb140e353b88, which turned on Search.BM_IC, the fast-skip Boyer-Moore / Sunday case-insensitive search routine. The problem doesn't seem to be in the routine itself but in how the case-insensitive comparison is handled. Until we find a solution, we can fall back to Search.SLOW_IC.
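For readers unfamiliar with the technique, a Sunday-style fast-skip search with case-insensitive matching can be sketched as below. This is not joni's Search.BM_IC code; the class and method names are hypothetical, and folding is restricted to ASCII letters so that bytes can be compared one at a time without decoding multibyte characters.

```java
import java.util.Arrays;

final class SundaySearchSketch {
    // ASCII-only case folding; all other byte values are left unchanged.
    static int toLowerAscii(int b) {
        return (b >= 'A' && b <= 'Z') ? b + 32 : b;
    }

    // Returns the byte offset of the first case-insensitive match, or -1.
    static int indexOfIgnoreCase(byte[] text, byte[] pattern) {
        int n = text.length, m = pattern.length;
        if (m == 0) return 0;

        // Shift table keyed by the folded byte just past the current window.
        int[] shift = new int[256];
        Arrays.fill(shift, m + 1);
        for (int i = 0; i < m; i++) {
            shift[toLowerAscii(pattern[i] & 0xff)] = m - i;
        }

        int pos = 0;
        while (pos + m <= n) {
            int i = 0;
            while (i < m && toLowerAscii(text[pos + i] & 0xff)
                    == toLowerAscii(pattern[i] & 0xff)) {
                i++;
            }
            if (i == m) return pos;   // whole pattern matched
            if (pos + m >= n) break;  // no byte after the window to inspect
            pos += shift[toLowerAscii(text[pos + m] & 0xff)];
        }
        return -1;
    }
}
```

Because the skip loop lands on arbitrary byte offsets, a window can start in the middle of a multibyte character. A comparison step that decodes characters at that offset therefore has to tolerate broken or truncated sequences, which is where the case-fold call in the stack trace comes in.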

lopex commented 6 years ago

A temporary fix is in https://github.com/jruby/joni/commit/118dbdeecb42d736ed3dbbcccce13f2fb98753b7; it will not degrade performance relative to previous versions. Keeping the issue open until we decide on adding unsafe and approximate length routines to org.jcodings.Encoding.

lopex commented 6 years ago

@ahorek joni is released and the JRuby snapshots are updated. Thanks for the report.

headius commented 5 years ago

@lopex Is there a further fix needed here?

lopex commented 5 years ago

The ultimate fix would be to implement an approximate length routine for our encodings. For now, as a workaround, Sunday search is turned off for case-insensitive forward searches.

lopex commented 5 years ago

Closing; I created a new issue that explains it: https://github.com/jruby/jcodings/issues/26