StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java [LUCENE-8572]

mikemccand commented 5 years ago

With "lucene-queryparser-6.3.0", specifically in "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" when escaping strings containing extended unicode chars, and with a locale distinct from that of the character set the string uses, the process fails, with a "java.lang.StringIndexOutOfBoundsException". The reason is that the comparison is done by previously converting all of the characters of the string to lower case chars, and by doing this, the original string size isn't anymore the same, but less, as of the transformed one, so that executing org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89

fails with a java.lang.StringIndexOutOfBoundsException.

I wonder whether the transformation to lower case is really needed when treating the escape chars, since by avoiding it, the error may be avoided.

Legacy Jira details

LUCENE-8572 by Octavian Mocanu on Nov 21 2018, updated Aug 10 2019

mikemccand commented 5 years ago

Hi @tonicava, thanks for opening the issue. Would you be able to provide an example or test case that illustrates the failure?

[Legacy Jira: Alan Woodward (@romseygeek) on Nov 22 2018]

mikemccand commented 5 years ago

Hi @romseygeek,

Trying e.g. at

/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:83

in

private static final CharSequence escapeTerm(CharSequence term, Locale locale)

with

term = "İpone " [304, 112, 111, 110, 101, 32]
locale = "us"

result -> StringIndexOutOfBoundsException

(it'll only work when having locale = "tr")

[Legacy Jira: Octavian Mocanu on Nov 22 2018]

mikemccand commented 5 years ago

It looks as though FieldQueryNode and PathQueryNode are using Locale.getDefault() when they should be using Locale.ROOT

cc [~thetaphi‍] who understands locales better than I do...

[Legacy Jira: Alan Woodward (@romseygeek) on Nov 22 2018]

mikemccand commented 5 years ago

Hi, @romseygeek, @thetaphi.

I checked the issue and found that it could be a logical problem.

First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself.

When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())

private static CharSequence replaceIgnoreCase(CharSequence string,
    CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
    // Convert by toLowerCase as follows.
    // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
    // firstIndex will be set 6.
    if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
        start)) == -1)
      break;
    boolean found = true;
    ...
    if (found) {
      // In this line, String.toString() will only have a range of 0 to 5.
      // So here we get a StringIndexOutOfBoundsException.
      result.append(string.toString().substring(copyStart, firstIndex));
      ...
    } else {
      start = firstIndex + 1;
    }
  }
  ...
}

Solving this may not be a big problem.

But what do you think about using

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      // Use String's replace method.
      buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
    }
    return buffer;
  }

instead of

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      // Stay current method.
      buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
    }
    return buffer;
  }

in the escapeWhiteChar method?

[Legacy Jira: Namgyu Kim (@danmuzi) on Nov 30 2018]

mikemccand commented 5 years ago

Hi, in my opinion, the proposal of @danmuzi would be a nice solution, mainly for it avoids the problematic use of toLowerCase.

Best!

[Legacy Jira: Octavian Mocanu on Dec 15 2018]

mikemccand commented 5 years ago

I find the code is relative to LUCENE-4199 . Maybe we should implement it like:

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
      buffer = buffer.toString().replace(escapableWhiteChars[i].toLowerCase(locale), "\\");
    }
    return buffer;
  }

[Legacy Jira: Chongchen Chen on Aug 10 2019]

mikemccand / stargazers-migration-test

StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java [LUCENE-8572] #571

Legacy Jira details