Open mikemccand opened 5 years ago
Hi @tonicava
, thanks for opening the issue. Would you be able to provide an example or test case that illustrates the failure?
[Legacy Jira: Alan Woodward (@romseygeek) on Nov 22 2018]
Hi @romseygeek,
Trying e.g. at
/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:83
in
private static final CharSequence escapeTerm(CharSequence term, Locale locale)
with
term = "İpone " [304, 112, 111, 110, 101, 32]
locale = "us"
result -> StringIndexOutOfBoundsException
(it'll only work when having locale = "tr")
[Legacy Jira: Octavian Mocanu on Nov 22 2018]
It looks as though FieldQueryNode and PathQueryNode are using Locale.getDefault()
when they should be using Locale.ROOT
cc [~thetaphi] who understands locales better than I do...
[Legacy Jira: Alan Woodward (@romseygeek) on Nov 22 2018]
Hi, @romseygeek, @thetaphi.
I checked the issue and found that it could be a logical problem.
First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself.
When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
// string = "İpone " [304, 112, 111, 110, 101, 32], size = 6
...
while (start < count) {
// Convert by toLowerCase as follows.
// string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
break;
boolean found = true;
...
if (found) {
// In this line, String.toString() will only have a range of 0 to 5.
// So here we get a StringIndexOutOfBoundsException.
result.append(string.toString().substring(copyStart, firstIndex));
...
} else {
start = firstIndex + 1;
}
}
...
}
Solving this may not be a big problem.
But what do you think about using
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
// Use String's replace method.
buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
}
return buffer;
}
instead of
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
// Stay current method.
buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
}
return buffer;
}
in the escapeWhiteChar method?
[Legacy Jira: Namgyu Kim (@danmuzi) on Nov 30 2018]
Hi, in my opinion, the proposal of @danmuzi would be a nice solution, mainly for it avoids the problematic use of toLowerCase.
Best!
[Legacy Jira: Octavian Mocanu on Dec 15 2018]
I find the code is relative to LUCENE-4199 . Maybe we should implement it like:
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
buffer = buffer.toString().replace(escapableWhiteChars[i].toLowerCase(locale), "\\");
}
return buffer;
}
[Legacy Jira: Chongchen Chen on Aug 10 2019]
With "lucene-queryparser-6.3.0", specifically in "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" when escaping strings containing extended unicode chars, and with a locale distinct from that of the character set the string uses, the process fails, with a "java.lang.StringIndexOutOfBoundsException". The reason is that the comparison is done by previously converting all of the characters of the string to lower case chars, and by doing this, the original string size isn't anymore the same, but less, as of the transformed one, so that executing org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89
fails with a java.lang.StringIndexOutOfBoundsException.
I wonder whether the transformation to lower case is really needed when treating the escape chars, since by avoiding it, the error may be avoided.
Legacy Jira details
LUCENE-8572 by Octavian Mocanu on Nov 21 2018, updated Aug 10 2019