Marking private until triage and fix.
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 7:40
> for the sanitizer to be effective for others who want to use it, they should
receive thoroughly tested configurations along with it.
agreed.
> Two sequential character classes are starred and not disjoint. Specifically:
[\\p{Zs}]*(\\s)*
Would making the first greedy address the problem?
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 7:41
Removing private flag since I triaged as non-critical.
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 8:27
https://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=217
addresses this
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 8:30
[deleted comment]
Making either star lazy/greedy does not change the O(n^2) problem, it merely
rearranges the loop order that the regex engine will follow through those
O(n^2) steps.
You should always make adjacent starred (or plussed) character classes disjoint
(that is, no character should match both). In this case, since your
first character class already matches any \p{Zs} character, the remaining
disjoint set of characters in the second class is practically empty. The way
you remove this adjacent overlap of the character classes depends on whether
you need to capture the trailing whitespace at the end (e.g. for preserving it
in the output). My sense is that you don't, in which case, you should simply
remove the \s* (whitespace characters), since it is already largely matched by
\p{Zs} (a whitespace character that is invisible, but does take up space). If
you do, you should require a non-whitespace character before you start matching
"trailing" whitespace.
Original comment by fabiocbi...@gmail.com
on 27 Feb 2014 at 9:12
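To make the cost described above concrete, here is a minimal sketch that counts how often the regex engine reads a character while failing to match a run of spaces followed by a non-space. The class name OverlapDemo and the CountingSequence wrapper are illustrative only and not part of the sanitizer. The read count is a rough proxy for engine work, and exact numbers will vary across JDK versions.

```java
import java.util.regex.Pattern;

public class OverlapDemo {
    /** Hypothetical helper: wraps a String and counts charAt calls made by the matcher. */
    static final class CountingSequence implements CharSequence {
        private final String s;
        long reads = 0;
        CountingSequence(String s) { this.s = s; }
        @Override public int length() { return s.length(); }
        @Override public char charAt(int i) { reads++; return s.charAt(i); }
        @Override public CharSequence subSequence(int a, int b) { return s.subSequence(a, b); }
        @Override public String toString() { return s; }
    }

    /** Runs a full match of regex against n spaces followed by 'x' and returns the read count. */
    public static long reads(String regex, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(' ');  // U+0020 is in both \p{Zs} and \s
        }
        sb.append('x');      // forces the overall match to fail
        CountingSequence seq = new CountingSequence(sb.toString());
        boolean matched = Pattern.compile(regex).matcher(seq).matches();
        if (matched) {
            throw new IllegalStateException("expected the match to fail");
        }
        return seq.reads;
    }

    public static void main(String[] args) {
        String overlapping = "[\\p{Zs}]*(\\s)*";   // the two adjacent starred classes
        String lazyFirst = "[\\p{Zs}]*?(\\s)*";    // making the first star lazy
        String single = "[\\p{Zs}]*";              // second class removed
        for (int n : new int[] {100, 200, 400}) {
            System.out.println("n=" + n
                + " overlapping=" + reads(overlapping, n)
                + " lazyFirst=" + reads(lazyFirst, n)
                + " single=" + reads(single, n));
        }
    }
}
```

Doubling n should roughly quadruple the read count for the two overlapping variants, while the single-class variant grows roughly linearly.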
> Making either star lazy/greedy does not change the O(n^2) problem, it merely
rearranges the loop order that the regex engine will follow through those
O(n^2) steps.
Sorry, I misspoke. I changed it to be "possessive", not "greedy".
"Possessive" is defined thus at
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html :
"""* Possessive quantifiers, which greedily match as much as they can and do
not back off, even when doing so would allow the overall match to succeed."""
so my understanding is that the extra '+' in (x++) changes the backtracking
mode from Prolog-style backtracking to PEG-style backtracking.
----
That said, I think you're right about
> whether you need to capture the trailing whitespace at the end (e.g. for
preserving it in the output). My sense is that you don't,
so there is no loss of function. I will eliminate that and leave a comment as
to why it's unnecessary.
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 9:38
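The quoted definition of possessive quantifiers can be illustrated with a small sketch. The class and method names are made up for this example. It also shows why switching a quantifier to possessive is not always behavior-preserving: a possessive star refuses to give back a character that a later part of the pattern needs.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    /** Returns true if regex matches the whole input. */
    public static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String threeSpaces = "   ";
        // Greedy: \s* first takes all three spaces, then backs off one
        // so that the trailing \s can match.
        System.out.println(fullMatch("\\s*\\s", threeSpaces));   // true
        // Possessive: \s*+ takes all three spaces and never backs off,
        // so the trailing \s has nothing left to match.
        System.out.println(fullMatch("\\s*+\\s", threeSpaces));  // false
    }
}
```

In the pattern under discussion nothing after the starred classes needs a whitespace character back, which is presumably why the possessive form is safe there.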
Yes, I think any of those would be appropriate fixes, including the possessive
modifier you mentioned.
Original comment by fabiocbi...@gmail.com
on 27 Feb 2014 at 9:44
I can only eliminate the trailing \s* if \s is a subset of \p{Zs}, but it is
not: \s includes code points in the control character (Cc) category, which is
disjoint from the Z category.
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt says of TAB and SPACE:
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
...
0020;SPACE;Zs;0;WS;;;;;N;;;;;
This affects HTML5 because HTML5 defines the space that can be ignored at the
beginning and end of certain attribute values at
http://www.w3.org/TR/html5/infrastructure.html#space-character thus:
"""
The space characters, for the purposes of this specification, are U+0020 SPACE,
"tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).
"""
----
Empirically, in my version of JDK6, the differences between \p{Zs} and \s for
code-points < U+E000 are:
9 inP=false, inQ=true
a inP=false, inQ=true
b inP=false, inQ=true
c inP=false, inQ=true
d inP=false, inQ=true
a0 inP=true, inQ=false
1680 inP=true, inQ=false
180e inP=true, inQ=false
2000 inP=true, inQ=false
2001 inP=true, inQ=false
2002 inP=true, inQ=false
2003 inP=true, inQ=false
2004 inP=true, inQ=false
2005 inP=true, inQ=false
2006 inP=true, inQ=false
2007 inP=true, inQ=false
2008 inP=true, inQ=false
2009 inP=true, inQ=false
200a inP=true, inQ=false
200b inP=true, inQ=false
202f inP=true, inQ=false
205f inP=true, inQ=false
3000 inP=true, inQ=false
as derived by:
import java.util.regex.*;

public class Foo {
  public static void main(String[] argv) {
    // P is the Unicode space-separator class; Q is Java's default \s.
    Pattern p = Pattern.compile("[\\p{Zs}]");
    Pattern q = Pattern.compile("\\s");
    for (char c = 0; c < 0xE000; ++c) {
      String s = String.valueOf(c);
      boolean inP = p.matcher(s).matches();
      boolean inQ = q.matcher(s).matches();
      // Report only the code points on which the two classes disagree.
      if (inP != inQ) {
        System.out.println(
            Integer.toString(c, 16) + " inP=" + inP + ", inQ=" + inQ);
      }
      if (c == ' ' && !inP) {  // Sanity check
        System.err.println("Something widgy");
      }
    }
  }
}
Original comment by mikesamuel@gmail.com
on 27 Feb 2014 at 10:55
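Since neither class contains the other, one way to avoid two adjacent starred classes altogether is to fold both into a single class. The sketch below is only an illustration and is not necessarily what r217 or r223 does; the class and constant names are invented here. Note that the merged class accepts any interleaving of the two kinds of whitespace, which is slightly broader than a \p{Zs} run followed by a \s run.

```java
import java.util.regex.Pattern;

public class MergedClassDemo {
    static final Pattern ZS_ONLY = Pattern.compile("[\\p{Zs}]*");
    static final Pattern S_ONLY = Pattern.compile("\\s*");
    // Hypothetical replacement: one starred class, so there is no adjacent overlap.
    static final Pattern MERGED = Pattern.compile("[\\p{Zs}\\s]*");

    /** Returns true if the pattern matches the whole input. */
    public static boolean all(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String tab = "\t";           // Cc: matched by \s but not by \p{Zs}
        String nbsp = "\u00a0";      // Zs: matched by \p{Zs} but not by default \s
        String mixed = "\t \u00a0\n";
        System.out.println(all(ZS_ONLY, tab));    // false
        System.out.println(all(S_ONLY, nbsp));    // false
        System.out.println(all(MERGED, mixed));   // true
    }
}
```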
The release with the fix is r223 which is currently staging to Maven central
and should be available at http://search.maven.org/#browse%7C84770979 shortly.
Original comment by mikesamuel@gmail.com
on 28 Feb 2014 at 9:57
Original issue reported on code.google.com by
fabiocbi...@gmail.com
on 27 Feb 2014 at 7:39