Configured regex has unbounded O(n^2) execution time

GoogleCodeExporter commented 9 years ago

Exploit: "Cause an exception, crash, or inf. loop in the sanitizer that causes 
it to fail to provide service or consume inordinate resources for an input of 
that size.

What steps will reproduce the problem?
1.Enter <a href="http://x            '">t</a> with a large number of spaces 
between the x and the '
2. 15kb of spaces -> 2s execution time. 60kb of spaces -> 171s execution time

What is the expected output? What do you see instead?
Expected: Worst case execution time is at most O(n) for an input of size n (or 
execution time limited to mitigate an attack on server resources)
Observed: Worst case execution time is O(n^2) for an input of size n

What version of the product are you using? On what operating system?
As currently on http://canyouxssthis.com/HTMLSanitizer/reflect (no version 
specified)

Please provide any additional information below.

Not sure if this is "in bounds" since it is a problem with the configuration of 
the sanitizer rather than the sanitizer per se. However, for the sanitizer to 
be effective for others who want to use it, they should receive thoroughly 
tested configurations along with it.

The cause for this defect is the regex defined in 
Pattern OFFSITE_URL = 
Pattern.compile("(\\s)*((ht|f)tp(s?)://|mailto:)[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{Z
s}\\.\#@\$%\\+&;:\\-_~,\\?=/!\\(\\)]*(\\s)*");
Two sequential character classes are starred and not disjoint. Specifically: 
[\\p{Zs}]*(\\s)*

Original issue reported on code.google.com by fabiocbi...@gmail.com on 27 Feb 2014 at 7:39

GoogleCodeExporter commented 9 years ago

Marking private until triage and fix.

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 7:40

Added labels: Private

GoogleCodeExporter commented 9 years ago

> for the sanitizer to be effective for others who want to use it, they should 
receive thoroughly tested configurations along with it.

agreed.

> Two sequential character classes are starred and not disjoint. Specifically: 
[\\p{Zs}]*(\\s)*

Would making the first greedy address the problem?

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 7:41

GoogleCodeExporter commented 9 years ago

Removing private flag since I triaged as non-critical.

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 8:27

Changed state: Accepted
Removed labels: Private

GoogleCodeExporter commented 9 years ago

https://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=217 
addresses this

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 8:30

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Making either star lazy/greedy does not change the O(n^2) problem, it merely 
rearranges the loop order that the regex engine will follow through those 
O(n^2) steps.

You should always make adjacent starred (or plused) character classes disjoint 
(they don't share any characters that match in both). In this case, since your 
first character class already matches any \p{Zs} character, the remaining 
disjoint set of characters in the second class is practically empty. The way 
you remove this adjacent overlap of the character classes depends on whether 
you need to capture the trailing whitespace at the end (e.g. for preserving it 
in the output). My sense is that you don't, in which case, you should simply 
remove the \\s* (whitespace characters), since it is already largely matched by 
\p{Zs} (a whitespace character that is invisible, but does take up space). If 
you do, you should require a non-whitespace character before you start matching 
"trailing" whitespace.

Original comment by fabiocbi...@gmail.com on 27 Feb 2014 at 9:12

GoogleCodeExporter commented 9 years ago

> Making either star lazy/greedy does not change the O(n^2) problem, it merely 
rearranges the loop order that the regex engine will follow through those 
O(n^2) steps.

Sorry, I misspoke.  I changed it be "possessive", not "greedy".

"Possessive" is defined thus at 
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html :

"""* Possessive quantifiers, which greedily match as much as they can and do 
not back off, even when doing so would allow the overall match to succeed."""

so my understanding is that the extra '+' in (x++) changes the backtracking 
mode from Prolog-style backtracking to PEG-style backtracking.

----

That said, I think you're right about

> whether you need to capture the trailing whitespace at the end (e.g. for 
preserving it in the output). My sense is that you don't, 

so there is no loss of function.  I will eliminate that and leave a comment as 
to why it's unnecessary.

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 9:38

GoogleCodeExporter commented 9 years ago

Yes, I think any of those would be appropriate fixes, including the possessive 
modifier you mentioned.

Original comment by fabiocbi...@gmail.com on 27 Feb 2014 at 9:44

GoogleCodeExporter commented 9 years ago

I can only eliminate the trailing \s* if \s is a subset of \p{Zs} but it is not 
because \s includes code-points in the control character (Cc) category which is 
disjoint with the Z category.

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt says of TAB and SPACE
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
...
0020;SPACE;Zs;0;WS;;;;;N;;;;;

This affects HTML5 because HTML5 defines the space that can be ignored at the 
beginning and end of certain attribute values at 
http://www.w3.org/TR/html5/infrastructure.html#space-character thus:
"""
The space characters, for the purposes of this specification, are U+0020 SPACE, 
"tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).
"""

----
Empirically, in my version of JDK6, the differences between \p{Zs} and \s for 
code-points < U+E000 are:

9 inP=false, inQ=true
a inP=false, inQ=true
b inP=false, inQ=true
c inP=false, inQ=true
d inP=false, inQ=true
a0 inP=true, inQ=false
1680 inP=true, inQ=false
180e inP=true, inQ=false
2000 inP=true, inQ=false
2001 inP=true, inQ=false
2002 inP=true, inQ=false
2003 inP=true, inQ=false
2004 inP=true, inQ=false
2005 inP=true, inQ=false
2006 inP=true, inQ=false
2007 inP=true, inQ=false
2008 inP=true, inQ=false
2009 inP=true, inQ=false
200a inP=true, inQ=false
200b inP=true, inQ=false
202f inP=true, inQ=false
205f inP=true, inQ=false
3000 inP=true, inQ=false

as derived by

import java.util.regex.*;

public class Foo {
  public static void main(String[] argv) {
    Pattern p = Pattern.compile("[\\p{Zs}]");
    Pattern q = Pattern.compile("\\s");
    for (char c = 0; c < 0xE000; ++c) {
      String s = new StringBuilder(1).append(c).toString();
      boolean inP = p.matcher(s).matches();
      boolean inQ = q.matcher(s).matches();
      if (inP != inQ) {
        System.out.println(
            Integer.toString(c, 16) + " inP=" + inP + ", inQ=" + inQ);
      }
      if (c == ' ' && !inP) {  // Sanity check
        System.err.println("Something widgy");
      }
    }
  }
}

Original comment by mikesamuel@gmail.com on 27 Feb 2014 at 10:55

GoogleCodeExporter commented 9 years ago

The release with the fix is r223 which is currently staging to Maven central 
and should be available at http://search.maven.org/#browse%7C84770979 shortly.

Original comment by mikesamuel@gmail.com on 28 Feb 2014 at 9:57

Changed state: Fixed

tcowans / owasp-java-html-sanitizer

Configured regex has unbounded O(n^2) execution time #25