tcowans / owasp-java-html-sanitizer

Automatically exported from code.google.com/p/owasp-java-html-sanitizer
Other
1 stars 0 forks source link

Stackoverflow sanitizing HTML with large inline background-image #16

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Sanitize with ExampleTest.java the string below causes StackOverflow (with 
r173-EbayPolicy-based code), likely due to very deep regular expression tree. 
Ran into this sanitizing a large email collection. Verified with clean r176 by 
adding a test to ExamplesTest.java. If I sufficiently shorten the image data, I 
no longer get stack overflow.

1) testDataImage(org.owasp.html.ExamplesTest)java.lang.StackOverflowError
    at java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
[...repeats nothing more useful at bottom of list...]

The test is (String all on one line):

  public final void testDataImage() {
        String input="<a class=\"atc_s addthis_button_compact\" style=\"background-image:url();\"></a>";
    String sanitized = EbayPolicyExample.POLICY_DEFINITION.sanitize(input);
    System.out.println(sanitized);
  }

What is the expected output? What do you see instead?
Sanitized HTML with this style monstrosity removed or passed.

What version of the product are you using? On what operating system?
r173-based code, repeated with test case in clean r176.

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) Server VM (build 20.45-b01, mixed mode)

Linux 3.2.0-49-generic-pae #75-Ubuntu SMP

Attached gz version of smallest HTML fragment that causes error (image data can 
be shortened further, but attached is with original image data).

Thanks!

Fred

PS:

FWIW -
in StylingPolicy.html, the '.' in the below regexp should probably be escaped. 
However, doing so does not help with above problem (also doesn't cause new 
failures):

  private static final Pattern NON_NEGATIVE_LENGTH = Pattern.compile(
      "(?:0|[1-9][0-9]*)([.][0-9]+)?(ex|[ecm]m|v[hw]|p[xct]|in|%)?");

I also though in CssGrammar.java, the URL_CHARS regexp should require at least 
one character (didn't check with actual CssGrammar, though) by makeing '*' into 
'+', but that also did not abolish the problem (also doesn't cause new 
failures):

    // url chars               ({url_special_chars}|{nonascii}|{escape})+
    String URL_CHARS = "(?:"
        + url_special_chars + "|" + nonascii + "|" + escape + ")+";

Original issue reported on code.google.com by fred.lin...@gmail.com on 16 Jul 2013 at 2:51

Attachments:

GoogleCodeExporter commented 9 years ago
Is the image copyrighted?

Thanks for the note on the regexs.  I have a rewrite of the CSS lexer mostly 
ready to go which gets rid of regular expressions (and the associated unbounded 
recursion and backtracking) entirely.

Original comment by mikesamuel@gmail.com on 16 Jul 2013 at 4:22

GoogleCodeExporter commented 9 years ago
Great!

Copyright - I don't know.

It looks like someone sent this page:
http://www.thesouthafrican.com/entertainment/south-african-couple-launches-engli
sh-bubbly-to-rival-champagne.htm

and that the image is the red "+share" icon at the top. Attaching as both 
base64 (from the html) and decoded to gif.

Original comment by fred.lin...@gmail.com on 16 Jul 2013 at 9:56

Attachments:

GoogleCodeExporter commented 9 years ago
Attached, Untitled.zip that includes Untitled.html, Untitled.b64, and 
Untitled.gif that I made using gimp. It triggers the same error.

I hereby release this to the public domain without any warranty explicit or 
implied.

Original comment by fred.lin...@gmail.com on 16 Jul 2013 at 10:01

Attachments:

GoogleCodeExporter commented 9 years ago
Great.  Thanks.

Original comment by mikesamuel@gmail.com on 16 Jul 2013 at 10:38

GoogleCodeExporter commented 9 years ago
https://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=180 
replaces the CSS lexer with one that passes your test, and doesn't backtrack 
but I'm not going to close this bug until that's production ready since the new 
code has not been thoroughly vetted on malformed inputs.  

My plan thus far is

1. Rewrite the CSS filter with a token-level filter based on Caja white-lists ( 
https://code.google.com/p/google-caja/wiki/CajaWhitelists ) that conservatively 
identifies all URLs, and normalizes tokens.  This should fix all the border 
problems by being more permissive and put us in a place to allow data URLs.
2. Test the new lexer with fuzzers and white-box tests until I'm confident that 
there's no inf. loops.
3. Push a release with the more liberal CSS sanitizer.
4. Look into data: URLs and plan from there.

I should be able to get 1-3 done this week or next, but feel free to play 
around with trunk in the meantime but please don't roll out to production yet.

Original comment by mikesamuel@gmail.com on 17 Jul 2013 at 12:24

GoogleCodeExporter commented 9 years ago
Re data: attributes, I'm unsure what to do there.

https://www.owasp.org/images/0/03/Mario_Heiderich_OWASP_Sweden_The_image_that_ca
lled_me.pdf suggests that allowing images is not ok, and I don't know whether 
browsers agree on the origin of an image from a data URL.

I could whitelist

    data:image/gif;base64,...
    data:image/png;base64,...
    data:image/jpeg;base64,...

where the first 4 characters of ... are the b64 encoding of the first 3 
characters of the magic number for that image type.

That doesn't eliminate the risk of polyglots like 
http://www.thinkfu.com/blog/gifjavascript-polyglots but combined with the 
explicit mime-type should suffice.

Original comment by mikesamuel@gmail.com on 17 Jul 2013 at 10:20

GoogleCodeExporter commented 9 years ago
Tested with trunk/r198 and can confirm that the stack overflow no longer 
happens. For my application, I don't miss the image. We want to show email 
reasonably faithfully, but removing attack vectors in much more important that 
look and this type of inline image data is rare (it was one document in about 
150,000).

Original comment by fred.lin...@gmail.com on 18 Jul 2013 at 7:50

GoogleCodeExporter commented 9 years ago
Stackoverflow is fixed at r198.

Punting on support for data URLs until there is a client who really needs image 
embedding via CSS.

Original comment by mikesamuel@gmail.com on 24 Jul 2013 at 3:47

GoogleCodeExporter commented 9 years ago
Thanks, Mike!! OK in testing. Will use in prod as soon as it shows up on maven 
central.

Original comment by fred.lin...@gmail.com on 24 Jul 2013 at 4:20