Error parsing <style> contents containing CDATA

GoogleCodeExporter commented 9 years ago

From Gang Zheng:

I tried the following input string with AnitSamy and encountered an exception:

Input String: <style type="text/css"><![CDATA[P {  margin-bottom:
0.08in; } ]]></style>

org.apache.batik.css.parser.ParseException: character
       at org.apache.batik.css.parser.Scanner.nextToken(Scanner.java:381)
       at org.apache.batik.css.parser.Scanner.next(Scanner.java:222)
       at org.apache.batik.css.parser.Parser.parseStyleSheet(Parser.java:185)
       at
org.owasp.validator.css.CssScanner.scanStyleSheet(CssScanner.java:124)
       at
org.owasp.validator.html.scan.AntiSamyDOMScanner.recursiveValidateTag(AntiSamyDO
MScanner.java:318)
       at
org.owasp.validator.html.scan.AntiSamyDOMScanner.scan(AntiSamyDOMScanner.java:13
5)
       at org.owasp.validator.html.AntiSamy.scan(AntiSamy.java:99)

I traced the code, and it seems that AntiSamy passes the text of
"<![CDATA[P {  margin-bottom: 0.08in; } ]]>" to the CSS scanner, and
the CSS scanner does not like <![CDATA[...]]> as the surrounding of
the real style sheet contents.

If I remove the CDATA from the input and change the style sheet
contents to "<style type="text/css">P {  margin-bottom: 0.08in;
}</style>", everything works fine.

So my questions is, how can I make the AntiSamy/CSS Scanner correctly
parse the CDATA contents? After all, the CDATA section in the original
input is perfectly legal style sheet contents.

Original issue reported on code.google.com by li.jaso...@gmail.com on 24 Nov 2008 at 9:47

GoogleCodeExporter commented 9 years ago

Figured out a fix, look for this in the new version.

Original comment by arshan.d...@gmail.com on 26 Nov 2008 at 5:53

GoogleCodeExporter commented 9 years ago

Fixed!

Original comment by arshan.d...@gmail.com on 26 Nov 2008 at 6:34

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I still have the same problem in 1.3. Here is the test case:

    @Test
    public void testScanCDATA() {
        System.out.println("ScanCDATA");   
        String taintedHTML = "<![CDATA[*#autos.tabContent {display: block;}]]>";

        CssScanner scanner = new CssScanner(policy, null);
        try {
            scanner.scanStyleSheet(taintedHTML, 1000);
        }
        catch (Exception ex) {
            Assert.fail(ex.getMessage());
        }
    }

And the error message:
Caused by: org.apache.batik.css.parser.ParseException: character
        at org.apache.batik.css.parser.Scanner.nextToken(Unknown Source)
        at org.apache.batik.css.parser.Scanner.next(Unknown Source)
        at org.apache.batik.css.parser.Parser.parseStyleSheet(Unknown Source)
        at org.owasp.validator.css.CssScanner.scanStyleSheet(CssScanner.java:131)
        at
org.owasp.validator.html.scan.AntiSamyDOMScanner.recursiveValidateTag(AntiSamyDO
MScanner.java:429)

I looked into org.apache.batik.css.parser.Scanner and it recognizes <!-- but 
not <![.
May I please know what was the fix?

Original comment by ewuj...@gmail.com on 25 Mar 2009 at 9:20

GoogleCodeExporter commented 9 years ago

This is interesting! We never anticipated someone using the CssScanner class
directly. The fix was put into the AntiSamyDOMScanner class where the CssScanner
class is normally invoked. 

However, to help you (anyone else using it this way) we are going to move the 
fix
into CssScanner.

Original comment by arshan.d...@gmail.com on 25 Mar 2009 at 5:48

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Thanks for your quick response. I read the code in AntiSamyDOMScanner and there 
are
two minor issues here:

Line 420, '.' is used in the pattern and therefore it only works in single-line 
mode.
I would suggest to use '\s\S' instead. The original string is
"<![CDATA[*#autos.tabContent {\ndisplay: block;\n}]]>". That's why it gets 
through
the check in AntiSamyDOMScanner and reaches CssScanner.

Line 423, you possibly mean "isCdata = true;". Otherwise, CDATA is removed 
after the
scan.

Original comment by ewuj...@gmail.com on 26 Mar 2009 at 8:07

GoogleCodeExporter commented 9 years ago

By the way, I really like the library and wish you guys do well.

Original comment by ewuj...@gmail.com on 26 Mar 2009 at 8:39

GoogleCodeExporter commented 9 years ago

Here's a patch for the AntiSamyDOMScanner in 1.3.  I didn't fix the CssScanner.

This should allow multiline CDATA blocks in a style tag.

Enjoy ...

Original comment by dob...@gmail.com on 20 Apr 2009 at 7:34

Attachments:

cdata-multiline-style-tag.patch.txt

GoogleCodeExporter commented 9 years ago

Sorry the last patch was buggy, here's a better one.

Change: don't attempt to re-wrap the CDATA block as CDATA again, because the 
HTML 
writer does this automaticlly, and it become eminently necessary if we've got 
the 
CDATA markup inside the string, resulting in the CDATA wrapper being added 
twice.  
Not sure of a better solution so I just let the HTML writer code decide whether 
the 
CDATA is really needed or not.

Original comment by dob...@gmail.com on 20 Apr 2009 at 8:47

Attachments:

cdata-multiline-style-tag.patch.txt

GoogleCodeExporter commented 9 years ago

Ok, I've put a lot of thought into this.

First off, I've moved all the code to CssScanner.java where it belongs. This is
usually Jason's area, so hopefully he can sanity check my changes.

AntiSamy now considers your intentions of using XHTML serialization or not, 
since
this affects behavior. Strict XHTML requires stylesheet data to be in CDATA 
blocks -
and with that in mind, here is the decision tree:

If using XHTML:
  - If CDATA is found
     - Strip out the data (XHTMLSerializer will re-add it)
  - If CDATA is not found
     - Analyze and return as is (XHTMLSerializer will add it)

If not using XHTML:
  - If CDATA is found
     - Strip out the data
     - Manually add CDATA blocks around cleaned CSS data
  - If CDATA is not found
     - Return cleaned data after scanning as is

I detect CDATA intent with dobesv's regular expression, which works as 
advertised
across lines. I hope my fix that I'm committing makes everyone happy, but I'm 
not
sure if user intent in this area can be cleanly assumed by only using the XHTML
policy directive.

I would appreciate feedback in this area - I am committing this patch tonight 
so you
can test the current branch by the morning.

Original comment by arshan.d...@gmail.com on 3 Aug 2009 at 5:40

GoogleCodeExporter commented 9 years ago

Feel free to re-open if it doesn't make you happy.

Original comment by arshan.d...@gmail.com on 3 Aug 2009 at 6:23

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

You mentioned that you changed the file in your last comment, where is this
available? We are having a similar issue. The scanner appears to be inserting a 
cdata
tag but when we send the html back through it again we are getting the same 
error
org.apache.batik.css.parser.ParseException: character. Thx.

Original comment by waseul.i...@gmail.com on 27 Feb 2010 at 11:58

GoogleCodeExporter commented 9 years ago

Is there a way to prevent the XHTMLSerializer from adding the CDATA in XHTML 
mode? Maybe a directive or switch...

We are using FCK Editor and he seems to be having problems when reading the 
CDATA section on edition so I'd prefer to leave the css intact...

Thank you.

Original comment by fpoliq...@gmail.com on 8 Dec 2010 at 8:24

GoogleCodeExporter commented 9 years ago

I suppose one could subclass the org.apache.xml.serialize.XHTMLSerializer class 
to act on CDATA according to a switch - but I'm not sure how easy that will be 
from a SAX approach. I will look at any patches submitted, though.

Original comment by arshan.d...@gmail.com on 14 Dec 2010 at 3:16

rjatkins / owaspantisamy

Error parsing <style> contents containing CDATA #30