tcowans / owasp-java-html-sanitizer

Automatically exported from code.google.com/p/owasp-java-html-sanitizer
Other
1 stars 0 forks source link

Moving style attributes to font tag changes rendering #9

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Consider this HTML:
<table style="color: rgb(0, 0, 0); font-family: Arial, Geneva, sans-serif;">
<tbody>
<tr>
<th>Column One</th><th>Column Two</th>
</tr>
<tr>
<td align="center" style="background-color: rgb(255, 255, 254);"><font 
size="2">Size 2</font></td>
<td align="center" style="background-color: rgb(255, 255, 254);"><font 
size="7">Size 7</font></td>
</tr>
</tbody>
</table>

If you display this in a browser, all the text inside the table renders in a 
sans-serif font.

2. Sanitize that HTML with allowStyling(). Some of the style attributes are 
moved to a font tag. This is the output:
<table>
<font face="Arial, Geneva, sans-serif" style="color:#000">
<tbody>
<tr>
<th>Column One</th>
<th>Column Two</th>
</tr>
<tr>
<td align="center"><font style="background-color:#fffffe"><font size="2">Size 
2</font></font></td>
<td align="center"><font style="background-color:#fffffe"><font size="7">Size 
7</font></font></td>
</tr>
</tbody>
</font>
</table>

If you view this in a browser, the table text is now rendered in serif instead 
of sans-serif.

What is the expected output? What do you see instead?
I think this is the expected output, given the design of the library. However, 
I question whether transforming style attributes by adding a font tag is really 
the "right" thing to do. Besides changing how the HTML is rendered, the font 
tag is deprecated in HTML 4.0 and is not supported in HTML 5. If the code is 
able to generate sanitized style attributes for use in the font tag, why not 
put those same style attributes in the original style attribute (in this case, 
on the table element)?

What version of the product are you using? On what operating system?
r135 on Windows 7, with Java 6.

Please provide any additional information below.

Original issue reported on code.google.com by danr...@gmail.com on 1 Feb 2013 at 4:47

GoogleCodeExporter commented 9 years ago
This is the expected output, but obviously, if the font tag is shuttled into 
some non-row content abyss, then it could be improved.

The design decision was made that way because CSS is such a large attack 
surface that anything we can do to limit attackers' ability to manipulate CSS 
leaves us in a more secure position.

Are there tags besides HTML where this is a problem?

If the styling handler detected that it was inside a table, thead, tbody, 
tfoot, rowgroup, colgroup, or tr element, and instead inserted the <font> 
element inside contained <td> and <th> elements would that help?

So instead of

<table>
<font face="Arial, Geneva, sans-serif" style="color:#000">
<tbody>
<tr>
<th>Column One</th>
<th>Column Two</th>
</tr>
<tr>
<td align="center"><font style="background-color:#fffffe"><font size="2">Size 
2</font></font></td>
<td align="center"><font style="background-color:#fffffe"><font size="7">Size 
7</font></font></td>
</tr>
</tbody>
</font>
</table>

you would get

<table style="color:#000">
<tbody>
<tr>
<th><font face="Arial, Geneva, sans-serif">Column One</th>
<th><font face="Arial, Geneva, sans-serif">Column Two</th>
</tr>
<tr>
<td align="center"><font style="background-color:#fffffe" face="Arial, Geneva, 
sans-serif" size="2">Size 2</font></td>
<td align="center"><font style="background-color:#fffffe" face="Arial, Geneva, 
sans-serif" size="7">Size 7</font></td>
</tr>
</tbody>
</table>

Original comment by mikesamuel@gmail.com on 2 Feb 2013 at 6:25

GoogleCodeExporter commented 9 years ago
White-listing a set of fonts from 
http://webdesign.about.com/od/fonts/qt/web-safe-fonts.htm could keep the size 
of the sanitized output much closer to the size of the input for large tables, 
but I'm loathe to do anything that makes some fonts work with tables and others 
not.

Original comment by mikesamuel@gmail.com on 2 Feb 2013 at 11:23

GoogleCodeExporter commented 9 years ago
I'm actually not that much of an HTML expert; this is just the problem I saw in 
my first test. I assume there will be other constructs that are similarly 
problematic, but I'm not in a position to enumerate them.

I still think you could preserve some style attributes without increasing the 
attack surface. Currently when you see 
style="color: rgb(0, 0, 0); font-family: Arial, Geneva, sans-serif;"
you parse that into some data structures that are subsequently used in a font 
tag. I'm suggesting that instead of emitting the font tag, you could remove the 
user-supplied style attributes, and replace it with a new style attribute that 
you generate from known constructs. So in my example, you would get:
<table style="color:#000; font-family: Arial, Geneva, sans-serif;">

I'm concerned that any other approach is likely to impact rendering in ways 
that aren't obvious to end users. It might not be a big deal if someone is 
typing markup into a wiki, but if your app allows someone to paste in a chunk 
of HTML (copied from a web page, or perhaps even a word processing 
application), then users will expect the rendering to look as similar to the 
original as possible.

Original comment by danr...@gmail.com on 4 Feb 2013 at 8:08

GoogleCodeExporter commented 9 years ago
> if your app allows someone to paste in a chunk of HTML (copied from a web 
page, or perhaps even a word processing application), then users will expect 
the rendering to look as similar to the original as possible.

agreed

> instead of emitting the font tag, you could remove the user-supplied style 
attributes, and replace it with a new style attribute that you generate from 
known constructs

I do something like that already, just not for font names.
http://code.google.com/p/owasp-java-html-sanitizer/source/browse/trunk/src/tests
/org/owasp/html/StylingPolicyTest.java

http://code.google.com/p/owasp-java-html-sanitizer/source/browse/trunk/src/main/
org/owasp/html/StylingPolicy.java#320
Only font family, align, and style are put on the <font> tag.  The latter two 
are easy to whitelist.

http://www.w3.org/TR/CSS21/fonts.html#value-def-family-name says
> Font family names must either be given quoted as strings, or unquoted as a 
sequence of one or more identifiers. This means most punctuation characters and 
digits at the start of each token must be escaped in unquoted font family names.

I'll see if I can come up with a white-list of generic font names (e.g. 
sans-serif), and then any non-generic font name that contains only ASCII alpha 
numerics and spaces gets quoted and put in a CSS style tag.  Anything with 
punctuation like the examples below from the CSS spec I'll either reject or 
maybe shove in a <font face>.

    font-family: Ahem!, sans-serif;
    font-family: test@foo, sans-serif;
    font-family: #POUND, sans-serif;
    font-family: Hawaii 5-0, sans-serif;

I'll test whether vendor prefixed  ones like -webkit-small-control survive 
quoting.  Allowing untrusted code to spoof OS controls might enable trusted 
path violation anyway.

Original comment by mikesamuel@gmail.com on 5 Feb 2013 at 10:37

GoogleCodeExporter commented 9 years ago
http://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=147 fixes 
this issue.  Let me know if that works for you and I'll cut a push to maven 
central.

Original comment by mikesamuel@gmail.com on 12 Feb 2013 at 7:14

GoogleCodeExporter commented 9 years ago
Thanks Mike, this does in fact fix the specific problem I reported.

(I'm still concerned that the fidelity of the transformed HTML will be 
insufficient unless a lot more CSS constructs are supported, though.)

Original comment by danr...@gmail.com on 2 Apr 2013 at 10:26

GoogleCodeExporter commented 9 years ago
Great.  I'll make sure this is on maven current.

Re CSS, my current (full-time) project involves generating sanitizers (and 
other tools) from grammars annotated with schema constraints so hopefully 
https://code.google.com/p/noinject/source/browse/mlsrc/test-files/san/css/gramma
r.g will soon serve as the basis for a more flexible way to sanitize CSS.  That 
grammar is very drafty and written against an obsolete version of the CSS3 
spec, but the general shape will probably remain the same.

Original comment by mikesamuel@gmail.com on 5 Apr 2013 at 4:26

GoogleCodeExporter commented 9 years ago
r198 includes a significant rewrite of the CSS sanitizer which recognizes a 
larger set of CSS properties and no longer introduces <font> elements so should 
work well with tables.

Original comment by mikesamuel@gmail.com on 24 Jul 2013 at 3:55