mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.52k stars 198 forks source link

Sanitizing SVG tag contents against XXS/XXE #409

Open ImrePyhvel opened 1 year ago

ImrePyhvel commented 1 year ago

SVG images could be embedded inside html and include XSS similar to html, ex:

<svg xmlns=“http://www.w3.org/1999/svg“>
    <script>alert(1)</script>
</svg>

.. and many other nuisances (ex, see: The Image that called me.pdf, slide 11. Then there are also XXE attacks via <!ENTITY> declarations ,ex "Billion Laughs". Just like by spec HTML, while some SVG content may be harmful, it can be a legitimate part of HTML content, if safe. In my case, a required feature.

I'd imagine HtmlSanitizer would be able to traverse and clean up the SVG tag content just as well as HTML, but would require a significantly different configuration regarding safe elements/attributes compared to the default conf. Could HtmlSanitizer be used for SVG? Are there any known blockers?

Since SVG contains a ton of safe tags and attributes, it would be cool to have (optional) by-SVG-standard configuration preset for SVG (standalone or inside HTML)? A shared implementation would be more secure compared to each rolling their own.

This seems to chime in with the talk on presets in #350 .

tiesont commented 1 year ago

Could HtmlSanitizer be used for SVG? Are there any known blockers?

I think the answer to both of those questions are: it depends on AngleSharp. AngleSharp does the actual markup parsing, so if it can handle XML that isn't XHTML, I'd imagine it wouldn't take much to wrangle it into parsing SVG.

In fact, it does look like AngleSharp supports SVG processing: https://github.com/AngleSharp/AngleSharp/issues/886

So something like this example might work for you: https://github.com/mganss/HtmlSanitizer/wiki/Examples#ex3-replacing-the-default-formatter - this assumes you want parse the SVG separately from anything else.

mganss commented 1 year ago

AFAICT this use case is already possible as discussed in #287.

@ImrePyhvel Do you think there currently is potential for malicious markup to slip through in the default configuration? Do you have examples?

Re presets: Are there known allow lists for SVG elements? How do other sanitizers handle this?

ImrePyhvel commented 1 year ago

Yes, #287 hints the use case is possible, though the path of blindly trust the entire SVG tag content is not ok for me. Explicitly configuring allowedTags and allowedAttributes with XHtmlFormatter trick should be doable but requires enough knowledge about SVG and possible threats. Haven't had time yet to test blocking XXE and Billion-laughs-likes, though.

@mganss AFAIK the current default HtmlSanitizer configuration seems perfectly safe as it is stripping it out everything from svg root. this also prevents hopping into SVG XML-like behavior inside HTML.

I'm not too familiar with SVG or SVG-based attacks, which is why I'm reluctant to build an SVG whitelist myself and would prefer to find some community-verified whitelist. Also, I have a feeling the possible HTML-in-XML-in-HTML mess can cause all kinds of funky cases that a simple tag/attribute whitelist may require cutting big SVG features just to be on the safe side.

The key would be to get a comprehensive list of threat examples to sanitize against. Some things to consider:

Regarding existing tools and whitelists, I found surprisingly little. More promising leads:

ImrePyhvel commented 1 year ago

Did some testing..

Testfile

<?xml version="1.0" encoding="UTF-8"?>
<!--comment -->
<!DOCTYPE text [
    <!ENTITY xxe SYSTEM "file:///etc/hostname">
    <!ENTITY % data SYSTEM "php://filter/convert.base64-encode/resource=/etc/hostname">
    <!ENTITY % param1 SYSTEM "ftp://example.org:2121/">
    <!ENTITY lol1 "laugh">
    <!ENTITY lol2 "&lol1;&lol1;&lol1;">
    <!ENTITY lol "&lol2;&lol2;&lol2;">
]>  
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ev="http://www.w3.org/2001/xmlevents" onload="alert('loadSvg')">
    <g fill="white" stroke="green" stroke-width="5" onload="alert(0)">
        <circle cx="40" cy="40" r="25" />
        <circle cx="60" cy="60" r="25" />
    </g>

    <text font-size="16" x ="20" y="20">&lol;</text>
    <rect width="10" height="100">
        <animate xlink:href="alert(1)" attributeName="rx" values="0;5;0" dur="10s" repeatCount="indefinite"/>
        <set attributeName="onmouseover" to="alert(3)"/>
    </rect>
    <foreignObject xlink:href="alert(2)">
        <b> some html </b>
    </foreignObject>    
    <handler ev:event="load">alert(4)</handler>

    <script>
        alert('script')
    </script>

    <![CDATA[<SCRIPT>alert(5)</SCRIPT>]]>

    <style>
        svg { background-url : "javascript:alert(css)"; }
        <SCRIPT>alert(style)</SCRIPT>
    </style>
</svg>

I could not trigger the xlink:href and many other shown alerts as XSS, but retained them in sample file just in case. They should be removed to be on the safe side.

The code

Sanitizing code using whitelist from https://github.com/darylldoyle/svg-sanitizer:

var sanitizer = new HtmlSanitizer(
    allowedTags: AllowedSvgTags,
    allowedAttributes: AllowedSvgAttributes,
    allowedCssProperties: HtmlSanitizer.DefaultAllowedCssProperties,
    allowedSchemes: Enumerable.Empty<String>()
 );
var formatter = AngleSharp.Xhtml.XhtmlMarkupFormatter.Instance;
var sanitizedSvg = sanitizer.Sanitize(svg, outputFormatter: formatter);

The result

I removed manually a lot of empty whitespace left behind for brevity:

]&gt;
<svg xmlns="http://www.w3.org/2000/svg">
    <g fill="white" stroke="green" stroke-width="5">
        <circle cx="40" cy="40" r="25" />
        <circle cx="60" cy="60" r="25" />
    </g>

    <text font-size="16" x="20" y="20">&amp;lol;</text>
    <rect width="10" height="100">
    </rect>
    &lt;SCRIPT&gt;alert(5)&lt;/SCRIPT&gt;
</svg>

This works ok-ish, but it treats the input as html and hence:

I could remove <!doctype> cleanly using a preprocessing step, but that is not ideal as it causes to parse the file twice. For the sake of anti-XXE, I think it is ok to lose doctype entirely from SVG file, though it would be preferable to lose just <!Entity> elements. Could be improved if this starts to matter..

Also saw that Anglesharp.XML can also parse simpler SVG, but for some reason it choked with parser error when doctype was included. Anyway, it should not matter as there does not seem to be a way to reuse the already parsed ISvgDocument/IDocument for HtmlSanitizer.

To sum it up

it kinda works and checking with some real world samples some files will be hit by stripped features implemented as xml extensions in separate xmlns, including

Whitelisting some tags/attributes from xmlns extensions is trickier as xmlns could be set up with any alias name + it is difficult to assess if/which extension tags could be considered unsafe in their reader. The scope would become a mess easily.