Sanitizing SVG tag contents against XXS/XXE

SVG images could be embedded inside html and include XSS similar to html, ex:

<svg xmlns=“http://www.w3.org/1999/svg“>
    <script>alert(1)</script>
</svg>

.. and many other nuisances (ex, see: The Image that called me.pdf, slide 11. Then there are also XXE attacks via <!ENTITY> declarations ,ex "Billion Laughs". Just like by spec HTML, while some SVG content may be harmful, it can be a legitimate part of HTML content, if safe. In my case, a required feature.

I'd imagine HtmlSanitizer would be able to traverse and clean up the SVG tag content just as well as HTML, but would require a significantly different configuration regarding safe elements/attributes compared to the default conf. Could HtmlSanitizer be used for SVG? Are there any known blockers?

Since SVG contains a ton of safe tags and attributes, it would be cool to have (optional) by-SVG-standard configuration preset for SVG (standalone or inside HTML)? A shared implementation would be more secure compared to each rolling their own.

This seems to chime in with the talk on presets in #350 .

Could HtmlSanitizer be used for SVG? Are there any known blockers?

I think the answer to both of those questions are: it depends on AngleSharp. AngleSharp does the actual markup parsing, so if it can handle XML that isn't XHTML, I'd imagine it wouldn't take much to wrangle it into parsing SVG.

In fact, it does look like AngleSharp supports SVG processing: https://github.com/AngleSharp/AngleSharp/issues/886

So something like this example might work for you: https://github.com/mganss/HtmlSanitizer/wiki/Examples#ex3-replacing-the-default-formatter - this assumes you want parse the SVG separately from anything else.

AFAICT this use case is already possible as discussed in #287.

@ImrePyhvel Do you think there currently is potential for malicious markup to slip through in the default configuration? Do you have examples?

Re presets: Are there known allow lists for SVG elements? How do other sanitizers handle this?

Yes, #287 hints the use case is possible, though the path of blindly trust the entire SVG tag content is not ok for me. Explicitly configuring allowedTags and allowedAttributes with XHtmlFormatter trick should be doable but requires enough knowledge about SVG and possible threats. Haven't had time yet to test blocking XXE and Billion-laughs-likes, though.

@mganss AFAIK the current default HtmlSanitizer configuration seems perfectly safe as it is stripping it out everything from svg root. this also prevents hopping into SVG XML-like behavior inside HTML.

I'm not too familiar with SVG or SVG-based attacks, which is why I'm reluctant to build an SVG whitelist myself and would prefer to find some community-verified whitelist. Also, I have a feeling the possible HTML-in-XML-in-HTML mess can cause all kinds of funky cases that a simple tag/attribute whitelist may require cutting big SVG features just to be on the safe side.

The key would be to get a comprehensive list of threat examples to sanitize against. Some things to consider:

SVG standalone files vs inline SVG in HTML - seem to have different threat-sets.
It's more than simple XSS, for example as described in Anatomy of Scalable Vector Graphics (SVG) Attack Surface on the Web.
Some XSS samples to watch out for can be found by Ctrl-F SVG from Payloadallthethings XSS list.
SVG with CSS? - it would be safer to strip all external content.
https://www.blackhat.com/docs/us-14/materials/us-14-DeGraaf-SVG-Exploiting-Browsers-Without-Image-Parsing-Bugs.pdf
The Image that called me is old and browser behavior most likely have changed since, but examples shown inside do highlight that sanitizing SVG requires more than just stripping the obvious <script> tags.

Regarding existing tools and whitelists, I found surprisingly little. More promising leads:

https://heideri.ch/svgpurifier/SVGPurifier/index.php - no source code/whitelist. but could be obtained from author perhaps.
https://github.com/darylldoyle/svg-sanitizer - in PHP, but does provide code and some tag/attribute whitelist which could be used as a starting point.

Did some testing..

Testfile

<?xml version="1.0" encoding="UTF-8"?>
<!--comment -->
<!DOCTYPE text [
    <!ENTITY xxe SYSTEM "file:///etc/hostname">
    <!ENTITY % data SYSTEM "php://filter/convert.base64-encode/resource=/etc/hostname">
    <!ENTITY % param1 SYSTEM "ftp://example.org:2121/">
    <!ENTITY lol1 "laugh">
    <!ENTITY lol2 "&lol1;&lol1;&lol1;">
    <!ENTITY lol "&lol2;&lol2;&lol2;">
]>  
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ev="http://www.w3.org/2001/xmlevents" onload="alert('loadSvg')">
    <g fill="white" stroke="green" stroke-width="5" onload="alert(0)">
        <circle cx="40" cy="40" r="25" />
        <circle cx="60" cy="60" r="25" />
    </g>

    <text font-size="16" x ="20" y="20">&lol;</text>
    <rect width="10" height="100">
        <animate xlink:href="alert(1)" attributeName="rx" values="0;5;0" dur="10s" repeatCount="indefinite"/>
        <set attributeName="onmouseover" to="alert(3)"/>
    </rect>
    <foreignObject xlink:href="alert(2)">
        <b> some html </b>
    </foreignObject>    
    <handler ev:event="load">alert(4)</handler>

    <script>
        alert('script')
    </script>

    <![CDATA[<SCRIPT>alert(5)</SCRIPT>]]>

    <style>
        svg { background-url : "javascript:alert(css)"; }
        <SCRIPT>alert(style)</SCRIPT>
    </style>
</svg>

I could not trigger the xlink:href and many other shown alerts as XSS, but retained them in sample file just in case. They should be removed to be on the safe side.

The code

Sanitizing code using whitelist from https://github.com/darylldoyle/svg-sanitizer:

var sanitizer = new HtmlSanitizer(
    allowedTags: AllowedSvgTags,
    allowedAttributes: AllowedSvgAttributes,
    allowedCssProperties: HtmlSanitizer.DefaultAllowedCssProperties,
    allowedSchemes: Enumerable.Empty<String>()
 );
var formatter = AngleSharp.Xhtml.XhtmlMarkupFormatter.Instance;
var sanitizedSvg = sanitizer.Sanitize(svg, outputFormatter: formatter);

The result

I removed manually a lot of empty whitespace left behind for brevity:

]&gt;
<svg xmlns="http://www.w3.org/2000/svg">
    <g fill="white" stroke="green" stroke-width="5">
        <circle cx="40" cy="40" r="25" />
        <circle cx="60" cy="60" r="25" />
    </g>

    <text font-size="16" x="20" y="20">&amp;lol;</text>
    <rect width="10" height="100">
    </rect>
    &lt;SCRIPT&gt;alert(5)&lt;/SCRIPT&gt;
</svg>

This works ok-ish, but it treats the input as html and hence:

strips the <?xml .... Which is not a big deal but also not ideal.
does strip entire <!DOCTYPE ... >, including custom entities within, but keeps some ugly leftovers like ]> in output.
leaves behind a lot of whitespace around removed elements - does not matter much.

I could remove <!doctype> cleanly using a preprocessing step, but that is not ideal as it causes to parse the file twice. For the sake of anti-XXE, I think it is ok to lose doctype entirely from SVG file, though it would be preferable to lose just <!Entity> elements. Could be improved if this starts to matter..

Also saw that Anglesharp.XML can also parse simpler SVG, but for some reason it choked with parser error when doctype was included. Anyway, it should not matter as there does not seem to be a way to reuse the already parsed ISvgDocument/IDocument for HtmlSanitizer.

To sum it up

it kinda works and checking with some real world samples some files will be hit by stripped features implemented as xml extensions in separate xmlns, including

metadata (rdf)
licensing (cc, copyright may also be in comments)
embedded raster images (ex: <image xlink:href="data:image/png;base64, ... "/>

Whitelisting some tags/attributes from xmlns extensions is trickier as xmlns could be set up with any alias name + it is difficult to assess if/which extension tags could be considered unsafe in their reader. The scope would become a mess easily.

mganss / HtmlSanitizer