tgalopin / html-sanitizer

Sanitize untrustworthy HTML user input
MIT License
390 stars 41 forks

Advanced filters / transforming nested node structures #39

Open fabianmichael opened 4 years ago

fabianmichael commented 4 years ago

Hi, thanks for developing this package. I’m currently working on a comments and webmention plugin and explored different possibilities for sanitizing HTML with PHP. After evaluating all available options, your project seems to be a great fit!

In my use case, I need not only to sanitize HTML but also to apply some aggressive filtering, e.g. removing all class attributes, to ensure that the HTML of comments does not interfere with the page styles of a blog. I would also like to add a few custom cleanup routines.

The most important requirements are:

  1. Make sure that every top-level node in the given HTML fragment is wrapped by a <p>, <blockquote> or other allowed top-level element.
  2. <a> elements without an href attribute should be removed.
  3. A <br> at the beginning of an inline element should be moved before the start of that element; otherwise it would break the formatting of external links, which get an icon prepended. <br> elements in the middle of the link text should be preserved. The same goes for <br> elements at the end of an inline element.
  4. <h1>Headline</h1> etc. should be transformed into something like <p><strong>Headline</strong></p> to keep some kind of formatting while preventing comments from messing with the document outline of the containing document.

As my project also handles webmentions, it has to deal with any possible kind of HTML markup, so I cannot use a strict whitelist to handle direct user input and forbid elements like e.g. headings in the first place.

These transforms are relatively easy to implement with PHP’s native DOM library, but after some hours of tinkering with html-sanitizer, I could not find a solution for these particular requirements. I understand how to transform a single node, but that’s it. Can you please give me a hint as to where I could hook into the DOM tree to do this kind of transformation, or is there maybe a more elegant way?
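For reference, points 2 and 4 might be sketched with the native DOM extension roughly as follows. This is only an illustration and not html-sanitizer code: the function name cleanCommentHtml is hypothetical, and the sketch assumes that "removing" an <a> without href means dropping the element while keeping its text.

```php
<?php
// Sketch only: plain ext-dom, independent of html-sanitizer.
// cleanCommentHtml() is a hypothetical helper name.

function cleanCommentHtml(string $html): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a full document so the charset and <body> are explicit.
    $doc->loadHTML(
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        . $html . '</body></html>',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );
    $xpath = new DOMXPath($doc);

    // Point 2 (assumption): unwrap <a> elements without href, keeping their children.
    foreach (iterator_to_array($xpath->query('//a[not(@href)]')) as $a) {
        while ($a->firstChild !== null) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Point 4: turn <h1>..<h6> into <p><strong>..</strong></p>.
    $headings = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');
    foreach (iterator_to_array($headings) as $heading) {
        $strong = $doc->createElement('strong');
        while ($heading->firstChild !== null) {
            $strong->appendChild($heading->firstChild); // moves the child node
        }
        $p = $doc->createElement('p');
        $p->appendChild($strong);
        $heading->parentNode->replaceChild($p, $heading);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}
```

Calling cleanCommentHtml('<h1>Headline</h1>') should produce markup equivalent to <p><strong>Headline</strong></p>; libxml may insert line breaks between block elements when serializing.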

tgalopin commented 4 years ago

Hello Fabian,

I started to write this message 3 days ago but didn't have time to finish before now :) .

This is an interesting use-case indeed, it's nice to think about how this library could work for you, that's usually how great improvements happen :) .

First, here is my view on your needs:

1/ By default, almost all HTML attributes are removed, so you won't need to do anything beyond the default configuration to remove class and similar attributes. html-sanitizer is meant to return an HTML-only (CSS-free, JS-free) version of your document. It removes pretty much everything, as it builds a new DOM tree in parallel while visiting the original one, importing only what it needs.

2/ Your points 2, 3 and 4 should be feasible without too much difficulty; I'll explain afterwards. In point 4, what are you referring to by preventing comments? If you are talking about HTML comments, note that this library removes them altogether for security reasons, as it's very difficult to predict browsers' behavior with comments.

3/ Your point 1 about always adding a text container is interesting: due to an optimization I made to increase performance, it's not necessarily easy to do this for now (https://github.com/tgalopin/html-sanitizer/blob/master/src/DomVisitor.php#L83). Let's discuss it once the other points are addressed.


To give you a bit of context and perhaps help you do what you need with the library: html-sanitizer is actually "mostly" a node visitor. What you do during the visit is pretty much up to you.

First, the library parses the provided content as HTML (using html5-php) to build a tree of its elements. Then, it uses the DomVisitor to recursively visit all the nodes of the tree (on each iteration, it enters a node, visits its children and then leaves the node). During this visit, by default, it creates and stores a new tree in what's called the "cursor". This new tree consists only of the safe elements of the original tree (i.e. what was intentionally imported from it). Once the visit is finished, html5-php is used again to dump the safe tree as HTML.
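To make the enter / visit children / leave cycle more concrete, here is a minimal stand-alone sketch. It is not the library's implementation: it parses with ext-dom's loadHTML instead of html5-php, ALLOWED_TAGS, copySafeTree and sanitizeSketch are hypothetical names, and dropping disallowed elements together with their subtree is an assumption made for brevity.

```php
<?php
// Sketch only: mimics "build a new safe tree while visiting the original one".
// All names here are hypothetical, not html-sanitizer API.

const ALLOWED_TAGS = ['p', 'strong', 'em', 'blockquote', 'br'];

// Visits $source recursively and appends a safe copy of its children to $cursor.
function copySafeTree(DOMNode $source, DOMElement $cursor): void
{
    foreach ($source->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text is copied as-is; it gets escaped again on serialization.
            $cursor->appendChild($cursor->ownerDocument->createTextNode($child->nodeValue));
            continue;
        }
        if (!$child instanceof DOMElement) {
            continue; // comments and other node types are dropped
        }
        if (!in_array($child->nodeName, ALLOWED_TAGS, true)) {
            continue; // assumption: disallowed elements are dropped with their subtree
        }
        // "Enter": create a bare element (no attributes) in the new tree.
        $safe = $cursor->ownerDocument->createElement($child->nodeName);
        $cursor->appendChild($safe);
        // Visit the children with the cursor moved down to the new element.
        copySafeTree($child, $safe);
        // "Leave": returning from the recursion moves the cursor back up.
    }
}

function sanitizeSketch(string $html): string
{
    $source = new DOMDocument();
    $source->loadHTML(
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        . $html . '</body></html>',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );

    // The safe tree lives in a separate document, under a throwaway root.
    $safeDoc = new DOMDocument();
    $root = $safeDoc->appendChild($safeDoc->createElement('div'));
    copySafeTree($source->getElementsByTagName('body')->item(0), $root);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $safeDoc->saveHTML($child);
    }

    return $out;
}
```

Because elements are re-created rather than cloned, attributes such as class never reach the new tree unless the code copies them explicitly.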

This behavior allows you to do pretty much whatever you want on the tree, even if some of your ideas are a bit more complex than others to implement. Everything you need to do is related to properly visiting the original tree and acting accordingly.

To start implementing something, I would recommend you to instantiate the DomVisitor yourself:

use HtmlSanitizer\DomVisitor;
use HtmlSanitizer\Parser\MastermindsParser;
$visitor = new DomVisitor([new YourNodeVisitor()]);
$visitor->visit((new MastermindsParser())->parse('html content'));

This will allow you to understand how the visitor works and how to implement a node visitor.

A NodeVisitor implements the interface HtmlSanitizer\Visitor\NodeVisitorInterface: for each node of the tree, your node visitor's supports method is called to check whether your visitor should be used, and if so, its enterNode method is called when the node is entered. The DomVisitor then visits the children, including other nodes you may support (meaning you could get multiple enterNode calls in a row, for different nodes). Once the children have been visited, leaveNode is called.

If I were you, I would start with your points 2 and 4: they are a bit easier as they manipulate the tree locally. To implement them, you can match <a> and <hX> nodes, and import the updated version into the cursor. Have a look at HtmlSanitizer\Visitor\HasChildrenNodeVisitorTrait to see how the methods are implemented for this behavior :) .
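The supports / enterNode / leaveNode flow for points 2 and 4 could be sketched as below. Every class and function in it (SketchCursor, SketchNodeVisitor, HeadingVisitor, LinkVisitor, visitSketch, transformSketch) is a hypothetical stand-in built on ext-dom, written only to mirror the flow described above; the real interface and cursor in the library have their own signatures. Unwrapping elements that no visitor supports is an assumption of this sketch.

```php
<?php
// Sketch only: hypothetical stand-ins, not html-sanitizer's real classes.

final class SketchCursor
{
    public $node; // current insertion point in the new tree
    public function __construct(DOMNode $node)
    {
        $this->node = $node;
    }
}

interface SketchNodeVisitor
{
    public function supports(DOMElement $el): bool;
    public function enterNode(DOMElement $el, SketchCursor $cursor): void;
    public function leaveNode(DOMElement $el, SketchCursor $cursor): void;
}

// Point 4: rewrite <h1>..<h6> as <p><strong>..</strong></p>.
final class HeadingVisitor implements SketchNodeVisitor
{
    public function supports(DOMElement $el): bool
    {
        return preg_match('/^h[1-6]$/', $el->nodeName) === 1;
    }
    public function enterNode(DOMElement $el, SketchCursor $cursor): void
    {
        $doc = $cursor->node->ownerDocument;
        $p = $cursor->node->appendChild($doc->createElement('p'));
        $cursor->node = $p->appendChild($doc->createElement('strong'));
    }
    public function leaveNode(DOMElement $el, SketchCursor $cursor): void
    {
        $cursor->node = $cursor->node->parentNode->parentNode; // out of <strong> and <p>
    }
}

// Point 2: import <a> only when it has an href (no URL validation in this sketch).
final class LinkVisitor implements SketchNodeVisitor
{
    public function supports(DOMElement $el): bool
    {
        return $el->nodeName === 'a' && $el->hasAttribute('href');
    }
    public function enterNode(DOMElement $el, SketchCursor $cursor): void
    {
        $a = $cursor->node->ownerDocument->createElement('a');
        $a->setAttribute('href', $el->getAttribute('href'));
        $cursor->node = $cursor->node->appendChild($a);
    }
    public function leaveNode(DOMElement $el, SketchCursor $cursor): void
    {
        $cursor->node = $cursor->node->parentNode;
    }
}

function visitSketch(DOMNode $source, SketchCursor $cursor, array $visitors): void
{
    foreach ($source->childNodes as $child) {
        if ($child instanceof DOMText) {
            $cursor->node->appendChild($cursor->node->ownerDocument->createTextNode($child->nodeValue));
        }
        if (!$child instanceof DOMElement) {
            continue; // text was handled above; comments are dropped
        }
        $entered = [];
        foreach ($visitors as $visitor) {
            if ($visitor->supports($child)) {
                $visitor->enterNode($child, $cursor);
                $entered[] = $visitor;
            }
        }
        // Unsupported elements are unwrapped: their children land at the current cursor.
        visitSketch($child, $cursor, $visitors);
        foreach (array_reverse($entered) as $visitor) {
            $visitor->leaveNode($child, $cursor);
        }
    }
}

function transformSketch(string $html): string
{
    $source = new DOMDocument();
    $source->loadHTML('<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>' . $html . '</body></html>', LIBXML_NOERROR | LIBXML_NOWARNING);
    $target = new DOMDocument();
    $root = $target->appendChild($target->createElement('div'));
    $body = $source->getElementsByTagName('body')->item(0);
    visitSketch($body, new SketchCursor($root), [new HeadingVisitor(), new LinkVisitor()]);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $target->saveHTML($child);
    }
    return $out;
}
```

Since only headings and links with an href are matched here, any other element (including an <a> without href) is unwrapped and only its text survives. A fuller setup would register further visitors for the remaining allowed tags.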

And of course, don't hesitate to ask me here if you need more help!

fabianmichael commented 4 years ago

@tgalopin Wow, thanks for your extensive answer! 🤘 To be honest, I had a working solution for my project before, based on HTML Purifier, and was mostly looking for alternatives because I’d prefer to have an MIT-licensed library at hand. I also like how your library feels a bit more up-to-date and thus more familiar in terms of code style.

For my use case, a few other issues came up in the meantime, mainly that HTML Purifier understands the difference between block and inline elements and can fix invalid nesting pretty well by default. I’ll stick with Purifier for my current use case, but I think I’ll dig a bit deeper into html-sanitizer the next time I have to deal with HTML transformations.