tgalopin / html-sanitizer

Sanitize untrustworthy HTML user input
MIT License
390 stars 42 forks source link

How to Sanitize with Unknown incoming HTML tags [Can't whitelist]? #44

Open rohitcoder opened 4 years ago

rohitcoder commented 4 years ago

Hi @tgalopin ,

Thanks for this great Sanitizer!

So, we are in a situation where we can't WhiteList Tags, we are using CKeditor5 and using that our user will generate any kind of content, and those content may contain variety of custom tags For example have a look at this.

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac><mn>2</mn><mn>3</mn></mfrac><mo>+</mo><msqrt><mn>43</mn>
<mo>/</mo><mn>34</mn><mo>∞</mo>
<mi mathvariant="normal">π</mi><mo>∆</mo></msqrt></math>&nbsp;<math class="wrs_chemistry" xmlns="http://www.w3.org/1998/Math/MathML">
<msubsup><mi>mol</mi><mmultiscripts><msup><mmultiscripts><maction actiontype="argument">
<mrow>&nbsp;</mrow></maction><mprescripts>&nbsp;</mprescripts>
<mroot><msqrt>&nbsp;</msqrt><mrow>&nbsp;</mrow></mroot>
<none>&nbsp;</none></mmultiscripts>
<mrow>&nbsp;</mrow></msup><mrow>&nbsp;</mrow><none>&nbsp;</none><mprescripts>&nbsp;</mprescripts><mrow>&nbsp;</mrow><mrow>&nbsp;</mrow></mmultiscripts><mrow>&nbsp;</mrow></msubsup>
</math></p><p>&nbsp;</p><pre><code class="language-php">&lt;?php
echo "Hello world";
?&gt;</code></pre>

Now, I want to store this whole text in my DB, but i don't want any incoming XSS/SQLi scripts or tags. How it can be done? I was going through internal codebase of this project, and it seems i can add my own tags or i can introduce my own Custom Extension, but i would need to Whitelist tags, and attributes. How it can be done without whitelisting such tags?

rohitcoder commented 4 years ago

Is there any way to whitelist around 100 tags and 200 attributes in bulk without creating extension for that?

I want to whitelist these all tags

['abs', 'and', 'annotation', 'annotation-xml', 'apply', 'approx', 'arccos', 'arccosh', 'arccot', 'arccoth', 'arccsc', 'arccsch', 'arcsec', 'arcsech', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg', 'bind', 'bvar', 'card', 'cartesianproduct', 'cbytes', 'ceiling', 'cerror', 'ci', 'cn', 'codomain', 'complexes', 'compose', 'condition', 'conjugate', 'cos', 'cosh', 'cot', 'coth', 'cs', 'csc', 'csch', 'csymbol', 'curl', 'declare', 'degree', 'determinant', 'diff', 'divergence', 'divide', 'domain', 'domainofapplication', 'emptyset', 'eq', 'equivalent', 'eulergamma', 'exists', 'exp', 'exponentiale', 'factorial', 'factorof', 'false', 'floor', 'fn', 'forall', 'gcd', 'geq', 'grad', 'gt', 'ident', 'image', 'imaginary', 'imaginaryi', 'implies', 'in', 'infinity', 'int', 'integers', 'intersect', 'interval', 'inverse', 'lambda', 'laplacian', 'lcm', 'leq', 'limit', 'list', 'ln', 'log', 'logbase', 'lowlimit', 'lt', 'maction', 'maligngroup', 'malignmark', 'matrix', 'matrixrow', 'max', 'mean', 'median', 'menclose', 'merror', 'mfenced', 'mfrac', 'mglyph', 'mi', 'min', 'minus', 'mlabeledtr', 'mlongdiv', 'mmultiscripts', 'mn', 'mo', 'mode', 'moment', 'momentabout', 'mover', 'mpadded', 'mphantom', 'mprescripts', 'mroot', 'mrow', 'ms', 'mscarries', 'mscarry', 'msgroup', 'msline', 'mspace', 'msqrt', 'msrow', 'mstack', 'mstyle', 'msub', 'msubsup', 'msup', 'mtable', 'mtd', 'mtext', 'mtr', 'munder', 'munderover', 'naturalnumbers', 'neq', 'none', 'not', 'notanumber', 'notin', 'notprsubset', 'notsubset', 'or', 'otherwise', 'outerproduct', 'partialdiff', 'pi', 'piece', 'piecewise', 'plus', 'power', 'primes', 'product', 'prsubset', 'quotient', 'rationals', 'real', 'reals', 'reln', 'rem', 'root', 'scalarproduct', 'sdev', 'sec', 'sech', 'selector', 'semantics', 'sep', 'set', 'setdiff', 'share', 'sin', 'sinh', 'subset', 'sum', 'tan', 'tanh', 'tendsto', 'times', 'transpose', 'true', 'union', 'uplimit', 'variance', 'vector', 'vectorproduct', 'xor'];

and these all attributes

 ['columnalign', 'accentunder', /*'src',*/ 'subscriptshift', 'infixlinebreakstyle', 'mslinethickness', 'close', 'rightoverhang', 'longdivstyle', 'linebreak', 'bevelled', 'overflow', 'xml:lang', 'leftoverhang', 'columnwidth', 'equalcolumns', 'id', 'fontfamily', 'separators', 'minlabelspacing', 'scriptlevel', 'height', 'occurrence', 'stackalign', 'color', 'cdgroup', 'veryverythickmathspace', 'rowspacing', 'name', 'other', 'order', 'macros', 'veryverythinmathspace', 'notation', 'columnspan', 'fence', 'valign', 'maxsize', 'indentshiftfirst', 'lspace', 'lquote', 'position', 'crossout', 'equalrows', 'altimg-height', 'voffset', 'dir', 'frame', 'denomalign', '%XLINK.prefix;:href', 'actiontype', 'mode', 'display', 'linethickness', 'maxwidth', 'length', 'columnlines', 'movablelimits', 'lineleading', 'scriptsizemultiplier', 'linebreakstyle', 'charalign', 'charspacing', /*'alt',*/ /*'href',*/ 'rquote', 'altimg', 'verythinmathspace', 'rowlines', 'accent', 'groupalign', 'separator', 'mathbackground', 'nargs', 'indenttarget', 'verythickmathspace', 'mathsize', 'symmetric', 'edge', 'open', 'side', 'thinmathspace', 'fontstyle', 'encoding', 'selection', 'columnspacing', 'decimalpoint', /*'style',*/ 'stretchy', 'cd', 'scriptminsize', 'width', 'indentalignfirst', 'shift', 'index', 'linebreakmultchar', 'xml:space', 'scope', 'largeop', 'alttext', 'altimg-valign', 'base', 'closure', 'minsize', 'indentalign', 'framespacing', 'definitionURL', 'rspace', 'numalign', 'fontweight', 'class', 'rowalign', 'form', 'alignmentscope', 'align', '%XLINK.prefix;:type', 'depth', 'fontsize', 'type', 'background', 'displaystyle', 'superscriptshift', 'mediummathspace', 'rowspan', 'indentshiftlast', 'location', 'xref', 'altimg-width', 'thickmathspace', 'indentalignlast', 'mathcolor', 'indentshift', 'mathvariant', 'xmlns'];

Specifically i am looking for support for this plugin https://github.com/wiris/ckeditor5-mathml/

tgalopin commented 4 years ago

Hello @rohitcoder !

This library works on a whiteliste basis, meaning that you will have to enable every tag and attribute that you want to keep, either using configuration or extensions.

If you wish to be very generic, you can use something like http://htmlpurifier.org which is more open than HtmlSanitizer. Note however that using htmlpurifier with default options means you may receive HTML/CSS that could break your display if not handled properly (position absolute, large images, ...).

HtmlSanitizer is very useful when you want to allow only a specific list of tags and attributes and ensure you won't get anything else in the output. This allows you to be safe regarding security, but also to exhaustively handle all the possible tags in your CSS.

In your specific use-case, I think you can create an extension that would list all the tags you have an associate them to a dynamic node. Something like:

class MathMlExtension implements ExtensionInterface
{
    public function getName(): string
    {
        return 'mathml';
    }

    public function createNodeVisitors(array $config = []): array
    {
        // Iterate over all the MathML tags to register them here, for instance using 
        // only 2 dynamic visitors: one for tags with children, the other for tags without.
        $tags = [];
        foreach (self::$tagNames as $name) {
            $tags[$name] = new MyVisitor($name);
        }

        return $tags;
    }
}
rohitcoder commented 4 years ago

Hi @tgalopin

I am working on getting it done on my side (Getting some problems, and there is little bit typo and multiple confusions in docs "Create Extensions"), I would suggest you to implement this "Mathml" Extension as optional extension by default in this Sanitizer.

Maintainers of MathMl team were looking for some good sanitizers few years back but i think still, they don't have any good working sanitizer you can have a look at this thread https://github.com/ezyang/htmlpurifier/pull/200

It would be really good, if you implement this as optional like other Available extensions, MathMl is very famous tool and giving add-on support to that tool will be a great idea for this sanitizer.

rohitcoder commented 4 years ago

Hi @tgalopin,

I forked this Repo, and added "MathMl" Extension let me know, if you are happy with pull request? You can see my commit here https://github.com/rohitcoder/html-sanitizer/commit/602ed313026001238f5616d44882765fe8068a3d (Really big commit..).

If you think there is any security issue with my commit, let me know :) , I allowed almost all custom attributes on each custom tags which is used by MathMl, but i also removed "xlink:href" custom attribute from all custom tags because this is prone to XSS as someone reported in H1 https://hackerone.com/reports/502926.

However, i added sanitizer to mactionNodeVisitor because it's known it is used to put some hyperlinks there, so now it can be sanitized and for rest of all tags this attribute is removed by default. You can see this here - https://github.com/rohitcoder/html-sanitizer/blob/d41afa006d563359d075b4745e6b8350e220508e/src/Extension/MathMl/NodeVisitor/mactionNodeVisitor.php

For, now i am adding details in my Repo that this is fork is available with "MathMl" it would be good, if you also make this available from your own repo, as i mentioned early it is one of the most famous math rendering tool in WYSWYG editors.

BTW, i also mentioned your docs are little bit confusing and there is one typo in https://github.com/tgalopin/html-sanitizer/edit/master/docs/2-creating-an-extension-to-allow-custom-tags.md (LINE 136) i think you have typo there instead of } it should be ]

$sanitizer = $builder->build([
    'extensions' => ['basic', 'list', 'my-tag'],
} <== Here );

Again thanks for this awesome sanitizer!

tgalopin commented 4 years ago

Thanks!

Don't hesitate to open a PR, I'll have a look!

Also: you can open a PR for the doc typo, I'd be happy to merge it!