thesephist / zone

A URL shortener / note sharing service.
https://linus.zone
MIT License
25 stars 12 forks source link

Sanitize potentially dangerous HTML in markdown notes #4

Closed thesephist closed 5 years ago

thesephist commented 5 years ago

Add a check to attempt to sanitize notes that contain certain dangerous HTML tags.

This isn't a perfect sanitizer. Notably, it fails to correctly parse more nuanced HTML, like:

<script>console.log('hi </script>');</script>

... which the sanitizer will convert to ');</script>, because it thinks the first occurrence of </script> is closing the tag, when it's a part of JS. But I'm not sure how to get around these edge cases without creating a full-blown parser for HTML/JS/CSS syntax.

On the bright side, the algorithm in the PR errs on the safe side and mangles these dangerous tags even if the sanitization result is not perfect, since all dangerous opening tags are removed, even if they're potentially closed early. So I feel ok with the imperfect algorithm, since I don't really care if dangerous HTML is rendered beautifully.

I also considered using third party HTML sanitizers, like sanitize-html, but even most other lightweight sanitizers seem to fail on the above example. This would be a good thing to add unit tests for, so I might do that if I get some time....

This PR aims to resolve #2.