microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License

Paragraph sanitization (e.g. img.alt) is too restrictive, disallows punctuation #158

Open palant opened 1 year ago

palant commented 1 year ago

This regexp is used to validate the alt text of images. It disallows common punctuation, which causes problems when alt text is copied from news articles or source code listings, for example. As a result, the alt attribute is dropped, which makes the image inaccessible to visually impaired people. The author of the text is unlikely to even notice, because the rendered result looks fine.

Here is a subset of common symbols (some used in non-English languages) that this regular expression currently forbids: "„“”‘’«»#$§%‰&*+±–—:;=?‽¡¿@{}|~…°®™.

I’m not sure I understand the purpose of restricting to a specific character set here, as opposed to properly escaping special characters (which I believe bluemonday does automatically). Is the concern that the contents of the alt or title attribute might be taken as the HTML source of some pop-up? Wouldn’t it make more sense to blacklist only angle brackets then?
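
To make the comparison concrete, here is a minimal sketch using only the Go standard library. It is not bluemonday's code: the `allowlist` pattern is a hypothetical, deliberately narrow stand-in rather than the library's actual regexp, and the helper names `acceptedByAllowlist`, `acceptedByDenylist` and `escapeAttr` are made up for illustration. It contrasts validating against a fixed character set with rejecting only angle brackets, and shows that escaping alone already neutralizes markup characters in an attribute value.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// allowlist is a HYPOTHETICAL narrow pattern for illustration only.
// It is not copied from bluemonday and may differ from its real regexp.
var allowlist = regexp.MustCompile(`^[\p{L}\p{N}\s\-_',!./()]*$`)

// acceptedByAllowlist reports whether every character of s is in the
// allowed set. A single colon, quote or percent sign rejects the whole text.
func acceptedByAllowlist(s string) bool {
	return allowlist.MatchString(s)
}

// acceptedByDenylist rejects a text only if it contains angle brackets.
func acceptedByDenylist(s string) bool {
	return !strings.ContainsAny(s, "<>")
}

// escapeAttr escapes <, >, &, ' and " so the text can be placed inside a
// quoted HTML attribute value.
func escapeAttr(s string) string {
	return html.EscapeString(s)
}

func main() {
	samples := []string{
		"A cat sitting on a mat",
		`Headline: "Prices up 5%" – what now?`,
		"<script>alert(1)</script>",
	}
	for _, s := range samples {
		fmt.Printf("%q\n  allowlist: %v, denylist: %v\n  escaped: %s\n",
			s, acceptedByAllowlist(s), acceptedByDenylist(s), escapeAttr(s))
	}
}
```

Under these assumptions, the headline sample is rejected by the character-set check but accepted when only angle brackets are forbidden, and the escaped form of the script sample contains no literal `<` or `>`.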