microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.12k stars 176 forks source link

Data URIs with whitespaces #122

Closed KN4CK3R closed 3 years ago

KN4CK3R commented 3 years ago

Reason for this issue: for example Jupyter nbconvert creates "invalid" image data uris.

With #51 / #52 the data uri was permitted to contain linebreaks. With 95fbd036 Go url.Parse changed its behaviour to disallow control characters. You "fixed" the test case which tests for whitespaces in data uris (now the test tests the same thing twice). The W3C validator rejects a data uri which contains linebreaks but every browser accepts them.

The MDN states that

If the data contains characters defined in RFC 3986 as reserved characters, or contains space characters, newline characters, or other non-printing characters, those characters must be percent-encoded (aka “URL-encoded”).

It would be nice if validURL(rawurl string) would remove/escape whitespaces in data uris.

You could argue that

It is not the job of bluemonday to fix your bad HTML

but https://github.com/microcosm-cc/bluemonday/blob/cb614698bfbcbdca4b8e177e17b9a9aaf65109d7/sanitize.go#L852 already removes some whitespaces so it could just remove some more.

buro9 commented 3 years ago

I see your point but am hesitant. How should I prevent double-encoding if something was already partially encoded? A client that fails to encode spaces might have already encoded other entities, and perhaps it is intentional that some entities are encoded twice (i.e. a data-uri that contains HTML that itself contains data-uris) and how would I handle that?

Trimming whitespace in invalid contexts is trivial, but getting escaping right without knowing the intent of the server is hard.

Removing white space within content is problematic as it can affect layout. HTML is permitted in data-uris and HTML is whitespace sensitive (for at least a single whitespace, but within some tags all whitespace is valid).

If you've got a PR that shows the right way to do this then I'll seriously consider it, otherwise the risks of breaking good input outweigh in my view the benefit of fixing this for scenarios where a server creates broken issues.

And you've quoted "invalid" and "broken"... but now I don't know whether you mean Jupyter is doing something wrong, or you're being sarcastic, or whether there is an issue with this library, or with a client that uses nbconvert, or with something else 🤷‍♀️. Test cases and examples are more useful here.

KN4CK3R commented 3 years ago

I created a PR which strips linebreaks from base64 encoded data. It should be safe in this case and fixes the image errors.

buro9 commented 3 years ago

Thanks, merged it.

You kept it simple and avoided all of the implications for other data-uri content... so I'm happy with that and the contribution is welcomed.