superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0
3.67k stars 311 forks source link

[bug] HTML sanitizer allows images #2097

Closed VyrCossont closed 1 year ago

VyrCossont commented 1 year ago

Describe the bug with a clear and concise description of what the bug is.

GtS does not sanitize received HTML aggressively enough to remove images. The GtS public web interface, and clients that don't do their own HTML sanitization (like Semaphore) will then display those images, and are thus vulnerable to user tracking.

Originally reported by @mirabilos at https://toot.mirbsd.org/@mirabilos/statuses/01H7GWS60V4J5X59AEKKRKXN5G

What's your GoToSocial Version?

GoToSocial 0.10.0 git-89ee9d5

GoToSocial Arch

all

What happened?

The HTML sanitizer GtS uses is BlueMonday's UGC policy: https://github.com/superseriousbusiness/gotosocial/blob/main/internal/text/sanitize.go#L33

This intentionally allows images through: https://github.com/microcosm-cc/bluemonday/blob/main/policies.go#L250

What you expected to happen?

There should be no <img> tags in HTML that GtS displays on the public web interface or sends to clients through the API (with the exception of GtS-inserted emoji on the public web interface).

How to reproduce it?

https://toot.mirbsd.org/@mirabilos/statuses/01H7GWS60V4J5X59AEKKRKXN5G is a test post that contains an embedded image

https://benjojo.co.uk/u/benjojo/h/QQ8qvtsM2Xxt2cnwJy is another test post on a Honk instance

Anything else we need to know?

Starting from BlueMonday's default-deny-everything strict policy and adding known-safe elements, attributes, and protocols seems preferable. See Mastodon's sanitization docs and SwiftSoup's basic safelist.

mirabilos commented 1 year ago

There should be no <img> tags in HTML that GtS displays on the public web interface or sends to clients through the API

Some other instances seem to keep the img tags but change them to links to the own instance. I’m seeing at least two variants of this:

The latter seems to be the most sensible, but downloading all images when the posts are coming in seems to be problematic as well (DDoS potential, and images may possibly be never needed). If it’s possible to defer that loading and storing to when it’ll first be needed, that seems to be the best option to me.

Other HTML elements that cause the browser to access external resources without user interaction should also be inspected (I assume/hope script is already out, and link, but I’ve not got a full overview of all tags that can do these things… object used to be used for SVG some years ago, unsure if this is still present, etc).

This is potentially a security issue (privacy exposure).

Sorry, I was going to report this tomorrow, I need a night now.

tsmethurst commented 1 year ago

Ah thanks, we'll fix this before next release.

tsmethurst commented 1 year ago

Re: your other concerns, script (among other things) would be caught by the go html templating so it wouldn't reach your browser, and SQL injection and stuff like that is shielded against by our database interface. But the above change also updates our normalization to just remove anything dodgy as soon as we get it, rather than relying on templating and clients to filter it out. And also updates our sanitizer to remove inline img tags!

mirabilos commented 1 year ago

OK, thanks.

I was wrong about Friendica, while I do get two requests from each instance, they are a HEAD and a GET, so they seem to download it.

Pleroma/Akkoma seem to merely proxy the request but it didn’t get enough views for me to be sure on that. Someone reports that they can cache-proxy if enabled.

Husky (Tusky fork) seems to drop(?) the image, only a small turquoise square shows up according to someone in that thread.

And I got…

Honk auto downloads attachments (including <img> tags) and re-hosts them (basically forever).

… along with a comment that GtS is “particularity tricky to federate with”.

This (downloading and keeping them) is something I’d love to see long-term (stripping is ok short-term, of course). Perhaps when the status is first requested, initiate the download and return the status with the /fileserver/ link where the image will eventually be, and in the fileserver, if the download is still pending, wait / keep the connection open, maybe, for a while?

tsmethurst commented 1 year ago

Yeah, that's likely what we'll end up doing, but only if the embedded image shares a domain with the account doing the embedding, or something :thinking: We'll have a think about it.