[bug] `&` not HTML escaped to `&` in Markdown-parsed HTML

superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.

https://docs.gotosocial.org

GNU Affero General Public License v3.0

3.57k stars 300 forks source link

[bug] `&` not HTML escaped to `&` in Markdown-parsed HTML #2612

Open tsmethurst opened 5 months ago

tsmethurst commented 5 months ago

As the title says, when we parse something from markdown we don't end up html-escaping any ampersands to & in resulting HTML.

We should try to squeeze this in to our markdown-to-html parsing and sanitization flow somewhere, without totally mangling any existing escapes, as not escaping the ampersand may cause issues with some HTML parsers (haven't seen any yet but you know how it is).

Credit to @mirabilos for noticing this: https://github.com/superseriousbusiness/gotosocial/issues/2610#issuecomment-1930525004

mirabilos commented 5 months ago

OK, this does affect normal tooting as well: test & test & test outside of accent-gravis code blocks becomes test & test & test

It has an even worse impact: the HTML for that (in the database) is <p>test & test & test</p>, so the & from Markdown (which should have become &amp;) actually became just &.

tsmethurst commented 5 months ago

Tbh, including & in your text is not really a "normal" toot, but yes. We'll fix this after 0.14.0 is out probably. Unless I'm misremembering, it's not really a trivial quick fix because of the order in which we need to handle sanitization, minimization, etc, so it will require a bit of time and focus to do it properly.

mirabilos commented 5 months ago

tobi dixit:

Tbh, including & in your text is not really a "normal" toot

If I want to write exactly “&” then yes, kinda ;-)

But conversion to ‘&’ is definitely wrong; if anything, it should have become ‘&’ though I would have expected ‘&’ but someone should probably check what CommonMark says.

We'll fix this after 0.14.0 is out probably.

OK.

I checked if I could trigger JavaScript from it, but I think the sanitiser runs afterwards and thus catches it, so it’s “only” an unterminated entity (I have seen browsers cut off at the ampersand until the next semicolon somewhere in the file or even until EOF if there was none, but not sure if there are any Fedi clients that would do that).

bye, //mirabilos

tsmethurst commented 5 months ago

think the sanitiser runs afterwards and thus catches it

It does yeah.

superseriousbusiness / gotosocial

[bug] `&` not HTML escaped to `&amp;` in Markdown-parsed HTML #2612

[bug] `&` not HTML escaped to `&` in Markdown-parsed HTML #2612