html.escape behaviour change in markupsafe 3.0

pallets / markupsafe

Safely add untrusted strings to HTML/XML markup.

https://markupsafe.palletsprojects.com

BSD 3-Clause "New" or "Revised" License

634 stars 156 forks source link

html.escape behaviour change in markupsafe 3.0 #481

Closed pelme closed 3 weeks ago

pelme commented 3 weeks ago

Consider this example:

import markupsafe
import html

markup = markupsafe.Markup("<div>hi</div>")

print('html.escape():', html.escape(markup))
print('markupsafe.escape():', markupsafe.escape(markup))

Output with MarkupSafe==2.1.5:

html.escape(): <div>hi</div>
markupsafe.escape(): <div>hi</div>

Output with MarkupSafe==3.0.0:

html.escape(): &amp;lt;div&amp;gt;hi&amp;lt;/div&amp;gt;
markupsafe.escape(): <div>hi</div>

I think this is a bug/regression and the 2.x behavior is what is expected. Also, the escaping does not make sense. In case it would actually be escaped, the output would should then be <div>hi</div>.

Environment:

Python version: 3.12.x.
MarkupSafe version: >=3.0.0

pelme commented 3 weeks ago

I guess this is based on html.escape just being a bunch of .replace() calls: https://github.com/python/cpython/blob/77d79989fd670633dce001877451f0dc120fbaf8/Lib/html/__init__.py#L10C1-L26C1

... and in combination with #401 the behavior changed. Not sure what is expected really.

davidism commented 3 weeks ago

It's hard to say what should be done here. Markup.replace escaping the new string makes sense to me, what if you for example did template.replace("USERNAME", username), you'd want the username value to get escaped. On the other hand, html.escape is specifically replacing with the escaped value, so MarkupSafe shouldn't double escape that value. But that's impossible to detect.

You can always convert a Markup string to a plain string, if you move to a context that doesn't understand Markup or the __html__ convention: html.escape(str(m)) will do the right thing.

davidism commented 3 weeks ago

You could also try to request that html.escape check for an __html__ method and call that if available, which is supported by Django and MarkupSafe (and probably others).

Or you could request that html.escape change its calls from s.replace("", "") to str.replace(s, "", ""). If it always calls the base string method rather than the object's type's method, then it will skip the escaping behavior of Markup.replace.

Not sure if either of these feature requests would be accpeted in python though.

davidism commented 3 weeks ago

Just stepped through html.escape(Markup("<test>")) with a debugger, turns out this wasn't doing the right thing in 2.1 either, for the exact reason in #401. Markup.replace used to escape every string argument, so Markup("<test>").replace("<", "<") was becoming "<test>".replace("<", "&lt;") and not replacing anything. That made it look like it was working, but it was actually just skipping everything unintentionally.

I'm going to say the new behavior makes more sense, it makes replace safe. There's no way to both make it work safely and not double escape when called by html.escape.

pelme commented 3 weeks ago

Thanks for the detailed response and looking into this. I agree that the new better and makes more sense.

(I got here because I missed to implement __html__() on a class that was passed to Django's conditional_escape (which respects html like markupsafe.escape as you pointed out above). Django then eventually ended up calling html.escape() on my object which worked with previous versions of markupsafe (by accident). Implementing __html__() on my class solved the problem: https://github.com/pelme/htpy/pull/65 🙂 ).

davidism commented 2 weeks ago

You could try opening a Django issue for it to do html.escape(str(v)) rather than html.escape(v), but given that there's also a supported way to fix it I'm not sure it's worth it.

pelme commented 2 weeks ago

Django sometimes does str(x).__html__() and sometimes just x.__html__() directly. I have been using this class together with various parts of Django that deals with html and never ran into this before. I guess both are "valid" uses of the "html protocol"? :/