mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

bug: bleach truncates Katex style attributes #676

Closed nguiard closed 1 year ago

nguiard commented 2 years ago

Bleach truncates a lot of Katex style attributes

Basic example: markdown_katex output may contain a span such as <span class="vlist" style="height:1.0697em;">. It has a style attribute, and when I pass it through bleach (with the style attribute allowed), I get this:

<span class="vlist" style=""></span>

While the desired output would be:

<span class="vlist" style="height:1.0697em;"></span>

As a result, actual Katex math doesn't render properly.

python and bleach versions:

To Reproduce

Steps to reproduce the behavior:

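from bleach import Cleaner

# Allow only <span>, with its class and style attributes.
cleaner = Cleaner(
    tags=["span"],
    attributes={"span": ["class", "style"]},
)

# Minimal span taken from markdown_katex output.
minimal_katex_span = '<span class="vlist" style="height:1.0697em;">'
res = cleaner.clean(minimal_katex_span)
print(res)  # the style value comes back empty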

Additional context

I am unsure if this is actually a bug or intended behavior in some way. The more general problem I face is: how to correctly use bleach after user input is transformed through markdown with the markdown_katex extension?

willkg commented 2 years ago

Did you install the css extras?

https://bleach.readthedocs.io/en/latest/clean.html#sanitizing-css

nguiard commented 2 years ago

Oh sorry I didn't. It's probably just that. I'll do that and reopen if needed. Thanks!

willkg commented 2 years ago

Can I get some help with this? The thing you're hitting is this:

https://github.com/mozilla/bleach/blob/6cd4d527a6b43569c1e1490e632500199b1efb6c/bleach/sanitizer.py#L555-L561

Would it have helped if Bleach had emitted a Python warning because you've got "style" as an allowed attribute, but hadn't specified a css_sanitizer? If not that, should it throw an exception? I'm pretty sure the situation is an indication of a mistake and a developer would want to know and not have the problem you just had. I can't think of a case where you'd want to be in that situation (specifying style as allowed, but don't want to have the css sanitized), but I didn't know if I was lacking imagination or not. What do you think?

nguiard commented 2 years ago

Sure! So, first of all, installing and using the css extras fixed my issue.

But as you suggested, I do think it would have been very helpful to get a Python warning or error about that. Being a bit new to bleach and just wanting to adjust my previous basic bleaching to allow for katex markup, I looked at the docs and the issues here, but did not realize at first that the css extras would be relevant. I saw the css_sanitizer option in Cleaner, but I assumed a value of None meant the css would be passed through without being parsed or sanitized.

I think it's not crazy to think that at first (after all, it feels natural that "None" sanitizer would sanitize nothing), even though I understand that not sanitizing the css would rarely be the correct call.

willkg commented 2 years ago

I'm going to re-open this to cover two changes:

  1. Add a note to the clean docs about how if you're allowing the style attribute, you should also set a css_sanitizer otherwise the style value will be truncated.
  2. Change the code to emit a Python warning when style is allowed, but the css_sanitizer is not set.
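
The second change could look something like the sketch below. It is not Bleach's actual implementation: `style_is_allowed` and `warn_if_style_unsanitized` are hypothetical helper names, and the sketch only inspects the list and dict forms of the `attributes` argument (a callable allow list cannot be checked ahead of time).

```python
import warnings


def style_is_allowed(attributes):
    """Return True if a list or dict attribute allow list permits "style"."""
    if callable(attributes):
        return False  # a callable allow list cannot be inspected statically
    if isinstance(attributes, dict):
        # Dict values may themselves be callables; skip those.
        return any(
            not callable(allowed) and "style" in allowed
            for allowed in attributes.values()
        )
    return "style" in attributes


def warn_if_style_unsanitized(attributes, css_sanitizer=None):
    """Warn when style is allowed but no CSS sanitizer was supplied."""
    if css_sanitizer is None and style_is_allowed(attributes):
        warnings.warn(
            "'style' is an allowed attribute but css_sanitizer is not set; "
            "style values will be emptied",
            category=UserWarning,
            stacklevel=2,
        )


# Hypothetical usage: this configuration triggers the warning.
warn_if_style_unsanitized({"span": ["class", "style"]})
```

A warning rather than an exception keeps existing configurations working while still surfacing the likely mistake.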

nguiard commented 2 years ago

Related to this is the question of what tags and styles we should allow for Katex, as it is not necessarily trivial to get the complete list.

And more generally, say you trust a plugin's output in theory (I'm not saying I trust Katex output specifically). If that plugin uses a lot of tags, you end up allowing a lot of tags you wouldn't normally have allowed. The allowed-tags approach seems somewhat flawed in that case. I don't know if there is a better way to handle these cases, like maybe treating parts separately...
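
On the first point, one way to get a starting list is to inventory what a sample of rendered output actually uses. The sketch below uses only the standard library; `MarkupInventory` and `inventory` are hypothetical names, the CSS splitting is deliberately naive, and the result only covers the samples you feed it, so it is a candidate allow list to review rather than a complete one. Note that `html.parser` lowercases tag and attribute names.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MarkupInventory(HTMLParser):
    """Record the tags, attributes, and inline CSS properties seen in HTML."""

    def __init__(self):
        super().__init__()
        self.attributes = defaultdict(set)  # tag name -> attribute names seen
        self.css_properties = set()

    def handle_starttag(self, tag, attrs):
        seen = self.attributes[tag]  # registers the tag even with no attributes
        for name, value in attrs:
            seen.add(name)
            if name == "style" and value:
                # Naive split: fine for simple "prop:value;" declarations only.
                for declaration in value.split(";"):
                    prop, sep, _ = declaration.partition(":")
                    if sep and prop.strip():
                        self.css_properties.add(prop.strip().lower())


def inventory(html_samples):
    """Return (tag -> attribute names, CSS property names) for the samples."""
    parser = MarkupInventory()
    for sample in html_samples:
        parser.feed(sample)
    parser.close()
    return dict(parser.attributes), parser.css_properties


attrs, props = inventory(['<span class="vlist" style="height:1.0697em;"></span>'])
print(attrs)
print(props)
```

Running this over output for a broad set of formulas gives a list of tags, attributes, and properties to review before adding them to the Cleaner configuration.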

willkg commented 2 years ago

Having a context aware allow list could help here. Bleach definitely doesn't support that currently. It feels like it'd be hard to implement because the stripping/escaping for tags is spread across a few classes, but maybe that's not true. You could try looking into that.