mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

bug: bleach.clean removing html and body tags even when explicitly allowed #650

Closed fayyazul-centaurlabs closed 2 years ago

fayyazul-centaurlabs commented 2 years ago

Describe the bug

bleach.clean removes the html and body tags even when those tags are explicitly allowed.

python and bleach versions (please complete the following information):

To Reproduce

Steps to reproduce the behavior:

import bleach

if __name__ == '__main__':

    html = "<html><body><p>text</p></body></html>"
    parse = bleach.clean(html, tags=['html', 'body', 'p'])
    print(parse)

Expected output

<html><body><p>text</p></body></html>

Actual output

<p>text</p>

willkg commented 2 years ago

Bleach clean works on fragments--not entire HTML documents--so this is currently outside the scope of the project.

You could probably get by with removing the HTML and BODY tags yourself, then running the result through Bleach clean.
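
A minimal sketch of that first workaround, not something taken from Bleach itself. The helper name `clean_document`, the `_WRAPPER_RE` pattern, and the optional `clean` parameter are all hypothetical names introduced for illustration. The sketch assumes the document is a simple `<html><body>...</body></html>` wrapper with no `<head>`, and it drops any attributes on those two tags; anything that does not match is passed to the cleaner unchanged as a fragment.

```python
import re

# Hypothetical pattern: matches only a plain <html><body>...</body></html>
# wrapper (no <head>). Regexes are a fragile way to inspect HTML, so this
# is a sketch for simple, well-formed input only.
_WRAPPER_RE = re.compile(
    r"^\s*<html(?:\s[^>]*)?>\s*<body(?:\s[^>]*)?>(?P<inner>.*)</body>\s*</html>\s*$",
    re.IGNORECASE | re.DOTALL,
)


def clean_document(html, tags, clean=None):
    """Strip the html/body wrapper, clean the inner fragment, re-wrap it.

    `clean` defaults to bleach.clean; it is a parameter only so the
    wrapper logic can be exercised without Bleach installed.
    """
    if clean is None:
        import bleach  # imported lazily; requires `pip install bleach`

        clean = bleach.clean

    match = _WRAPPER_RE.match(html)
    if match is None:
        # No recognizable wrapper: treat the input as a fragment.
        return clean(html, tags=tags)

    cleaned = clean(match.group("inner"), tags=tags)
    # Re-add a fixed wrapper; original attributes on html/body are discarded.
    return f"<html><body>{cleaned}</body></html>"


if __name__ == "__main__":
    html = "<html><body><p>text</p></body></html>"
    print(clean_document(html, tags=["p"]))
```

Because the wrapper is re-added as fixed text after sanitizing, `html` and `body` do not need to appear in the allowed tags list.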

Another option could be to write your own Cleaner class that overrides clean:

https://github.com/mozilla/bleach/blob/d30669b2571528430e90ed0d9b5cebba8e9f681e/bleach/sanitizer.py#L149-L188

You'd need to at least call parse instead of parseFragment. I'm not sure what else would need changing.

Hope that helps!

fayyazul-centaurlabs commented 2 years ago

That helps a lot. Thanks @willkg!