mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

bug: `\n` when stripping nested tags #663

Closed drjova closed 2 years ago

drjova commented 2 years ago

Describe the bug

A clear and concise description of what the bug is. [e.g. "bleach.clean does not escape script tag contents"]

python and bleach versions (please complete the following information):

To Reproduce

Steps to reproduce the behavior:

from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example
 example</div>
"""

Expected behavior

from bleach import clean
text = "<div>example<h1> example</h1></div>"
result = clean(text, attributes=[], tags=['div'], strip=True)
print(result)
"""
<div>example example</div>
"""

Thank you 🙏

willkg commented 2 years ago

h1 is a block-level tag. Bleach 5.0.0 fixed sanitizing so that when it strips a block-level tag, it adds a \n in its place, because that's what HTML parsers would do in those circumstances. The problem was covered in issue #369.
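
To make the behavior described above concrete, here is a minimal standard-library sketch. It is not bleach's implementation: the `BLOCK_LEVEL_TAGS` set, the `StrippingSanitizer` class, the `strip_tags` helper, and the `newline_for_block` flag are all hypothetical names invented for illustration, and emitting the newline at the opening tag is an assumption chosen only to match the output reported in this issue.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small illustrative subset of block-level tags, not bleach's own list.
BLOCK_LEVEL_TAGS = {"h1", "h2", "h3", "p", "div", "ul", "ol", "li", "blockquote"}


class StrippingSanitizer(HTMLParser):
    """Toy sanitizer: keeps allowed tags (dropping attributes), strips all others."""

    def __init__(self, allowed_tags, newline_for_block=True):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = set(allowed_tags)
        self.newline_for_block = newline_for_block  # hypothetical opt-out flag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            self.parts.append(f"<{tag}>")
        elif self.newline_for_block and tag in BLOCK_LEVEL_TAGS:
            # A stripped block-level tag leaves a newline so the text it
            # separated does not run together.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.allowed_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text content so it stays safe in the output.
        self.parts.append(escape(data, quote=False))


def strip_tags(text, allowed_tags, newline_for_block=True):
    parser = StrippingSanitizer(allowed_tags, newline_for_block)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


text = "<div>example<h1> example</h1></div>"
print(repr(strip_tags(text, ["div"])))
print(repr(strip_tags(text, ["div"], newline_for_block=False)))
```

With the flag left on, the sketch reproduces the output reported in this issue; with it off, it produces the output the reporter expected. Stripping an inline tag such as `b` inserts nothing in either mode.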

drjova commented 2 years ago

@willkg Thank you for the explanation. It would be nice to have an option to disable this, since not all use cases need the text to be made more readable. Would a PR adding such an option be considered?

willkg commented 2 years ago

What's your use case that this is problematic?

drjova commented 2 years ago

In our case we would like to clean specific tags, including block-level tags, without formatting the content.

willkg commented 2 years ago

That doesn't really answer my question; it mostly restates the bug. What's the use case here? Why is adding a \n problematic?