mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/
Other
2.65k stars 252 forks source link

Handling raw ampersands #192

Closed shacker closed 7 years ago

shacker commented 8 years ago

A user types this & that into a text field, which I then bleach :

title = bleach.clean(title, strip=True, tags=[])

bleach converts the ampersand into an html entity (&)

bleach provides means to whitelist tags, attributes and styles, but no way to whitelist special characters. Now my only option is to mark the field "safe" on output.... but I don't really want to do that.

Is there no way to prevent bleach from converting ampersands into entities?

willkg commented 8 years ago

I don't understand why this is an issue. Why is & not desirable?

shacker commented 8 years ago

Because now I have to mark it as an "html-safe" field on output so that the entity is handled properly by the browser. Yes it's true that I can logically consider it safe since I've already bleached the field, but it feels odd to not have control over whether non-dangerous characters like this are modified by bleach.

The front-end folks are asking me to fix it on the back-end, but maybe that's not the right answer.

willkg commented 8 years ago

That doesn't really answer my question. What's wrong with &? How is the resulting text being used such that & is not ok?

shacker commented 8 years ago

The use case is an editable-text field, which is editable in place. On page load, if I mark the field as safe, it's fine. But then the user clicks on the words to go into edit mode and suddenly it's transformed into an input field. Since html isn't rendered inside an input field, the user is now looking at this & that

willkg commented 7 years ago

I've thought about this a bit and while I see the problem that this creates, I think it's the right thing for Bleach to do. Similarly, the user will see &gt; for > and &lt; for <.

It's important for Bleach to convert these characters. I'm loathe to add a flag to not escape things--at that point I'm not sure what Bleach is really doing.

I think the best way to deal with this is for you to either use something other than Bleach or to somehow safely unescape the characters when the user switches to edit mode.

Given that, I'm going to close this out.

hyperknot commented 6 years ago

I'm in the exact same boat.

I'm using bleach to clean the data, remove tags, etc. and not for safety, as modern front-end frameworks like React sanitise the displayed data automatically.

The fact that a user cannot enter A & B without turning it into A &amp; B is quite a big limitation of this library, when using it in modern front-end stacks. (//modern: React is actually 5 years old).

willkg commented 6 years ago

@hyperknot That's not Bleach's intended purpose. You should find a different library to use or roll your own.

hyperknot commented 6 years ago

I don't get it, bleach is the standard Python library for sanitising and cleaning html.

Do you recommend going against every template language's best practice on earth, and ask the user to switch off auto-escaping by default, just so they are allowed to have & characters? (in <title> for example, and then it's unrelated to front-end libs)

willkg commented 6 years ago

@hyperknot The output of bleach.clean() is safe for HTML. When using the output with templating languages like Jinja2 and Django, it should be marked as safe. If you don't do that, then you've got multiple things doing sanitization passes on the text. It's much harder to reason about the safety properties and shape of the output in that scenario. It's better to have one thing sanitize the text.

Given that, I don't agree that every template language's best practices are as you state. Most/many/all of them include a |safe filter or annotation for text that is safe for HTML context for this purpose.

Regarding use with JS libs and frameworks, it should work. Projects that I work on have used Bleach this way. The use case in this issue is that the output of bleach.clean() is being used both for output in an HTML context as well as being editable. That's a complicated scenario, but I think it's more correctly solved as having the user edit the original text and then round-tripping that with the server to get the sanitized text for HTML context. That's more correct, but it requires a round-trip to the server and that kind of sucks.

I'd love to talk about your specific scenario and help you through it. If it's exactly the same as this issue (need to use bleach.clean output for a field that a user edits), we can keep the conversation here. Otherwise can you create a new issue for it?

In either case, more details about your specific problem would be helpful. Can you point me to the source code? If not, can you prepare an example that shows the issue?

hyperknot commented 6 years ago

Yes, it is user editable fields. I'm running a website (https://maphub.net/) which allows users to create interactive maps, as well as to upload all kind of user supplied data in various format. I'm converting HTML to Markdown on upload, and running bleach.clean() on the data afterwards.

The concept here is very simple. I'd like to store clear data in the database. Clear which means no leftover tags and no HTML entities. Just pure text data, as it should be in a database. Whichever representation of that data happens (either server side Jinja2 or client side React), should work on clean data.

Think of it like this:

  1. Using the slow but thorough bleach.clean() on data writes and
  2. Using the quick, built in filters of the template libraries on data reads.

// Actually, my workaround is the following for now, and is working perfectly. I just wanted to mention it'd be nice to have it built in.

cleaned = bleach.clean(text, tags=allowed_tags, strip=True, attributes=attrs).strip()
cleaned = cleaned.replace('&amp;', '&')
GregTurner commented 6 years ago

I'd agree with @hyperknot, storing HTML in the DB where we expect it not to be encoded (because we sanitized ALL html) is prone to a double encoded problem. E.g., & in a string is stored as &amp; then encoded for HTML and the result would be &amp;amp on the browser. Recommend as an option (off by default) for this particular use-case.

hoIIer commented 4 years ago

just ran into this and agree with @hyperknot it'd be much more convenient to support leaving certain characters alone e.g. "&"

soujanyat commented 2 years ago

We ran into the use case where user input is like 'Jeff&John' and show it on the display screen. bleach converts & to & the display screen is showing as 'Jeff&John' and treating it as a bug for an end-user perspective.

what will be the security implication if we go with the workaround solution by @hyperknot cleaned = cleaned.replace('&amp;', '&')

cknoll commented 2 years ago

I am also affected of this issue.

Background: I develop a django app which takes user generated conent and renders it to markdown (edit: html). I also want to support mathematical formulas (from LaTeX source) with https://facelessuser.github.io/pymdown-extensions/extensions/arithmatex/. However such formulas might perfectly contain &, <, and > symbols.

The brutal .replace('&amp;', '&')-approach seems error-prone since the rest of markdown should be sanitized as bleach does by default, i.e. &&amp;. I currently only see the option to use some regex magic (BeautifulSoup) to handle everything inside tags like <span class="MathJax_Preview"> separately.

I would be glad if there were a more elegant solution.

willkg commented 2 years ago

@cknoll Where are you using Bleach in "takes user generated content and renders it to Markdown"?

cknoll commented 2 years ago

Sorry, I explained it wrong: I take user generated content which is mostly markdown (+ LaTeX code) pass it to markdown for rendering HTML and then passing it through bleach to avoid harmful things showing up in the result. By the way: here is the above mentioned workaround (which might be helpful for others): markpad/mainapp/util.py#L124.

willkg commented 2 years ago

@cknoll I think you should write up a new issue. Your issue is more about using Bleach to correctly clean the output of Markdown rendering to HTML that has mathjax bits <script> tags. This issue is more about saving Bleach output in a database and then letting users edit the cleaned output rather than their original input.

hoIIer commented 2 years ago

My workaround, admittedly not ideal.

    # handle ampersand.
    text = text.replace(r'&', '<amp></amp>')

    # bleach the text.
    text = (
        bleach
        .clean(text, tags=['amp',])
        .replace(r'<amp></amp>', '&')
    )
taylor-cedar commented 1 year ago

I would like to add my use case that also ran into this issue. I have a place where people can enter URLs and I want to make sure they don't enter HTML tags. This library does the following to URLs

https://test.com?helllo=1&world=2 to https://test.com?helllo=1&amp;world=2

This isn't ideal and it would be great to be able to turn it off. Tag escaping can be turned off, not sure why this can't.

Speedy1991 commented 1 month ago

Why is this issue closed while this is still a problem?

This isn't ideal and it would be great to be able to turn it off. Tag escaping can be turned off, not sure why this can't.

I absolutly agree with this. There are enough usecases to reopen this issue and rethink the 8+ year old decissions

willkg commented 1 month ago

I looked back through the comments and I still think I made the right design choice. I'm not obligated to continue revisiting this even if all the people in the world think I'm wrong. I understand if that's disappointing. Good library design is rarely disappointment-free.

Regardless, this library is deprecated and it's in a minimal-maintenance mode so I wouldn't implement this even if I changed my mind. I'm leaving this comment as a courtesy because for some reason I woke up this morning and felt like reading Bleach-related email.

I encourage you all to find or create another library to do the thing you need.