Closed shacker closed 7 years ago
I don't understand why this is an issue. Why is &
not desirable?
Because now I have to mark it as an "html-safe" field on output so that the entity is handled properly by the browser. Yes it's true that I can logically consider it safe since I've already bleached the field, but it feels odd to not have control over whether non-dangerous characters like this are modified by bleach.
The front-end folks are asking me to fix it on the back-end, but maybe that's not the right answer.
That doesn't really answer my question. What's wrong with &
? How is the resulting text being used such that &
is not ok?
The use case is an editable-text
field, which is editable in place. On page load, if I mark the field as safe, it's fine. But then the user clicks on the words to go into edit mode and suddenly it's transformed into an input field. Since html isn't rendered inside an input field, the user is now looking at this & that
I've thought about this a bit and while I see the problem that this creates, I think it's the right thing for Bleach to do. Similarly, the user will see >
for >
and <
for <
.
It's important for Bleach to convert these characters. I'm loathe to add a flag to not escape things--at that point I'm not sure what Bleach is really doing.
I think the best way to deal with this is for you to either use something other than Bleach or to somehow safely unescape the characters when the user switches to edit mode.
Given that, I'm going to close this out.
I'm in the exact same boat.
I'm using bleach to clean the data, remove tags, etc. and not for safety, as modern front-end frameworks like React sanitise the displayed data automatically.
The fact that a user cannot enter A & B
without turning it into A & B
is quite a big limitation of this library, when using it in modern front-end stacks. (//modern: React is actually 5 years old).
@hyperknot That's not Bleach's intended purpose. You should find a different library to use or roll your own.
I don't get it, bleach is the standard Python library for sanitising and cleaning html.
Do you recommend going against every template language's best practice on earth, and ask the user to switch off auto-escaping by default, just so they are allowed to have & characters? (in <title>
for example, and then it's unrelated to front-end libs)
@hyperknot The output of bleach.clean()
is safe for HTML. When using the output with templating languages like Jinja2 and Django, it should be marked as safe. If you don't do that, then you've got multiple things doing sanitization passes on the text. It's much harder to reason about the safety properties and shape of the output in that scenario. It's better to have one thing sanitize the text.
Given that, I don't agree that every template language's best practices are as you state. Most/many/all of them include a |safe
filter or annotation for text that is safe for HTML context for this purpose.
Regarding use with JS libs and frameworks, it should work. Projects that I work on have used Bleach this way. The use case in this issue is that the output of bleach.clean()
is being used both for output in an HTML context as well as being editable. That's a complicated scenario, but I think it's more correctly solved as having the user edit the original text and then round-tripping that with the server to get the sanitized text for HTML context. That's more correct, but it requires a round-trip to the server and that kind of sucks.
I'd love to talk about your specific scenario and help you through it. If it's exactly the same as this issue (need to use bleach.clean
output for a field that a user edits), we can keep the conversation here. Otherwise can you create a new issue for it?
In either case, more details about your specific problem would be helpful. Can you point me to the source code? If not, can you prepare an example that shows the issue?
Yes, it is user editable fields. I'm running a website (https://maphub.net/) which allows users to create interactive maps, as well as to upload all kind of user supplied data in various format. I'm converting HTML to Markdown on upload, and running bleach.clean() on the data afterwards.
The concept here is very simple. I'd like to store clear data in the database. Clear which means no leftover tags and no HTML entities. Just pure text data, as it should be in a database. Whichever representation of that data happens (either server side Jinja2 or client side React), should work on clean data.
Think of it like this:
bleach.clean()
on data writes and// Actually, my workaround is the following for now, and is working perfectly. I just wanted to mention it'd be nice to have it built in.
cleaned = bleach.clean(text, tags=allowed_tags, strip=True, attributes=attrs).strip()
cleaned = cleaned.replace('&', '&')
I'd agree with @hyperknot, storing HTML in the DB where we expect it not to be encoded (because we sanitized ALL html) is prone to a double encoded problem. E.g., &
in a string is stored as &
then encoded for HTML and the result would be &amp
on the browser. Recommend as an option (off by default) for this particular use-case.
just ran into this and agree with @hyperknot it'd be much more convenient to support leaving certain characters alone e.g. "&"
We ran into the use case where user input is like 'Jeff&John' and show it on the display screen. bleach converts & to & the display screen is showing as 'Jeff&John' and treating it as a bug for an end-user perspective.
what will be the security implication if we go with the workaround solution by @hyperknot
cleaned = cleaned.replace('&', '&')
I am also affected of this issue.
Background:
I develop a django app which takes user generated conent and renders it to markdown (edit: html). I also want to support mathematical formulas (from LaTeX source) with https://facelessuser.github.io/pymdown-extensions/extensions/arithmatex/. However such formulas might perfectly contain &
, <
, and >
symbols.
The brutal .replace('&', '&')
-approach seems error-prone since the rest of markdown should be sanitized as bleach does by default, i.e. &
→ &
. I currently only see the option to use some regex magic (BeautifulSoup) to handle everything inside tags like <span class="MathJax_Preview">
separately.
I would be glad if there were a more elegant solution.
@cknoll Where are you using Bleach in "takes user generated content and renders it to Markdown"?
Sorry, I explained it wrong: I take user generated content which is mostly markdown (+ LaTeX code) pass it to markdown
for rendering HTML and then passing it through bleach to avoid harmful things showing up in the result. By the way: here is the above mentioned workaround (which might be helpful for others): markpad/mainapp/util.py#L124.
@cknoll I think you should write up a new issue. Your issue is more about using Bleach to correctly clean the output of Markdown rendering to HTML that has mathjax bits <script>
tags. This issue is more about saving Bleach output in a database and then letting users edit the cleaned output rather than their original input.
My workaround, admittedly not ideal.
# handle ampersand.
text = text.replace(r'&', '<amp></amp>')
# bleach the text.
text = (
bleach
.clean(text, tags=['amp',])
.replace(r'<amp></amp>', '&')
)
I would like to add my use case that also ran into this issue. I have a place where people can enter URLs and I want to make sure they don't enter HTML tags. This library does the following to URLs
https://test.com?helllo=1&world=2
to https://test.com?helllo=1&world=2
This isn't ideal and it would be great to be able to turn it off. Tag escaping can be turned off, not sure why this can't.
Why is this issue closed while this is still a problem?
This isn't ideal and it would be great to be able to turn it off. Tag escaping can be turned off, not sure why this can't.
I absolutly agree with this. There are enough usecases to reopen this issue and rethink the 8+ year old decissions
I looked back through the comments and I still think I made the right design choice. I'm not obligated to continue revisiting this even if all the people in the world think I'm wrong. I understand if that's disappointing. Good library design is rarely disappointment-free.
Regardless, this library is deprecated and it's in a minimal-maintenance mode so I wouldn't implement this even if I changed my mind. I'm leaving this comment as a courtesy because for some reason I woke up this morning and felt like reading Bleach-related email.
I encourage you all to find or create another library to do the thing you need.
A user types
this & that
into a text field, which I then bleach :title = bleach.clean(title, strip=True, tags=[])
bleach converts the ampersand into an html entity (
&
)bleach provides means to whitelist tags, attributes and styles, but no way to whitelist special characters. Now my only option is to mark the field "safe" on output.... but I don't really want to do that.
Is there no way to prevent bleach from converting ampersands into entities?