mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

feature: strip all URLs #664

Closed jvanasco closed 1 year ago

jvanasco commented 2 years ago

It does not seem possible to strip all URLs with Bleach.

For example, the closest we can get, going by the docs, is:

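import bleach


def remove_it(attrs, new=False):
    # Returning None from a linkify callback removes the link markup.
    return None


payloads = (
    'a <a href="http://example.com/outer">https://example.com/inner</a> b',
    "a https://example.com/bare b",
)

for payload in payloads:
    print("=====")
    # Attempt 1: linkify with a callback that rejects every link.
    result = bleach.linkify(payload, callbacks=[remove_it])
    print(result)
    # Attempt 2: clean with an empty list of allowed protocols.
    result = bleach.clean(payload, protocols=[])
    print(result)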

However, the result is:

=====
a https://example.com/inner b
a <a>https://example.com/inner</a> b
=====
a https://example.com/bare b
a https://example.com/bare b

While the desired result is simply:

=====
a  b
a  b
=====
a  b
a  b

In many situations involving user-generated content, it is desirable to prevent URLs entirely, even when they would only be rendered as plain text. Currently, this must be handled outside of bleach in a separate processing step. Filtering URLs out within bleach would be preferable, since bleach has already parsed them.

willkg commented 2 years ago

I think you need to write a new filter. I bet you could base it on the current LinkifyFilter but change this part here:

https://github.com/mozilla/bleach/blob/4f951d3299ca23545231a5b0a64223ece384bb8b/bleach/linkifier.py#L316-L332

Does that help?
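
For anyone landing here later, a minimal sketch of what such a filter's core might look like. This is not code from bleach and not the LinkifyFilter logic linked above. It assumes the html5lib-style token dicts that bleach filters iterate over ("type", "name", "data" keys), and the names strip_url_tokens, text_of, and URL_PATTERN are hypothetical. The regex is a deliberately simplified stand-in for bleach's own URL matching and will miss bare domains such as "example.com".

```python
import re

# Hypothetical, simplified URL pattern (assumption): only matches text that
# starts with http://, https://, or www. and runs to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_url_tokens(tokens):
    """Drop <a> elements (including their contents) and URL-like text.

    `tokens` is assumed to be an iterable of html5lib-style token dicts.
    """
    anchor_depth = 0
    for token in tokens:
        kind = token["type"]
        name = token.get("name")
        if kind == "StartTag" and name == "a":
            anchor_depth += 1
            continue
        if kind == "EndTag" and name == "a":
            anchor_depth = max(anchor_depth - 1, 0)
            continue
        if anchor_depth:
            # Still inside an <a> element: discard its contents too.
            continue
        if kind == "Characters":
            text = URL_PATTERN.sub("", token["data"])
            if not text:
                continue
            # Copy the token so the caller's dict is not mutated.
            token = dict(token, data=text)
        yield token


def text_of(tokens):
    """Join the character data of a token stream (for demonstration only)."""
    return "".join(t["data"] for t in tokens if t["type"] == "Characters")


# Hand-built token streams approximating the two payloads from the report.
linked = [
    {"type": "Characters", "data": "a "},
    {"type": "StartTag", "name": "a",
     "data": {(None, "href"): "http://example.com/outer"}},
    {"type": "Characters", "data": "https://example.com/inner"},
    {"type": "EndTag", "name": "a"},
    {"type": "Characters", "data": " b"},
]
bare = [{"type": "Characters", "data": "a https://example.com/bare b"}]

print(repr(text_of(strip_url_tokens(linked))))
print(repr(text_of(strip_url_tokens(bare))))
```

To hook this into bleach, the generator would presumably be wrapped in a subclass of bleach.html5lib_shim.Filter whose __iter__ passes Filter.__iter__(self) through strip_url_tokens, and that class would be handed to bleach.sanitizer.Cleaner via its filters argument, as the bleach docs describe for custom filters. That wiring is untested here.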

willkg commented 1 year ago

I'm assuming that helped. Closing this out.