mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

bug: `linkify` with `parse_email=True` doesn't handle "%" and "?" in `addr-specs` #658

Closed larseggert closed 2 years ago

larseggert commented 2 years ago

Describe the bug

linkify with parse_email=True leaves "%" and "?" unencoded in the generated mailto: href. Both characters may occur in RFC822 addr-specs (see https://datatracker.ietf.org/doc/html/rfc2368#section-6).

To Reproduce

Steps to reproduce the behavior:

>>> bleach.linkify("gorby%kremvax@example.com", parse_email=True)
'<a href="mailto:gorby%kremvax@example.com">gorby%kremvax@example.com</a>'

Expected behavior

I expected RFC822 special characters to be percent-encoded according to RFC2368:

>>> bleach.linkify("gorby%kremvax@example.com", parse_email=True)
'<a href="mailto:gorby%25kremvax@example.com">gorby%kremvax@example.com</a>'

Additional context

Same issue exists with "?"; I didn't test other RFC822 special characters but suspect they are similarly left unquoted.
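For illustration, the expected encoding can be sketched with the standard library alone. This is not bleach's code: `mailto_href` is a hypothetical helper, and it assumes that only the local part (everything before the last "@") needs percent-encoding.

```python
from urllib.parse import quote


def mailto_href(addr_spec: str) -> str:
    """Build a mailto: href with a percent-encoded local part (hypothetical helper)."""
    # Split on the last "@" so that only the local part is encoded.
    local, sep, domain = addr_spec.rpartition("@")
    if not sep:
        raise ValueError("not an addr-spec: %r" % addr_spec)
    # safe="" makes quote() encode every reserved character,
    # including "%", "?" and "/" (which quote() leaves alone by default).
    return "mailto:" + quote(local, safe="") + "@" + domain


print(mailto_href("gorby%kremvax@example.com"))  # mailto:gorby%25kremvax@example.com
print(mailto_href("who?what@example.com"))       # mailto:who%3Fwhat@example.com
```

With safe="" the helper also encodes characters such as "+" that do not strictly need it; that is more conservative than required, but the resulting URL should still decode to the same address.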

willkg commented 2 years ago

Thank you for the bug report! I'd appreciate a pull request from anyone who wants to tackle this. I don't think I'm going to get to it.

larseggert commented 2 years ago

I tried to wrap a urllib.parse.quote() around the match.group(0) bit in https://github.com/mozilla/bleach/blob/481b146b074ed004eab39abf8f9b964fcd61c408/bleach/linkifier.py#L304 but that seems to have no effect.

jozo commented 2 years ago

I have noticed a similar problem with the clean() function. Maybe it has the same root cause.

Example:

In [1]: import bleach

In [2]: bleach.clean("<a href='https://example.org?a=1&b=2'>example</a>")
Out[2]: '<a href="https://example.org?a=1&amp;b=2">example</a>'

Notice that & is changed to &amp;.

willkg commented 2 years ago

@jozo that's not the same thing. The & should be escaped to &amp;.
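A small standard-library sketch of why the escaped form is correct: inside an HTML attribute value, & is written as &amp;, and an HTML parser decodes it back to the original URL. The `HrefCollector` class is a hypothetical name used only for this illustration; none of this is bleach's own code.

```python
from html import escape
from html.parser import HTMLParser

url = "https://example.org?a=1&b=2"

# Escaping for use inside an attribute value turns "&" into "&amp;".
escaped = escape(url, quote=True)
print(escaped)  # https://example.org?a=1&amp;b=2


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a href="%s">example</a>' % escaped)
collector.close()

# The parser decodes the character reference, so the link target is unchanged.
print(collector.hrefs)  # ['https://example.org?a=1&b=2']
```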