bug: linkify emails matching only part of TLDs

gseastream commented 3 years ago

Describe the bug

bleach.linkify with parse_email=True will improperly and incompletely linkify emails with unrecognized TLDs, but whose TLD contains a recognized TLD.

For example, by default, the .rocks TLD is not in the list of known TLDs. However, .ro is in that list. Since .rocks is not in the list, I would expect emails with that TLD to not be linkified (unless I decided to add .rocks to the list). But instead, linkify is recognizing the beginning of rocks as ro and creating an email link ending with ro and cutting off the cks. This can be pretty bad when run on user-generated content, since it'd be totally fine for the email to be left as plain text when given unrecognized TLDs, but instead a broken link is generated (and it might not be immediately clear that the link they're clicking actually didn't include the last few characters, depending on the styling of the link).

The same is true for the non-default .party TLD and the default .pa.

I'm no regex expert, but is there a way to get the email_re to match only when the "whole" TLD matches? i.e. cut off by a space (end of word) or period+space (end of sentence) or end of string, or maybe some other edge cases (I know some of the TLDs on IANA list have numbers and hyphens in them which should not count as cutoffs).

Python Version: 3.8.5
Bleach Version: 3.2.1

To Reproduce

>>> import bleach
>>> bleach.linkify("user@gmail.rocks", parse_email=True)
'<a href="mailto:user@gmail.ro">user@gmail.ro</a>cks'

Expected behavior

>>> import bleach
>>> bleach.linkify("user@gmail.rocks", parse_email=True)
'user@gmail.rocks'

Additional context

Another interesting way this problem occurs is even if .rocks is in the TLD list, if it occurs after .ro in said list, it will match .ro instead of .rocks.

g-k commented 3 years ago

Hey, thanks for the report!

Using a build_email_re with the longer TLD before its substring TLDs should work around the issue:

>>> from bleach.linkifier import Linker, build_email_re
>>> Linker(email_re=build_email_re(tlds=['party', 'pa']), parse_email=True).linkify('foo@bar.pa')
'<a href="mailto:foo@bar.pa">foo@bar.pa</a>'
>>> Linker(email_re=build_email_re(tlds=['party', 'pa']), parse_email=True).linkify('foo@bar.party')
'<a href="mailto:foo@bar.party">foo@bar.party</a>'
>>>

build_email_re takes the TLDs in order, so longer TLDs need to come first e.g. build_email_re(tlds=['party', 'pa']) will match domains roughly like (party|pa) but the reverse would always match pa before party.

In bleach, we can:

[ ] update the docs to mention this weird behavior
[ ] change build_email_re to match longer domains first e.g. the builtin TLDs list has:

# Make sure that .com doesn't get matched by .co first                                                                                                                   
TLDS.reverse()

[ ] in 4.0.0 when we drop Python 2 support, use a non-regex parser e.g. https://docs.python.org/3/library/email.parser.html or https://docs.python.org/3/library/urllib.parse.html#url-parsing with just the mailto: scheme

gseastream commented 3 years ago

Do you mean dropping Python 2 support?

My concern is more about handling TLDs that I decide not to match. Like whether I add party or rocks to my list is one thing, but even without those in the list I wouldn't want their cropped substring to be matching either (unless I really have an email like user@gmail.pa). Especially because there are so many TLDs that I don't think I want to include all of them, and more will be created in the future. So there could very well be user-generated content with valid emails with new-fangled TLDs that I don't match, but then their emails would turn into broken links instead of the very benign alternative of staying as plain text.

Bringing this up also because I notice that linkify does not have this problem with urls, only emails. So google.party will not match at all, but user@google.party will match the pa and cut off the rest. So it seems doable, I just don't understand enough of what's going on in that regex to get the email re to do whatever the url re is doing better.

Also while typing this I noticed that github markdown is 1) linkifying emails and urls no matter what their TLDs are, and 2) not linkifying urls unless there's actually a http(s) in front of it. Could bleach.linkify mimic this behavior too? Then there's no more concern about keeping TLDs up to date, or matching partial TLDs. And with point (2) there's no more concern about matching models.py in the middle of sentence, as opposed to the actual link https://models.py. Of course then you don't match google.com without the https, but that seems normal (or at least worthy of a config option) to me.

GitHub linkify examples user@gmail.party, user@gmail.pa, user@gmail.whatever, models.whatever (not linkified), https://models.whatever

gseastream commented 3 years ago

I may have found a solution. If you add a \b (which I learned means word-boundary) to the end of this line:

https://github.com/mozilla/bleach/blob/f5971aa57461cce00d6dd474e901bcbd00f240f1/bleach/linkifier.py#L84

So that it is

        )@(?:[A-Z0-9](?:[A-Z0-9-]{{0,61}}[A-Z0-9])?\.)+(?:{0}))\b  # domain

Then you get the following results, which I think match expectations (although there are some TLDs with hyphens in it so maybe that's a weird edge case):

>>> EMAIL_RE.match("user@gmail.rocks") == None
True
>>> EMAIL_RE.match("user@gmail.ro")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro.")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-woof")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-woof.com")
<re.Match object; span=(0, 22), match='user@gmail.ro-woof.com'>

Although I would expect the following to not match at all, but this is a pretty weird example. If the regex accepted any TLD, then this wouldn't be a problem.

>>> EMAIL_RE.match("user@gmail.ro-woof.rocks")
<re.Match object; span=(0, 13), match='user@gmail.ro'>

gseastream commented 3 years ago

Borrowing regex's from py-gfm's autolink and automail extensions, I've found much nicer and robust results that address everything above, basically mimicing GFM's own autolinking rules (currently with the addition of FTP links):

URL_RE = re.compile(
    r"(?i)\b((?:(?:ftp|https?)://|www\d{0,3}[.])(?:[^\s()<>]+|"
    r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()"
    r"<>]+\)))*\)|[^\s`!()\[\]{};:" + r"'" + r'".,<>?«»“”‘’]))'
)

EMAIL_RE = re.compile(r"\b(?i)([a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]+)\b")

At the very least I am now using these expressions instead of the defaults with bleach.linkify and I'm satisfied with that. I think it might also be worthwhile to either update the bleach.linkify regex's to use the ones above (after whatever testing you need to do), or maybe add them to the docs as a good permissive alternative to bleach's default.

Click to see test examples

**Links and emails** https://forms.gle/xyz https://google.party/xyz https://google.com/xyz https://google.museum/xyz http://whatever.nothing www.google.com/foo www.models.py user.cool@gmail.com user.cool@gmail.party user.cool@gmail.rocks user.cool@gmail.museum doesntmatter@foo-bar.de ftp://this.also.works **Not links** google.com/xyz google.party/xyz google.com google.party models.py

willkg commented 1 year ago

Bleach is deprecated, so we're not going to continue honing the linkifier. Sorry!

mozilla / bleach

bug: linkify emails matching only part of TLDs #563