Closed gseastream closed 1 year ago
Hey, thanks for the report!
Using a build_email_re
with the longer TLD before its substring TLDs should work around the issue:
>>> from bleach.linkifier import Linker, build_email_re
>>> Linker(email_re=build_email_re(tlds=['party', 'pa']), parse_email=True).linkify('foo@bar.pa')
'<a href="mailto:foo@bar.pa">foo@bar.pa</a>'
>>> Linker(email_re=build_email_re(tlds=['party', 'pa']), parse_email=True).linkify('foo@bar.party')
'<a href="mailto:foo@bar.party">foo@bar.party</a>'
>>>
build_email_re
takes the TLDs in order, so longer TLDs need to come first e.g. build_email_re(tlds=['party', 'pa'])
will match domains roughly like (party|pa)
but the reverse would always match pa
before party
.
In bleach, we can:
build_email_re
to match longer domains first e.g. the builtin TLDs list has:# Make sure that .com doesn't get matched by .co first
TLDS.reverse()
mailto:
schemeDo you mean dropping Python 2 support?
My concern is more about handling TLDs that I decide not to match. Like whether I add party or rocks to my list is one thing, but even without those in the list I wouldn't want their cropped substring to be matching either (unless I really have an email like user@gmail.pa
). Especially because there are so many TLDs that I don't think I want to include all of them, and more will be created in the future. So there could very well be user-generated content with valid emails with new-fangled TLDs that I don't match, but then their emails would turn into broken links instead of the very benign alternative of staying as plain text.
Bringing this up also because I notice that linkify does not have this problem with urls, only emails. So google.party
will not match at all, but user@google.party
will match the pa
and cut off the rest. So it seems doable, I just don't understand enough of what's going on in that regex to get the email re to do whatever the url re is doing better.
Also while typing this I noticed that github markdown is 1) linkifying emails and urls no matter what their TLDs are, and 2) not linkifying urls unless there's actually a http(s) in front of it. Could bleach.linkify
mimic this behavior too? Then there's no more concern about keeping TLDs up to date, or matching partial TLDs. And with point (2) there's no more concern about matching models.py in the middle of sentence, as opposed to the actual link https://models.py. Of course then you don't match google.com without the https, but that seems normal (or at least worthy of a config option) to me.
GitHub linkify examples user@gmail.party, user@gmail.pa, user@gmail.whatever, models.whatever (not linkified), https://models.whatever
I may have found a solution. If you add a \b
(which I learned means word-boundary) to the end of this line:
So that it is
)@(?:[A-Z0-9](?:[A-Z0-9-]{{0,61}}[A-Z0-9])?\.)+(?:{0}))\b # domain
Then you get the following results, which I think match expectations (although there are some TLDs with hyphens in it so maybe that's a weird edge case):
>>> EMAIL_RE.match("user@gmail.rocks") == None
True
>>> EMAIL_RE.match("user@gmail.ro")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro.")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-woof")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
>>> EMAIL_RE.match("user@gmail.ro-woof.com")
<re.Match object; span=(0, 22), match='user@gmail.ro-woof.com'>
Although I would expect the following to not match at all, but this is a pretty weird example. If the regex accepted any TLD, then this wouldn't be a problem.
>>> EMAIL_RE.match("user@gmail.ro-woof.rocks")
<re.Match object; span=(0, 13), match='user@gmail.ro'>
Borrowing regex's from py-gfm's autolink and automail extensions, I've found much nicer and robust results that address everything above, basically mimicing GFM's own autolinking rules (currently with the addition of FTP links):
URL_RE = re.compile(
r"(?i)\b((?:(?:ftp|https?)://|www\d{0,3}[.])(?:[^\s()<>]+|"
r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()"
r"<>]+\)))*\)|[^\s`!()\[\]{};:" + r"'" + r'".,<>?«»“”‘’]))'
)
EMAIL_RE = re.compile(r"\b(?i)([a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]+)\b")
At the very least I am now using these expressions instead of the defaults with bleach.linkify
and I'm satisfied with that. I think it might also be worthwhile to either update the bleach.linkify
regex's to use the ones above (after whatever testing you need to do), or maybe add them to the docs as a good permissive alternative to bleach's default.
Bleach is deprecated, so we're not going to continue honing the linkifier. Sorry!
Describe the bug
bleach.linkify
withparse_email=True
will improperly and incompletely linkify emails with unrecognized TLDs, but whose TLD contains a recognized TLD.For example, by default, the
.rocks
TLD is not in the list of known TLDs. However,.ro
is in that list. Since.rocks
is not in the list, I would expect emails with that TLD to not be linkified (unless I decided to add.rocks
to the list). But instead, linkify is recognizing the beginning ofrocks
asro
and creating an email link ending withro
and cutting off thecks
. This can be pretty bad when run on user-generated content, since it'd be totally fine for the email to be left as plain text when given unrecognized TLDs, but instead a broken link is generated (and it might not be immediately clear that the link they're clicking actually didn't include the last few characters, depending on the styling of the link).The same is true for the non-default
.party
TLD and the default.pa
.I'm no regex expert, but is there a way to get the email_re to match only when the "whole" TLD matches? i.e. cut off by a space (end of word) or period+space (end of sentence) or end of string, or maybe some other edge cases (I know some of the TLDs on IANA list have numbers and hyphens in them which should not count as cutoffs).
To Reproduce
Expected behavior
Additional context
Another interesting way this problem occurs is even if
.rocks
is in the TLD list, if it occurs after.ro
in said list, it will match.ro
instead of.rocks
.