Closed larseggert closed 2 years ago
You can supply your own list of tlds using:
LINKIFY_URL_REGEX = bleach.linkifier.build_url_re(tlds=top_level_domains, protocols=['http', 'https'])
And then using the url_re kwarg when creating a Linker or Cleaner object.
So I tried something like this:
import bleach
import tlds
tlds = " ".join(tlds.tld_set)
bleach_linker = bleach.Linker(
url_re=bleach.linkifier.build_url_re(tlds=tlds),
email_re=bleach.linkifier.build_email_re(tlds=tlds),
parse_email=True
)
bleach_linker.linkify("Jon Peterson <jon.peterson@team.neustar>")
That results in 'Jon Peterson <<a href="mailto:jon.peterson@team.">jon.peterson@team.</a>neustar>', which may be because both neustar and team are in tld_set. In any event, there is still a bug here.
You should maybe look into identifying domain names similar to what https://github.com/lipoja/URLExtract does, i.e., work backwards from potential TLDs.
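To make the "work backwards from potential TLDs" idea concrete, here is a minimal sketch. It is not URLExtract's actual implementation; find_domains and HOST_CHARS are hypothetical names invented for this illustration. It looks for a ".tld" suffix first and then extends leftward over hostname characters.

```python
# Characters allowed in a hostname for this sketch (ASCII only, an assumption).
HOST_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def find_domains(text, tlds):
    """Hypothetical helper: find '.<tld>' suffixes, then extend leftward."""
    lowered = text.lower()
    found = []
    # Scan longest TLDs first, alphabetically within a length, for a stable order.
    for tld in sorted(tlds, key=lambda t: (-len(t), t)):
        suffix = "." + tld.lower()
        start = lowered.find(suffix)
        while start != -1:
            end = start + len(suffix)
            # The TLD must end the hostname: the next character, if any,
            # may not be another hostname character.
            if end == len(lowered) or lowered[end] not in HOST_CHARS:
                left = start
                while left > 0 and lowered[left - 1] in HOST_CHARS:
                    left -= 1
                candidate = text[left:end].lstrip(".-")
                # Require at least one label in front of the TLD.
                if len(candidate) > len(tld) and candidate not in found:
                    found.append(candidate)
            start = lowered.find(suffix, start + 1)
    return found


print(find_domains("Jon Peterson <jon.peterson@team.neustar>", ["team", "neustar"]))
# → ['team.neustar']
```

Because the scan is anchored on the TLD and checks what follows it, a shorter TLD such as "co" is not accepted in the middle of "example.com".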
Hmm, if I set tlds = ['neustar', 'team'] then I get the expected result with your code snippet:
Jon Peterson <<a href="mailto:jon.peterson@team.neustar">jon.peterson@team.neustar</a>>
Ah! I need to pass a list into the build_*_re functions, not a string. That fixes it - thanks!
So tlds = tlds.tld_set in the snippet above.
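For anyone who hits the same thing, here is a rough illustration of why a string misbehaves. It assumes the build_*_re helpers combine the TLDs with "|".join(tlds) and compile the pattern in verbose mode; build_simple_email_re below is a simplified stand-in written for this example, not bleach's actual pattern.

```python
import re


def build_simple_email_re(tlds):
    """Simplified stand-in for illustration, NOT bleach's real regex."""
    return re.compile(
        r"""[\w.+-]+@(?:[a-z0-9-]+\.)+(?:{0})  # local@labels.tld
        """.format("|".join(tlds)),
        re.IGNORECASE | re.VERBOSE,
    )


text = "Jon Peterson <jon.peterson@team.neustar>"

# Wrong: a string is iterated character by character, so the alternation
# becomes "t|e|a|m| |n|...". In verbose mode the space is ignored, which
# leaves an empty alternative that matches nothing at all.
bad = build_simple_email_re(" ".join(["team", "neustar"]))

# Right: an iterable of whole TLD strings gives "team|neustar".
good = build_simple_email_re(["team", "neustar"])

bad_match = bad.search(text).group(0)
good_match = good.search(text).group(0)

print(bad_match)   # → jon.peterson@team.
print(good_match)  # → jon.peterson@team.neustar
```

Under those assumptions, the truncated match from the string version lines up with the broken mailto: link reported earlier in this thread.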
Describe the bug
The current TLD list at https://github.com/mozilla/bleach/blob/main/bleach/linkifier.py#L11-L23 does not contain most currently-existing TLDs (see, e.g., https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains).
This means that URLs and email addresses using such TLDs are not correctly handled.
To Reproduce
Expected behavior