mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

https://bleach.readthedocs.io/en/latest/

TLD list is outdated #656

Closed by larseggert 2 years ago

larseggert commented 2 years ago

Describe the bug

The current TLD list at https://github.com/mozilla/bleach/blob/main/bleach/linkifier.py#L11-L23 does not contain most currently-existing TLDs (see, e.g., https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains).

This means that URLs and email addresses using such TLDs are not correctly handled.

Python Version: 3.9.12
Bleach Version: 5.0.0

To Reproduce

>>> import bleach
>>> bleach.linkify("Jon Peterson <jon.peterson@team.neustar>", parse_email=True)
'Jon Peterson &lt;<a href="mailto:jon.peterson@team.ne">jon.peterson@team.ne</a>ustar&gt;'

Expected behavior

>>> bleach.linkify("Jon Peterson <jon.peterson@team.neustar>", parse_email=True)
'Jon Peterson &lt;<a href="mailto:jon.peterson@team.neustar">jon.peterson@team.neustar</a>&gt;'
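
The truncation at "team.ne" can be reproduced with a much-simplified stand-in for the email pattern. This is only a sketch to illustrate the effect, not bleach's actual regex: build_simple_email_re and the short TLD lists are assumptions made up for the example.

```python
import re


def build_simple_email_re(tlds):
    """Build a simplified email regex whose domain must end in one of `tlds`.

    Hypothetical helper for illustration; bleach's real pattern is more involved.
    """
    tld_alternation = "|".join(re.escape(tld) for tld in tlds)
    return re.compile(
        r"[\w.+-]+@(?:[a-z0-9-]+\.)+(?:" + tld_alternation + r")",
        re.IGNORECASE,
    )


text = "Jon Peterson <jon.peterson@team.neustar>"

# A list that knows "ne" but not "neustar" stops after the first two letters.
outdated = build_simple_email_re(["com", "net", "org", "ne"])
truncated = outdated.search(text).group(0)

# Listing the missing TLD ahead of its shorter prefix lets the full address match.
updated = build_simple_email_re(["neustar", "team", "com", "net", "org", "ne"])
complete = updated.search(text).group(0)

print(truncated)  # jon.peterson@team.ne
print(complete)   # jon.peterson@team.neustar
```

Because regex alternation takes the first alternative that matches, a longer TLD has to appear before any shorter TLD that is a prefix of it.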

Alex3917 commented 2 years ago

You can supply your own list of tlds using:

LINKIFY_URL_REGEX = bleach.linkifier.build_url_re(tlds=top_level_domains, protocols=['http', 'https'])

Then pass it as the url_re kwarg when creating a Linker object (or a LinkifyFilter, if you are linkifying through a Cleaner).

larseggert commented 2 years ago

So I tried something like this:

import bleach
import tlds

tlds = " ".join(tlds.tld_set)
bleach_linker = bleach.Linker(
    url_re=bleach.linkifier.build_url_re(tlds=tlds),
    email_re=bleach.linkifier.build_email_re(tlds=tlds),
    parse_email=True
)
bleach_linker.linkify("Jon Peterson <jon.peterson@team.neustar>")

That results in 'Jon Peterson <<a href="mailto:jon.peterson@team.">jon.peterson@team.</a>neustar>', which may be because both neustar and team are in tld_set. In any event, there is still a bug here.

You might look into identifying domain names the way https://github.com/lipoja/URLExtract does, i.e., by working backwards from potential TLDs.
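
As a rough idea of what "working backwards from the TLD" could look like, here is a minimal sketch. It is not URLExtract's code; KNOWN_TLDS, has_known_tld and split_email_domain are hypothetical names, and the TLD set is a tiny sample rather than a real list.

```python
# Sample set for illustration only; a real list would come from IANA or a package.
KNOWN_TLDS = {"com", "net", "org", "ne", "team", "neustar"}


def has_known_tld(hostname, known_tlds=KNOWN_TLDS):
    """Return True if the label after the last dot is a known TLD."""
    labels = hostname.rstrip(".").lower().split(".")
    # Require at least one non-empty label in front of the TLD.
    return len(labels) >= 2 and all(labels) and labels[-1] in known_tlds


def split_email_domain(address):
    """Return the domain part of `address`, or None if there is no usable '@'."""
    local, sep, domain = address.rpartition("@")
    return domain if sep and local and domain else None


domain = split_email_domain("jon.peterson@team.neustar")
print(domain)                        # team.neustar
print(has_known_tld(domain))         # True
print(has_known_tld("team.neusta"))  # False
```

Looking up the whole last label in a set avoids the prefix problem, since "neustar" is compared as a unit instead of being matched letter by letter.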

Alex3917 commented 2 years ago

Hmm, if I set tlds = ['neustar', 'team'], then I get the expected result with your code snippet:

Jon Peterson <<a href="mailto:jon.peterson@team.neustar">jon.peterson@team.neustar</a>>

larseggert commented 2 years ago

Ah! I need to pass a list into the build_*_re functions, not a string. That fixes it - thanks!

So tlds = tlds.tld_set in the snippet above.
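
For anyone hitting the same thing, here is a small sketch of why a string goes wrong. It assumes the build_*_re functions combine the TLDs with "|".join(...); that matches the behavior seen in this thread, but treat it as an assumption, not a statement about bleach's internals.

```python
import re

tld_list = ["neustar", "team"]
tld_string = " ".join(tld_list)

# Joining a list puts "|" between whole TLDs.
from_list = "|".join(tld_list)
# Joining a string iterates over its characters, spaces included.
from_string = "|".join(tld_string)

print(from_list)    # neustar|team
print(from_string)  # n|e|u|s|t|a|r| |t|e|a|m

# As a regex alternation, the second form can only match a single character.
print(re.fullmatch(from_list, "neustar") is not None)    # True
print(re.fullmatch(from_string, "neustar") is not None)  # False
```

Any iterable of whole TLD strings (a list, tuple, or set) avoids this, because join then sees one item per TLD.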