lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

urlextract without authority causes AttributeError #137

Closed: seanbreckenridge closed this issue 1 year ago

seanbreckenridge commented 1 year ago

Hi, ever since 1.7.0 (in particular, it looks like the change in #135), some URLs cause an error because the authority is None:

on 1.6.0:

In [1]: text = '[[ "$(giturl)" =~ ^https://gitlab.com ]] echo "found" || echo "didnt'

In [2]: import urlextract

In [3]: u = urlextract.URLExtract()

In [4]: list(u.gen_urls(text))
Out[4]: []

(I am not concerned that this fails to find the URL, only that it throws an error.)

on 1.7.0:

In [1]: text = '[[ "$(giturl)" =~ ^https://gitlab.com ]] echo "found" || echo "didnt'

In [2]: import urlextract

In [3]: u = urlextract.URLExtract()

In [4]: list(u.gen_urls(text))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [4], line 1
----> 1 list(u.gen_urls(text))

File ~/.local/lib/python3.10/site-packages/urlextract/urlextract_core.py:792, in URLExtract.gen_urls(self, text, check_dns, get_indices, with_schema_only)
    790 validated = self._validate_tld_match(text, tld, offset + tld_pos)
    791 if tld_pos != -1 and validated:
--> 792     tmp_url = self._complete_url(
    793         text,
    794         offset + tld_pos,
    795         tld,
    796         check_dns=check_dns,
    797         with_schema_only=with_schema_only,
    798     )
    800     if tmp_url:
    801         # do not search for TLD in already extracted URL
    802         tld_pos_url = self._get_tld_pos(tmp_url, tld)

File ~/.local/lib/python3.10/site-packages/urlextract/urlextract_core.py:494, in URLExtract._complete_url(self, text, tld_pos, tld, check_dns, with_schema_only)
    492 if complete_url.startswith(("-", ".", "~", "_")):
    493     complete_url = complete_url[1:]
--> 494 if not self._is_domain_valid(
    495     complete_url, tld, check_dns=check_dns, with_schema_only=with_schema_only
    496 ):
    497     return ""
    499 return complete_url

File ~/.local/lib/python3.10/site-packages/urlextract/urlextract_core.py:581, in URLExtract._is_domain_valid(self, url, tld, check_dns, with_schema_only)
    577 url_parts = uritools.urisplit(url)
    578 # <scheme>://<authority>/<path>?<query>#<fragment>
    579 
    580 # authority can't start with @
--> 581 if url_parts.authority.startswith('@'):
    582     return False
    584 # if URI contains user info and schema was automatically added
    585 # the url is probably an email

AttributeError: 'NoneType' object has no attribute 'startswith'

I believe this needs a check that url_parts.authority is not None before checking whether it starts with @?
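A minimal sketch of the kind of guard I mean. The helper name `authority_starts_with_at` is hypothetical and only for illustration; it is not part of urlextract, and the actual fix may look different. It assumes the authority value is either a string or None, as in the traceback above:

```python
from typing import Optional


def authority_starts_with_at(authority: Optional[str]) -> bool:
    """Return True only when an authority is present and starts with '@'.

    Hypothetical helper for illustration; not part of urlextract.
    """
    # The traceback shows url_parts.authority can be None, so check for
    # None first instead of calling .startswith() on it directly.
    return authority is not None and authority.startswith("@")


# None no longer raises AttributeError; it is simply "not starting with @".
print(authority_starts_with_at(None))
print(authority_starts_with_at("@example.com"))
print(authority_starts_with_at("gitlab.com"))
```

In `_is_domain_valid` the equivalent would be testing `url_parts.authority` for None before the `startswith('@')` call on line 581; how the None case should then be treated (valid or invalid URL) is a separate decision.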

lipoja commented 1 year ago

oh noes, thank you for reporting it @seanbreckenridge !

seanbreckenridge commented 1 year ago

Thanks for the quick fix! Can confirm issue is fixed on 1.7.1