lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License
241 stars 61 forks

Invalid URLs accepted with subdomains #156

Open carton-of-mice opened 8 months ago

If the input contains a candidate hostname with multiple dots, only the label immediately below the TLD (the second-level domain) is validated. Every label to the left of it is accepted unchecked.

>>> import urlextract
>>> print(urlextract.URLExtract().find_urls('sample :--.-.:3.2.com sample'))
[':--.-.:3.2.com']

This report is related to #121. After invalid characters are consumed, __is_domainvalid() applies its validation regex only to host.split(".")[-2], so invalid DNS labels in the earlier parts of the hostname are ignored.
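
To illustrate the difference, here is a minimal standalone sketch that does not use URLExtract's internals. The names `LABEL_RE`, `second_level_only_valid`, and `all_labels_valid` are hypothetical, and the label pattern is an assumption (letters, digits, and hyphens, 1 to 63 characters, no leading or trailing hyphen), not the regex the library actually uses. It contrasts checking only the second-to-last label with checking every label:

```python
import re

# Assumed DNS label rule (not URLExtract's own regex): letters, digits and
# hyphens, 1-63 characters, and no hyphen at the start or end of the label.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def second_level_only_valid(host):
    """Mimic the reported behaviour: validate only host.split('.')[-2]."""
    parts = host.split(".")
    return len(parts) >= 2 and LABEL_RE.fullmatch(parts[-2]) is not None


def all_labels_valid(host):
    """Possible stricter check: every dot-separated label must be valid."""
    parts = host.split(".")
    return len(parts) >= 2 and all(
        LABEL_RE.fullmatch(part) is not None for part in parts
    )


if __name__ == "__main__":
    host = ":--.-.:3.2.com"
    print(second_level_only_valid(host))  # only the label "2" is checked
    print(all_labels_valid(host))         # ":--", "-" and ":3" are checked too
```

With the host from the example above, the first check passes because "2" is a valid label, while the second rejects the host because of ":--", "-", and ":3". Whether a fix should live in the validation step or in the earlier boundary detection is a separate design question.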