lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

Case sensitivity in detecting URLs #76

Closed. philshem closed this issue 3 years ago

philshem commented 4 years ago

I know it's a hot issue, but in practice the domain part of a web URL is case-insensitive.

See this answer to the same question, which points out that at least the domain is case-insensitive, while the rest of what is sent to the server is not: https://tools.ietf.org/html/rfc4343
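To illustrate the distinction, here is a minimal sketch that lowercases only the host part of a URL and leaves the path and query untouched. It uses only the standard library; the helper name lowercase_host is made up for this example and is not part of URLExtract.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host(url: str) -> str:
    """Lowercase the host of a URL; keep path, query and fragment as-is.

    Hypothetical helper for illustration only, not a URLExtract API.
    """
    parts = urlsplit(url)
    # netloc may look like "user:pass@Host:8080"; keep the userinfo
    # untouched and lowercase only the host (and port) portion.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{hostport.lower()}"
    return urlunsplit(parts._replace(netloc=netloc))


print(lowercase_host("https://www.noworks.COM/Some/Path?Q=1"))
```

Only the domain changes case here, which matches the point above: the part after the host may be case-sensitive on the server, so it should not be altered.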

For that reason, I would expect this library to detect URLs based on the lowercased form of the domain, i.e. after .lower()

Here is an example of code that should (IMHO) detect both URLs, but only the second one, with the lowercase TLD, is extracted.

from urlextract import URLExtract
extractor = URLExtract()
print(extractor.find_urls('https://www.noworks.COM'))
print(extractor.find_urls('https://www.works.com'))

gives output

[]
['https://www.works.com']

My versions:

Python 3.8.5 (default, Jul 21 2020, 10:48:26) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin

and

>>> import urlextract
>>> print(urlextract.__version__)
1.0.0

lipoja commented 4 years ago

Thank you @philshem, this is a good suggestion. I agree that TLD matching should be case-insensitive.

lipoja commented 3 years ago

Searching for TLDs will be case-insensitive in the next release.
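For readers wondering what a case-insensitive TLD search can look like, here is a rough sketch. It is not URLExtract's actual implementation: the three-entry SAMPLE_TLDS list, the TLD_PATTERN regex and the contains_known_tld function are all assumptions made up for this example, and a real extractor would work from the full IANA TLD list.

```python
import re

# Hypothetical, tiny TLD list used only for this illustration.
SAMPLE_TLDS = ["com", "org", "net"]

# Match ".tld" that is not followed by another hostname character,
# ignoring case so that ".COM" and ".com" are treated the same.
TLD_PATTERN = re.compile(
    r"\.(?:" + "|".join(map(re.escape, SAMPLE_TLDS)) + r")(?![A-Za-z0-9-])",
    re.IGNORECASE,
)


def contains_known_tld(text: str) -> bool:
    """Return True if the text contains one of the sample TLDs, in any case."""
    return TLD_PATTERN.search(text) is not None


print(contains_known_tld("https://www.noworks.COM"))
print(contains_known_tld("https://www.works.com"))
```

The re.IGNORECASE flag does the work here; an equivalent approach would be to lowercase a copy of the text for TLD lookup only, while returning the URL in its original case.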