lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.
MIT License

Wrong indices with uppercase characters in domain name #117

Closed (tkrissuu closed this issue 1 year ago)

tkrissuu commented 2 years ago

I am getting wrong indices when the domain name of a URL contains uppercase characters.

To reproduce:

from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("www.Google.com", get_indices=True)
print(urls[0])
urls = extractor.find_urls("www.google.com", get_indices=True)
print(urls[0])

Output is:

('www.Google.com', (1, 15))
('www.google.com', (0, 14))

elliotwutingfeng commented 2 years ago

While URI paths are case-sensitive and can contain uppercase characters, the scheme and host of a URI are case-insensitive, and web browsers normally normalize them to lowercase. I think the problem is that the regular expression in gen_urls() fails to account for uppercase characters in the URI scheme and authority.
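Until the cause is confirmed, a quick way to detect the bad indices (and work around them) is to check them against the input text. The sketch below uses only the standard library; `span_is_consistent` and `locate_case_insensitive` are hypothetical helper names, not part of URLExtract, and this is not a fix for gen_urls() itself.

```python
import re


def span_is_consistent(text, url, span):
    """Return True if slicing text with the reported span yields exactly url."""
    start, end = span
    return text[start:end] == url


def locate_case_insensitive(text, url):
    """Hypothetical workaround: find url in text ignoring case.

    Returns a (start, end) tuple for the first match, or None if not found.
    """
    match = re.search(re.escape(url), text, flags=re.IGNORECASE)
    return match.span() if match else None


text = "www.Google.com"

# The indices reported in this issue do not slice back to the URL.
print(span_is_consistent(text, "www.Google.com", (1, 15)))  # False

# Recomputing the span with a case-insensitive search gives the expected result.
print(locate_case_insensitive(text, "www.Google.com"))  # (0, 14)
```

Note that the search returns the first occurrence only, so if the same URL appears several times in the text you would need to start searching from the end of the previous match.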