lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

Wrong indices when the domain name contains the same TLD twice #109

Closed: tkrissuu closed this issue 2 years ago

tkrissuu commented 2 years ago

I am seeing strange behaviour when finding URLs with indices. It is triggered when the TLD also appears earlier in the domain name. In the example below, ".com" appears twice. It also fails with, e.g., "www.dk-hostmaster.dk".

To reproduce:

from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("www.company.com", get_indices=True)
print(urls[0])

Output is: ('www.company.com', (8, 23))

I would have expected: ('www.company.com', (0, 15))
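
A quick way to see why the reported indices are wrong is to slice the input text with them and compare the result to the URL. The sketch below does not call URLExtract at all; it only uses the two tuples quoted above, and the helper name `indices_match` is hypothetical, introduced here for illustration.

```python
def indices_match(text, result):
    """Return True if the (start, end) span in `result` slices `text` to the URL.

    `result` is assumed to have the shape shown in this report:
    (url, (start, end)).
    """
    url, (start, end) = result
    return text[start:end] == url


text = "www.company.com"
reported = ("www.company.com", (8, 23))  # output quoted in the report
expected = ("www.company.com", (0, 15))  # output the reporter expected

print(indices_match(text, reported))  # False
print(indices_match(text, expected))  # True
```

The same check could be applied to each item returned with get_indices=True as a regression test, assuming the result keeps the (url, (start, end)) shape shown above.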

lipoja commented 2 years ago

@tkrissuu Thank you for reporting this issue. Nice catch! I know what the issue is and I will send a fix soon.

lipoja commented 2 years ago

@tkrissuu It should be fixed in v1.5.0. Thanks!

tkrissuu commented 2 years ago

@lipoja Awesome. Thanks! That was quick.