lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

Wrong indices and incomplete extraction when string contains similar urls #142

Closed by variablenerd 5 months ago

variablenerd commented 1 year ago

Hi, the find_urls() method returns incorrect URL indices for the following input: (test_string OR url: https://www.russkiymir.ru/) OR (url: https://russkiymir.ru/en/ OR url: https://www.russkiymir.ru/cn/ OR url: https://www.russkiymir.ru/de/ OR url: 4pt.su). It also fails to extract one of the URLs (https://www.russkiymir.ru/de/).

To reproduce:

from urlextract import URLExtract

# Create the extractor instance used below
url_extract_obj = URLExtract()

# Treat parentheses as stop characters so they are not included in URLs
right_stop = url_extract_obj.get_stop_chars_right() | {')'}
left_stop = url_extract_obj.get_stop_chars_left() | {'('}
url_extract_obj.set_stop_chars_right(right_stop)
url_extract_obj.set_stop_chars_left(left_stop)

s = '(test_string OR url: https://www.russkiymir.ru/) OR (url: https://russkiymir.ru/en/ OR url: https://www.russkiymir.ru/cn/ OR url: https://www.russkiymir.ru/de/ OR url: 4pt.su)'

urls = url_extract_obj.find_urls(s, get_indices=True)
print(urls)
indices = [url[1] for url in urls]
print(indices)
print("")
for index_tuple in indices:
    print(s[index_tuple[0]:index_tuple[1]])

Output:

[('https://www.russkiymir.ru/', (32, 58)), ('https://russkiymir.ru/en/', (58, 83)), ('https://www.russkiymir.ru/cn/', (103, 132)), ('4pt.su', (168, 174))]
[(32, 58), (58, 83), (103, 132), (168, 174)]

.russkiymir.ru/) OR (url: 
https://russkiymir.ru/en/
.russkiymir.ru/cn/ OR url: ht
4pt.su

I'm running this with Python 3.6 on Ubuntu 18.04.6.
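
The mismatch in the output above can be checked mechanically without running the library. The following is a minimal sketch, assuming only the input string and the (url, (start, end)) tuples reported in this issue; the helper name find_mismatched_spans and the variable reported are hypothetical and not part of urlextract. It flags every result whose slice of the input does not equal the URL it is paired with.

```python
from typing import List, Tuple

Span = Tuple[int, int]
Result = Tuple[str, Span]


def find_mismatched_spans(text: str, results: List[Result]) -> List[Result]:
    """Return the results whose (start, end) slice of `text` differs from the URL.

    Hypothetical helper for illustration; it is not part of urlextract.
    """
    return [
        (url, (start, end))
        for url, (start, end) in results
        if text[start:end] != url
    ]


# Input string from the report.
s = (
    "(test_string OR url: https://www.russkiymir.ru/) OR "
    "(url: https://russkiymir.ru/en/ OR url: https://www.russkiymir.ru/cn/ "
    "OR url: https://www.russkiymir.ru/de/ OR url: 4pt.su)"
)

# Results copied from the output reported in this issue.
reported = [
    ("https://www.russkiymir.ru/", (32, 58)),
    ("https://russkiymir.ru/en/", (58, 83)),
    ("https://www.russkiymir.ru/cn/", (103, 132)),
    ("4pt.su", (168, 174)),
]

mismatched = find_mismatched_spans(s, reported)
for url, (start, end) in mismatched:
    # Show the URL next to the text its indices actually point at.
    print(url, (start, end), repr(s[start:end]))

# URLs present in the input but absent from the reported results.
returned_urls = {url for url, _ in reported}
missing = "https://www.russkiymir.ru/de/" in s and "https://www.russkiymir.ru/de/" not in returned_urls
```

With the reported data, the first and third entries are flagged, which matches the two garbled slices in the output, and the /de/ URL is present in the input but missing from the results. The same check could be run against the live return value of find_urls(s, get_indices=True) to confirm whether a given urlextract version is affected.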

lipoja commented 5 months ago

I tested this and it works with the latest version of urlextract.