lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating the TLD.
MIT License

Wrong indices and repeated matches when hostname contains the TLD #155

Open carton-of-mice opened 9 months ago

carton-of-mice commented 9 months ago

Although a call to __get_tldpos() in _genurls() determines the correct position of the TLD using rfind(), this correction has no effect on _tldpos. As a result, the returned indices are incorrect and the offset used on the next loop iteration is invalid. If the same TLD appears multiple times within a hostname, it may match repeatedly. For example:

>>> txt = "String bbb.aaa.bbb.aaa.aaa test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
...     print(out, txt[out[1][0] : out[1][1]])
...
('bbb.aaa.bbb.aaa.aaa', (-5, 14))
('bbb.aaa.bbb.aaa.aaa', (3, 22)) ing bbb.aaa.bbb.aaa
('bbb.aaa.bbb.aaa.aaa', (7, 26)) bbb.aaa.bbb.aaa.aaa
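The three spans above are consistent with one reading of the bug: the URL is completed around the last occurrence of the TLD in the hostname, while the indices are still computed from the occurrence the scan is currently at. The following is a toy model of that reading only, not the library's code; `reported_spans` and its parameters are hypothetical names introduced for illustration.

```python
def reported_spans(text, host, tld):
    """Toy model (assumption, not URLExtract's implementation).

    For every occurrence of `tld` in `host`, shift the true span of `host`
    by the distance between that occurrence and the last occurrence.
    """
    host_start = text.find(host)   # true start of the hostname in the text
    last = host.rfind(tld)         # occurrence the URL is built around
    spans = []
    pos = host.find(tld)           # occurrence the scan is currently at
    while pos != -1:
        drift = last - pos         # correction that never reaches the indices
        spans.append((host_start - drift, host_start + len(host) - drift))
        pos = host.find(tld, pos + 1)
    return spans


txt = "String bbb.aaa.bbb.aaa.aaa test string"
print(reported_spans(txt, "bbb.aaa.bbb.aaa.aaa", ".aaa"))
# [(-5, 14), (3, 22), (7, 26)]
```

Only the last span has zero drift, which is why it is the only one that slices back to the URL.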

If the string also contains a path or query part, further matches may be skipped:

>>> txt = "String http://bbb.aaa.aaa/tests test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
...     print(out, txt[out[1][0] : out[1][1]])
...
('http://bbb.aaa.aaa/tests', (3, 27)) ing http://bbb.aaa.aaa/t
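A small check like the one below could be used to confirm whether a returned span actually slices back to the URL. It relies only on standard string operations; `span_matches` is a hypothetical helper, not part of URLExtract.

```python
def span_matches(text, url, span):
    """Return True if `span` is in bounds and text[start:end] equals `url`."""
    start, end = span
    return 0 <= start <= end <= len(text) and text[start:end] == url


txt = "String http://bbb.aaa.aaa/tests test string"
url = "http://bbb.aaa.aaa/tests"

# Span reported in the output above
print(span_matches(txt, url, (3, 27)))        # False

# Span located with plain str.find(), for comparison
start = txt.find(url)
expected = (start, start + len(url))
print(expected, span_matches(txt, url, expected))   # (7, 31) True
```

The reported span is shifted left by four characters, the same distance as between the first and last ".aaa" in the hostname.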