Although gen_urls() calls __get_tldpos(), which determines the correct position of the TLD using rfind(), this correction is not applied to _tldpos. As a result, the returned indices are incorrect and the offset used on the next loop iteration is invalid.
Should the same TLD appear multiple times within a hostname, it may match repeatedly.
For example:
>>> txt = "String bbb.aaa.bbb.aaa.aaa test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
... print(out, txt[out[1][0] : out[1][1]])
...
('bbb.aaa.bbb.aaa.aaa', (-5, 14))
('bbb.aaa.bbb.aaa.aaa', (3, 22)) ing bbb.aaa.bbb.aaa
('bbb.aaa.bbb.aaa.aaa', (7, 26)) bbb.aaa.bbb.aaa.aaa
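The indices in the output above are consistent with a simple model of the mismatch: the TLD is located in the text by a forward search, while its offset inside the completed URL comes from rfind(). The sketch below is only that model in plain Python, not the library's actual code; the helper names spans_from_forward_search and span_from_url_match are hypothetical and exist purely for illustration.

```python
text = "String bbb.aaa.bbb.aaa.aaa test string"
url = "bbb.aaa.bbb.aaa.aaa"
tld = ".aaa"


def spans_from_forward_search(text, url, tld):
    """Hypothetical model of the bug: every forward hit of the TLD in the
    text is paired with the rfind() offset of the TLD inside the URL."""
    tld_in_url = url.rfind(tld)  # offset of the LAST TLD occurrence in the URL
    spans = []
    pos = text.find(tld)  # FIRST TLD occurrence in the text
    while pos != -1:
        start = pos - tld_in_url  # goes wrong unless pos is the last occurrence
        spans.append((start, start + len(url)))
        pos = text.find(tld, pos + 1)
    return spans


def span_from_url_match(text, url):
    """Span derived from the URL itself rather than from a TLD position."""
    start = text.find(url)
    return (start, start + len(url))


buggy_spans = spans_from_forward_search(text, url, tld)
correct_span = span_from_url_match(text, url)
print(buggy_spans)
print(correct_span)
```

Under this assumption, only the span computed from the last TLD occurrence slices back to the full URL, which matches the third line of the output above.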
If the URL has a path or query part, further matches may be skipped:
>>> txt = "String http://bbb.aaa.aaa/tests test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
... print(out, txt[out[1][0] : out[1][1]])
...
('http://bbb.aaa.aaa/tests', (3, 27)) ing http://bbb.aaa.aaa/t