Closed. Larrax closed this issue 5 years ago.
Thank you for reporting it; I have to debug it. Right now I cannot tell what is causing this issue.
@lipoja Hi, here is a simpler case of a missing URL:
>>> import urlextract
>>> urlextract.__version__
'0.11'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['http://medicalxpress.com/news/2017-09-margarine-butter.html']
From the last command above, two URLs were expected, but only one was returned. To get all the URLs, I have to use a workaround such as the one below:
>>> words = [word for word in text.split() if not word.isalnum()]
>>> [url for s in words for url in url_extractor.find_urls(s)]
['https://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']
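Until a fix lands, the workaround above can be packaged as a small helper. This is a hedged sketch: find_urls_per_word is a hypothetical name, and the extract parameter stands in for url_extractor.find_urls rather than being part of the urlextract API:

```python
def find_urls_per_word(text, extract):
    """Split text on whitespace and run the extractor on each token.

    Works around versions (e.g. 0.11) that miss URLs when several
    appear in one string. 'extract' is any callable mapping a string
    to a list of URLs, e.g. URLExtract().find_urls.
    """
    urls = []
    for word in text.split():
        if not word.isalnum():  # skip plain words; keep URL-like tokens
            urls.extend(extract(word))
    return urls
```

In practice you would pass url_extractor.find_urls as extract; splitting first means each call sees only a single candidate token, which sidesteps the multi-URL bug.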
Please investigate. Thanks.
This issue should be fixed as part of the 0.12.0 release.
Thanks. I confirm that at least my reported example is fixed with 0.12.0:
>>> import urlextract
>>> urlextract.__version__
'0.12.0'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['http://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']
I also tested Larrax's example, which now works as well.
@lipoja There is just the issue of the stray print on line 624 in urlextract_core.py.
@impredicative Thanks! The forgotten print is removed in 0.12.1.
I am satisfied. I will leave it to @Larrax to also test 0.12.1 and perhaps try to come up with a failing example, if that is even possible.
OK, closing issue. @Larrax can reopen it if some related bug is found.
Thanks @impredicative for testing! I should not do late-night releases ...
I have run into some text (email spam) that find_urls fails to extract all URLs from. There are 12 URLs in the example input, but urlextract finds only 7 of them.
The behavior is really strange. For example, if I remove the following URL from the input:
https://content.idassociates.ca/images/shopico_new/spacer.png
all the remaining 11 URLs are found. EDIT: Sorry for not posting a smaller test input, but all bigger modifications led to the module working properly. Used version: 0.10.