lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from given text, based on locating TLDs.
MIT License

Missing URLs using find_urls #42

Closed Larrax closed 5 years ago

Larrax commented 5 years ago

I have run into some text (email spam) that find_urls fails to extract all URLs from. Example input:

One night's accommodation, double occupancy http://example.com/gxhcht-5kdpwgk3/ 

$8000 http://example.com/gxhchu-5kdpwgk4/   
Value $166.00 http://example.com/gxhchv-5kdpwgk5/   
 https://content.idassociates.ca/images/shopico_new/spacer.png   http://example.com/gxhchw-5kdpwgk6/    

Like in the South - Deluxe Room http://example.com/gxhchy-5kdpwgk8/ 

$20308 http://example.com/gxhchz-5kdpwgk9/  
Value $406.19 http://example.com/gxhci0-5kdpwgk6/   
 https://content.idassociates.ca/images/shopico_new/spacer.png  
 http://example.com/gxhci1-5kdpwgk7/    
Camping de la rivire Nicolet http://example.com/gxhci2-5kdpwgk8/ 

Accommodations / Cottage http://example.com

There are 12 URLs, but urlextract finds only 7 of them. Found URLs:

http://example.com/gxhcht-5kdpwgk3/
http://example.com/gxhchu-5kdpwgk4/
http://example.com/gxhchv-5kdpwgk5/
https://content.idassociates.ca/images/shopico_new/spacer.png
http://example.com/gxhci1-5kdpwgk7/
http://example.com/gxhci2-5kdpwgk8/
http://example.com

The behavior is really strange. For example, if I remove the following URL from the input: https://content.idassociates.ca/images/shopico_new/spacer.png, all the remaining 11 URLs are found. EDIT: Sorry for not posting a smaller test input, but every larger modification I tried made the module work properly.

Used version 0.10
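For reference, a plain regex pass over the same sample text does find all 12 URLs, which suggests the problem lies in URLExtract's TLD-based scanning rather than in the input itself. A minimal stdlib-only sketch (this is not URLExtract's algorithm; the regex is a rough approximation that only handles explicit http/https schemes):

```python
import re

# The sample spam text from above, with the same 12 URLs.
text = """One night's accommodation, double occupancy http://example.com/gxhcht-5kdpwgk3/
$8000 http://example.com/gxhchu-5kdpwgk4/
Value $166.00 http://example.com/gxhchv-5kdpwgk5/
https://content.idassociates.ca/images/shopico_new/spacer.png http://example.com/gxhchw-5kdpwgk6/
Like in the South - Deluxe Room http://example.com/gxhchy-5kdpwgk8/
$20308 http://example.com/gxhchz-5kdpwgk9/
Value $406.19 http://example.com/gxhci0-5kdpwgk6/
https://content.idassociates.ca/images/shopico_new/spacer.png
http://example.com/gxhci1-5kdpwgk7/
Camping de la rivire Nicolet http://example.com/gxhci2-5kdpwgk8/
Accommodations / Cottage http://example.com"""

# Match an explicit scheme followed by any run of non-whitespace characters.
urls = re.findall(r"https?://\S+", text)
print(len(urls))  # 12
```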

lipoja commented 5 years ago

Thank you for reporting it; I will have to debug it. Right now I cannot tell what is causing this issue.

impredicative commented 5 years ago

@lipoja Hi, here is a simpler case of a missing URL:

>>> import urlextract
>>> urlextract.__version__
'0.11'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['https://medicalxpress.com/news/2017-09-margarine-butter.html']

From the last command above, two URLs were expected, but only one was returned. To get all the URLs, I am having to use a workaround such as the one below:

>>> words = [word for word in text.split() if not word.isalnum()]
>>> [url for s in words for url in url_extractor.find_urls(s)]
['https://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']
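The workaround above can be wrapped into a small helper that accepts any extractor callable, so URLExtract can be swapped back in once the bug is fixed. A sketch; the name find_urls_per_word is my own, and the naive regex extractor is only a stand-in for url_extractor.find_urls, not part of the library:

```python
import re
from typing import Callable, List

def find_urls_per_word(text: str, extract: Callable[[str], List[str]]) -> List[str]:
    """Split on whitespace, skip purely alphanumeric words (which cannot
    contain a URL scheme or path), and run the given extractor on each
    remaining word, concatenating the results."""
    words = [word for word in text.split() if not word.isalnum()]
    return [url for word in words for url in extract(word)]

# Stand-in extractor for demonstration; with urlextract installed you
# would pass url_extractor.find_urls here instead.
def naive_extract(s: str) -> List[str]:
    return re.findall(r"https?://\S+", s)

text = ('http://medicalxpress.com/news/2017-09-margarine-butter.html '
        'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html')
print(find_urls_per_word(text, naive_extract))
# Both URLs are returned, one per whitespace-separated word.
```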

Please investigate. Thanks.

lipoja commented 5 years ago

This issue should be fixed as part of 0.12.0 release

impredicative commented 5 years ago

Thanks. I confirm that at least my reported example is fixed with 0.12.0:

>>> import urlextract
>>> urlextract.__version__
'0.12.0'
>>> url_extractor = urlextract.URLExtract()
>>> url_extractor.find_urls('https://google.com https://bing.com')
['https://google.com', 'https://bing.com']
>>> text = 'http://medicalxpress.com/news/2017-09-margarine-butter.html https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html'
>>> url_extractor.find_urls(text)
['http://medicalxpress.com/news/2017-09-margarine-butter.html', 'https://medicalxpress.com/news/2013-12-healthier-butter-ormargarine.html']

I also tested Larrax's example, which now works too.

impredicative commented 5 years ago

@lipoja There is just the issue of the stray print on line 624 of urlextract_core.py.

lipoja commented 5 years ago

@impredicative Thanks! The forgotten print is removed in 0.12.1.

impredicative commented 5 years ago

I am satisfied. I will leave it to @Larrax to also test 0.12.1 and perhaps to come up with a failing example, if that is even possible.

lipoja commented 5 years ago

OK, closing issue. @Larrax can reopen it if some related bug is found.

Thanks @impredicative for testing! I should not do late-night releases ...