rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

Fix relative paths #77

Closed lukavia closed 3 years ago

lukavia commented 3 years ago

A quick fix to stop the crawler from creating incorrect relative URLs (#73). It also optimizes the process by not trying to parse images and videos.
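For readers following along, the two ideas in this description can be sketched as below. This is a minimal illustration and not the code in this pull request: the helper names `resolve_link` and `is_probably_media` and the extension list are assumptions made for the example; only `urllib.parse.urljoin` and `urlsplit` are standard-library calls.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# Hypothetical extension list; a real crawler might inspect Content-Type instead.
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".mp4", ".webm", ".avi"}


def resolve_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    # urljoin handles relative, root-relative, and protocol-relative links.
    return urljoin(page_url, href)


def is_probably_media(url):
    """Guess from the path extension whether a URL points at an image or video."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    return extension in MEDIA_EXTENSIONS


if __name__ == "__main__":
    base = "https://example.com/a/b.html"
    for href in ("c.html", "/c.html", "../c.html", "//cdn.example.com/x.js"):
        print(href, "->", resolve_link(base, href))
    print(is_probably_media("https://example.com/logo.svg.png"))
```

Skipping the HTML parse for URLs where `is_probably_media` returns true is one way to avoid wasted work, provided any variables the later steps rely on are still initialized for the skipped case.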

nmullane commented 3 years ago

This pull request seems to cause errors when reading robots.txt files, which makes the program terminate very quickly. I've copied the error log below.

2021-01-05 11:20:22,468 - SPIDY - INFO - Successfully started crawler.
 2021-01-05 11:20:23,182 - SPIDY - ERROR -
 URL: https://mediawiki.org/
 ERROR: Unknown
 EXT: Invalid URL 'https:/robots.txt/robots.txtmediawiki.org/robots.txt': No host supplied

 2021-01-05 11:20:26,167 - SPIDY - ERROR -
 URL: https://en.wikivoyage.org/
 ERROR: Unknown
 EXT: Invalid URL 'https:/robots.txt/robots.txten.wikivoyage.org/robots.txt': No host supplied

 2021-01-05 11:20:27,400 - SPIDY - ERROR -
 URL: https://meta.wikimedia.org/
 ERROR: Unknown
 EXT: Invalid URL 'https:/robots.txt/robots.txtmeta.wikimedia.org/robots.txt': No host supplied

 2021-01-05 11:20:28,440 - SPIDY - ERROR -
 URL: https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Wikidata-logo.svg/47px-Wikidata-logo.svg.png
 ERROR: Unknown
 EXT: local variable 'word_list' referenced before assignment

 2021-01-05 11:20:28,780 - SPIDY - ERROR -
 URL: https://wikimediafoundation.org/
 ERROR: Unknown
 EXT: Invalid URL 'https:/robots.txt/robots.txtwikimediafoundation.org/robots.txt': No host supplied

lukavia commented 3 years ago

I've committed a fix. I suspect this problem already existed before this change, but either way it should work now.
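For context, the malformed URLs in the log (`https:/robots.txt/robots.txtmediawiki.org/robots.txt`) look like the result of assembling the robots.txt address from string pieces. A minimal sketch of building it from the parsed URL instead is shown below. It is not the committed fix, and the function name `robots_url` is an assumption made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        # Without a scheme and host there is no site to ask for robots.txt.
        raise ValueError(f"Not an absolute URL: {page_url!r}")
    # Keep scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://mediawiki.org/"))
    print(robots_url("https://en.wikivoyage.org/wiki/Main_Page?x=1#top"))
```

Working from the parsed components means the result does not depend on whether the page URL ends in a slash or carries a path, query, or fragment.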

rivermont commented 3 years ago

Thanks for this change, @lukavia! From what I can tell it resolves #73 so I'll also close that.