Closed lukavia closed 3 years ago
This pull request seems to cause errors when reading the robots.txt file, resulting in very quick program termination. I've copied the error log below.
2021-01-05 11:20:22,468 - SPIDY - INFO - Successfully started crawler.
2021-01-05 11:20:23,182 - SPIDY - ERROR -
URL: https://mediawiki.org/
ERROR: Unknown
EXT: Invalid URL 'https:/robots.txt/robots.txtmediawiki.org/robots.txt': No host supplied
2021-01-05 11:20:26,167 - SPIDY - ERROR -
URL: https://en.wikivoyage.org/
ERROR: Unknown
EXT: Invalid URL 'https:/robots.txt/robots.txten.wikivoyage.org/robots.txt': No host supplied
2021-01-05 11:20:27,400 - SPIDY - ERROR -
URL: https://meta.wikimedia.org/
ERROR: Unknown
EXT: Invalid URL 'https:/robots.txt/robots.txtmeta.wikimedia.org/robots.txt': No host supplied
2021-01-05 11:20:28,440 - SPIDY - ERROR -
URL: https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Wikidata-logo.svg/47px-Wikidata-logo.svg.png
ERROR: Unknown
EXT: local variable 'word_list' referenced before assignment
2021-01-05 11:20:28,780 - SPIDY - ERROR -
URL: https://wikimediafoundation.org/
ERROR: Unknown
EXT: Invalid URL 'https:/robots.txt/robots.txtwikimediafoundation.org/robots.txt': No host supplied
I've committed a fix. I would actually have expected this problem to exist before this change as well, but in any case it should work now.
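For context, the malformed URLs in the log above (e.g. `https:/robots.txt/robots.txtmediawiki.org/robots.txt`) look like the result of building the robots.txt location by string concatenation rather than from the URL's parsed components. A minimal sketch of the safer approach, using only the standard library (`robots_txt_url` is a hypothetical helper name, not spidy's actual function):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Derive the robots.txt URL for a page by keeping only the
    scheme and host, then appending the fixed /robots.txt path."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Each URL from the log above maps cleanly to its host's robots.txt:
print(robots_txt_url("https://mediawiki.org/"))
# https://mediawiki.org/robots.txt
print(robots_txt_url("https://en.wikivoyage.org/some/deep/page"))
# https://en.wikivoyage.org/robots.txt
```

Because the path component is replaced wholesale, the page's own path can never leak into the robots.txt URL the way it does in the errors above.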
Thanks for this change, @lukavia! From what I can tell it resolves #73, so I'll close that as well.
A quick fix to avoid creating incorrect relative URLs (#73). It also optimizes the process by not attempting to parse images and videos.
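The image/video optimization mentioned above can be sketched as a simple extension check before parsing; the extension list and the `should_parse` helper here are illustrative assumptions, not spidy's actual implementation:

```python
from urllib.parse import urlsplit

# Hypothetical skip list; the real crawler's set of extensions may differ.
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg",
                    ".webm", ".mp4", ".avi", ".mov"}

def should_parse(url):
    """Return False for links pointing at images or videos, so the
    crawler does not try to parse their bodies as HTML."""
    path = urlsplit(url).path.lower()
    return not any(path.endswith(ext) for ext in MEDIA_EXTENSIONS)

# The PNG thumbnail from the log above would now be skipped:
print(should_parse("https://upload.wikimedia.org/wikipedia/commons/"
                   "thumb/f/ff/Wikidata-logo.svg/47px-Wikidata-logo.svg.png"))
# False
print(should_parse("https://mediawiki.org/"))
# True
```

Checking `urlsplit(url).path` rather than the raw string avoids false matches against query strings or fragments.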