rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

String Index Error on perfectly normal URLs #54

Closed by rivermont 6 years ago

rivermont commented 6 years ago


Expected Behavior

No errors.

Actual Behavior

Seemingly at random, crawling a URL fails with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

  1. Run the crawler.
  2. Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range
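
For illustration, a minimal snippet outside the crawler reproduces the same IndexError; the name link here simply mirrors the variable in the traceback:

    # Indexing into an empty string raises IndexError, which is what
    # happens at crawler.py line 260 when link is an empty string.
    link = ''
    if link[0] == '/':  # IndexError: string index out of range
        print('relative link')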


Hrily commented 6 years ago

This happens because some of the crawled links are empty strings, so indexing link[0] fails before any comparison can happen.

I'll send a PR that adds a check for empty links.
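
A minimal sketch of that kind of guard, assuming a links iterable and a base_url string (both placeholder names, not necessarily spidy's); the actual PR may look different:

    def resolve_links(links, base_url):
        # Skip empty strings before touching link[0]; '' is falsy in Python.
        resolved = []
        for link in links:
            if not link:
                continue
            if link[0] == '/':  # safe now: link is guaranteed non-empty
                link = base_url + link
            resolved.append(link)
        return resolved

As a side effect, the "if not link" test would also skip a None value, should the HTML parser ever return one for a malformed attribute.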