scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License

Enhance exception description when CRAWLERA_URL is missing the scheme #77

Closed · raphapassini closed this 5 years ago

raphapassini commented 5 years ago

If you inadvertently set your CRAWLERA_URL setting without the URL scheme, like:

CRAWLERA_URL = "proxy.crawlera.com:8010"

you'll receive a non-descriptive Twisted exception when trying to crawl an http:// URL:

Traceback (most recent call last):
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/bin/scrapy", line 10, in <module>
    sys.exit(execute())
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/commands/shell.py", line 74, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/shell.py", line 47, in start
    self.fetch(url, spider, redirect=redirect)
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/scrapy/shell.py", line 120, in fetch
    reactor, self._schedule, request, spider
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/home/raphael/.virtualenvs/myproject-jQmK5Pxo/lib/python3.6/site-packages/twisted/python/failure.py", line 488, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

I think a good approach would be to detect the missing scheme in CRAWLERA_URL and raise a descriptive exception. This could be done in the spider_opened signal handler we already listen to in CrawleraMiddleware.
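
For illustration, a minimal sketch of such a check (the helper name and the exact exception type are assumptions, not what the middleware actually does; the idea is just to fail fast with a readable message instead of letting Twisted raise SchemeNotSupported much later):

from urllib.parse import urlparse

def _check_crawlera_url(url):
    # Hypothetical helper that could be called from the spider_opened
    # handler. Note that on recent Python versions
    # urlparse("proxy.crawlera.com:8010") can report the host as the
    # scheme, so we check for the schemes we accept rather than merely
    # testing that *some* scheme is present.
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(
            "CRAWLERA_URL %r has no URL scheme; expected something like "
            "'http://proxy.crawlera.com:8010'" % url
        )

_check_crawlera_url("proxy.crawlera.com:8010")  # raises ValueError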

Gallaecio commented 5 years ago

There’s a pull request implementing a similar improvement in scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash/pull/214

ejulio commented 5 years ago

I'd also add a check to ensure http:// instead of https://. I had some trouble setting the Crawlera URL with https:// :sweat_smile:
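
If that http-only check were layered on top of the sketch above, it could be a soft warning rather than a hard failure (an illustrative assumption, not necessarily how the eventual fix handles it):

import warnings
from urllib.parse import urlparse

def _warn_about_https(url):
    # Hypothetical companion check: the proxy endpoint is normally
    # reached over http://, so flag an https:// CRAWLERA_URL early.
    if urlparse(url).scheme == "https":
        warnings.warn(
            "CRAWLERA_URL %r uses https://; the Crawlera proxy endpoint "
            "is usually configured with http://" % url
        )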

ejulio commented 5 years ago

Created https://github.com/scrapy-plugins/scrapy-crawlera/pull/81 to fix it. cc @Gallaecio, @raphapassini, @hcoura, @denisgermano