scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

Timeout error trying scrapy shell against dockerized web site #3

Closed: juananpe closed this issue 8 years ago

juananpe commented 8 years ago

Hi there,

I am having fun trying to set up the Vagrant/Docker network on OS X 10.11.3 (El Capitan). First, a warning for all other OS X folks out there: don't use Vagrant 1.7.x or you will be stuck with nonsensical errors in your console. Use 1.8.1 (or a newer version).

Now, on to my problem. I can see all the Docker boxes. I can even ssh into them (vagrant ssh works like a charm). From there, I can see that the web box is running fine and responding to HTTP queries on tcp/9312 as well:

root@dev:~/book# telnet web 9312
Trying 172.17.0.2...
Connected to web.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 200 OK
Date: Tue, 15 Mar 2016 08:50:03 GMT
Content-Length: 261
Content-Type: text/html
Server: TwistedWeb/15.5.0

Resource not found. Try: <a href="properties/index_00000.html">properties</a> <a href="images">images</a>, <a href="dynamic">dynamic</a>, <a href="benchmark/">benchmark</a> <a href="maps/api/geocode/json?sensor=false&address=Camden%20Town%2C%20London">maps</a> Connection closed by foreign host.

But now, following the book (p. 113, section "The URL"), if I try to use the scrapy shell to connect to http://web:9312, I get a timeout error that I can't grok:

root@dev:~/book# scrapy shell http://web:9312
2016-03-15 08:51:17 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-03-15 08:51:17 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-03-15 08:51:17 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2016-03-15 08:51:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-03-15 08:51:17 [boto] DEBUG: Retrieving credentials from metadata server.
2016-03-15 08:51:18 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-03-15 08:51:18 [boto] ERROR: Unable to read instance data, giving up
2016-03-15 08:51:18 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 08:51:18 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 08:51:18 [scrapy] INFO: Enabled item pipelines:
2016-03-15 08:51:18 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 08:51:18 [scrapy] INFO: Spider opened
2016-03-15 08:51:18 [scrapy] DEBUG: Crawled (200) <GET http://web:9312> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fd50cd79b10>
[s]   item       {}
[s]   request    <GET http://web:9312>
[s]   response   <200 http://web:9312>
[s]   settings   <scrapy.settings.Settings object at 0x7fd50cd79a90>
[s]   spider     <DefaultSpider 'default' at 0x7fd50bc80b50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Any help will be much appreciated.

Greetings,

Juanan

juananpe commented 8 years ago

Answering my own question: it seems that the boto library shows this error when it can't connect to an S3/AWS host. Just set the variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to an empty string and you are good to go:

$ export AWS_ACCESS_KEY_ID="" && export AWS_SECRET_ACCESS_KEY="" && scrapy shell http://web:9312/properties/property_000000.html

(I suppose there is a more elegant way to bypass this error, but that line works for me :)
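
For a slightly less ad-hoc option, another approach is to disable Scrapy's S3 download handler in the project's settings.py, which should keep boto from probing the AWS metadata server at startup. This is only a sketch and assumes you never need to fetch s3:// URLs:

# settings.py -- sketch of an alternative workaround; assumes s3:// URLs are not needed
DOWNLOAD_HANDLERS = {
    's3': None,  # assigning None disables the S3 download handler, so boto is not initialized
}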

lookfwd commented 8 years ago

Thanks a lot @juananpe for the clarification. This refers to page 113 of the e-book, or page 92 of the printed book.

You are right about exporting the AWS_* credentials. This is exactly what I do in settings.py for Chapter 5 and every other chapter. As you say, it's just an annoying boto detail, and a non-elegant workaround is fine.

Hopefully this won't be a very common problem. If you run your scrapy shell command from within the ch05/properties directory, it should work fine because scrapy shell automatically picks up settings.py. My guess is you ran scrapy shell from a top-level directory.
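
For example (assuming the layout described above, with the book's code under ~/book and ch05/properties being the Scrapy project directory):

$ cd ~/book/ch05/properties
$ scrapy shell http://web:9312/properties/property_000000.html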

It's a great clarification. Thanks a million!

find4u commented 7 years ago

I just added the following to settings.py and hey presto:

# Disable S3
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""

Hope this helps someone else

Newbie to Py and Scrapy

Richard