scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

problem in example page 46 (populating an item) #47

Closed MasRa closed 6 years ago

MasRa commented 6 years ago

Hi, could you please help with this? I followed the example on page 46 step by step, but I got the following output instead of what the book shows:

root@dev:~/book/MasoudProject/properties# scrapy crawl basic
2018-02-04 14:40:25 [scrapy] INFO: Scrapy 1.0.3 started (bot: properties)
2018-02-04 14:40:25 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-02-04 14:40:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'properties.spiders', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties'}
2018-02-04 14:40:25 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-02-04 14:40:25 [boto] DEBUG: Retrieving credentials from metadata server.
2018-02-04 14:40:25 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-02-04 14:40:25 [boto] ERROR: Unable to read instance data, giving up
2018-02-04 14:40:25 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-02-04 14:40:25 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-02-04 14:40:25 [scrapy] INFO: Enabled item pipelines:
2018-02-04 14:40:25 [scrapy] INFO: Spider opened
2018-02-04 14:40:25 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-02-04 14:40:25 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-02-04 14:40:25 [scrapy] DEBUG: Crawled (200) <GET http://web:9312/properties/property_000000.html> (referer: None)
2018-02-04 14:40:25 [scrapy] ERROR: Spider error processing <GET http://web:9312/properties/property_000000.html> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/book/MasoudProject/properties/properties/spiders/basic.py", line 38, in parse
    item['address'] = response.xpath('//*[@itemtype="http://schema.org/''Place"][1]/text()').extract()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 63, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'PropertiesItem does not support field: address'
2018-02-04 14:40:25 [scrapy] INFO: Closing spider (finished)
2018-02-04 14:40:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 792,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 736406),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/KeyError': 1,
 'start_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 241964)}
2018-02-04 14:40:25 [scrapy] INFO: Spider closed (finished)

Could you please guide me on how to fix it? Thank you
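
The KeyError at the end of the traceback ('PropertiesItem does not support field: address') means the spider assigns item['address'] while the PropertiesItem class in items.py declares no address field. Below is a minimal sketch of the kind of declaration the page-46 example relies on; apart from address, the field names are assumptions for illustration, not code taken from this thread:

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Every key the spider assigns in parse() must be declared here,
    # otherwise Scrapy raises the KeyError seen in the traceback above.
    title = Field()
    price = Field()
    description = Field()
    address = Field()      # the field reported as missing
    image_urls = Field()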

lookfwd commented 6 years ago

Hello, this looks mostly OK, with something minor that a restart could fix. Let's arrange a TeamViewer session and I can fix it quickly.

lookfwd commented 6 years ago

So this happened while you were working with your own copy, which has a different settings.py than the one from the chapter. This is the known boto problem with that version of Scrapy. It's nothing important - essentially just a warning - and the rest of the crawl should be fine. One way to mitigate it is to add the following two lines to settings.py:

# Disable S3
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
MasRa commented 6 years ago

Thank you so much.