Closed. MasRa closed this issue 6 years ago.
Hello, this looks mostly OK, with something minor that a restart could fix. Let's arrange a TeamViewer session and I can quickly fix it.
On Sun, Feb 4, 2018 at 10:09 AM MasRa notifications@github.com wrote:
Hi, could you please help with this? I followed the example on page 46 step by step, exactly as written, but I got the following output instead of what the book shows:
root@dev:~/book/MasoudProject/properties# scrapy crawl basic
2018-02-04 14:40:25 [scrapy] INFO: Scrapy 1.0.3 started (bot: properties)
2018-02-04 14:40:25 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-02-04 14:40:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'properties.spiders', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties'}
2018-02-04 14:40:25 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-02-04 14:40:25 [boto] DEBUG: Retrieving credentials from metadata server.
2018-02-04 14:40:25 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-02-04 14:40:25 [boto] ERROR: Unable to read instance data, giving up
2018-02-04 14:40:25 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-02-04 14:40:25 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-02-04 14:40:25 [scrapy] INFO: Enabled item pipelines:
2018-02-04 14:40:25 [scrapy] INFO: Spider opened
2018-02-04 14:40:25 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-02-04 14:40:25 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-02-04 14:40:25 [scrapy] DEBUG: Crawled (200) <GET http://web:9312/properties/property_000000.html> (referer: None)
2018-02-04 14:40:25 [scrapy] ERROR: Spider error processing <GET http://web:9312/properties/property_000000.html> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/book/MasoudProject/properties/properties/spiders/basic.py", line 38, in parse
    item['address'] = response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 63, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'PropertiesItem does not support field: address'
2018-02-04 14:40:25 [scrapy] INFO: Closing spider (finished)
2018-02-04 14:40:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 792,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 736406),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/KeyError': 1,
 'start_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 241964)}
2018-02-04 14:40:25 [scrapy] INFO: Spider closed (finished)

Could you please guide me on how to fix this? Thank you
So this happened while playing with your own copy, which has a different settings.py than the one in the chapter. This was the boto problem with that version of Scrapy. It is nothing important, essentially just a warning, and the rest of the crawl should be fine. One way to mitigate it is to add the following two lines to settings.py:
# Disable S3
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
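The intent of these settings is to stop boto from querying the EC2 metadata server for AWS credentials, which is what produces the "Network is unreachable" traceback at the top of your log. The second error in your log is separate, though: the KeyError means the spider assigns item['address'], but the PropertiesItem class in your items.py does not declare an address field. As a minimal sketch of what items.py needs (the fields besides address are assumptions based on the chapter's example, not a definitive listing):

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Every key the spider assigns (e.g. item['address'] = ...) must be
    # declared as a Field here, otherwise Scrapy raises
    # KeyError: 'PropertiesItem does not support field: address'
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

With the missing field declared, running scrapy crawl basic again should populate the item as shown in the book.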
Thank you so much.