scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

KeyError: b'frontier_request' #337

Closed nmweizi closed 6 years ago

nmweizi commented 6 years ago

2018-07-28 16:32:46 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: nmgkInfoCrawl)
2018-07-28 16:32:46 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (default, Jan 4 2018, 16:40:53) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-07-28 16:32:46 [scrapy.crawler] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'BOT_NAME': 'nmgkInfoCrawl', 'CONCURRENT_REQUESTS': 64, 'COOKIES_ENABLED': False, 'HTTPCACHE_IGNORE_HTTP_CODES': [403], 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'nmgkInfoCrawl.spiders', 'RETRY_TIMES': 5, 'SCHEDULER': 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler', 'SPIDER_MODULES': ['nmgkInfoCrawl.spiders']}
2018-07-28 16:32:46 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-07-28 16:32:46 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats', 'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware']
2018-07-28 16:32:46 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware', 'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware']
2018-07-28 16:32:46 [scrapy.middleware] INFO: Enabled item pipelines: ['nmgkInfoCrawl.save_sqlite.scrapyPipeline_sqlite']
2018-07-28 16:32:46 [scrapy.core.engine] INFO: Spider opened
2018-07-28 16:32:46 [manager] INFO: --------------------------------------------------------------------------------
2018-07-28 16:32:46 [manager] INFO: Starting Frontier Manager...
2018-07-28 16:32:46 [manager] INFO: Frontier Manager Started!
2018-07-28 16:32:46 [manager] INFO: --------------------------------------------------------------------------------
2018-07-28 16:32:46 [frontera.contrib.scrapy.schedulers.FronteraScheduler] INFO: Starting frontier
2018-07-28 16:32:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-28 1

2018-07-28 16:12:14 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.nm.zsks.cn/18gkwb/index_5.html> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
    frontier_request = response.meta[b'frontier_request']
KeyError: b'frontier_request'

sibiryakov commented 6 years ago

Hi @nmweizi, it looks like this request was generated by Scrapy (not Frontera). Could you post your spider code and crawling strategy here?
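The failure mode can be reproduced without Scrapy at all. The dict below stands in for `response.meta`, and `process_response_meta` is a hypothetical stand-in for the scheduler's lookup at `frontier.py` line 112 in the traceback; the placeholder value is an assumption for illustration:

```python
# Minimal sketch of the mechanism: Frontera's scheduler looks up
# response.meta[b'frontier_request'] for every response, a key it only
# stamps on requests that went through the frontier.  A request created
# directly by the spider (start_urls, yield scrapy.Request(...)) arrives
# without it, and the plain dict lookup raises the KeyError from the log.
def process_response_meta(meta):
    # mirrors the scheduler's lookup: no .get() fallback
    return meta[b'frontier_request']

frontera_meta = {b'frontier_request': '<frontier request object>'}  # stamped by the frontier
plain_meta = {}  # request created by Scrapy itself, never saw the frontier

assert process_response_meta(frontera_meta) == '<frontier request object>'
try:
    process_response_meta(plain_meta)
except KeyError as err:
    print('KeyError:', err)  # KeyError: b'frontier_request'
```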

sibiryakov commented 6 years ago

Please note that since 0.8, seed addition has been moved out of Scrapy and is delegated to Frontera.
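For reference, in Frontera 0.8+ seeds are fed in through a crawling strategy rather than the Scrapy spider. The sketch below shows the shape of that logic as a plain function; in a real project it would live in a `read_seeds` method on a `BaseCrawlingStrategy` subclass, with `schedule` and `create_request` being the strategy's own frontier callbacks. Treat the exact class and method names as assumptions from my recollection of the docs and check them against your Frontera version:

```python
# Hedged sketch of the Frontera >= 0.8 seed-injection flow, written as a
# plain function so it is easy to see what the strategy does: every
# non-empty line of a seeds stream is wrapped in a frontier request and
# scheduled.  Requests created this way carry the b'frontier_request'
# meta key, which is exactly what the spider-made requests in this
# issue's traceback were missing.
def read_seeds(stream, schedule, create_request):
    """Schedule one frontier request per non-empty line; return the count."""
    count = 0
    for line in stream:
        url = line.strip()
        if url:
            schedule(create_request(url))
            count += 1
    return count
```

As a usage illustration, `read_seeds(open('seeds.txt'), self.schedule, self.create_request)` would be the rough equivalent of what the strategy's own `read_seeds` method does when seeds are injected from a file.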

nmweizi commented 6 years ago

@sibiryakov It's fine now, thanks.