realpython / stack-spider

https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/

test Scrapy crawler does not return anything #5

Open ilovefood2 opened 5 years ago

ilovefood2 commented 5 years ago

I know this is an old post, but I followed part 2 of https://realpython.com/web-scraping-and-crawling-with-scrapy-and-mongodb/ and tested it with the download source v2.
When I run the scrapy crawl stack_crawler command I don't get any output back, and no errors either.

Any idea where the problem is?

ilovefood2 commented 5 years ago

2019-08-19 23:05:54 [scrapy] INFO: Scrapy 1.0.3 started (bot: stack)
2019-08-19 23:05:54 [scrapy] INFO: Optional features available: ssl, http11, boto
2019-08-19 23:05:54 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0', 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'stack'}
2019-08-19 23:05:54 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2019-08-19 23:05:54 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2019-08-19 23:05:54 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2019-08-19 23:05:54 [py.warnings] WARNING: /usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/pipelines/__init__.py:21: ScrapyDeprecationWarning: ITEM_PIPELINES defined as a list or a set is deprecated, switch to a dict
  category=ScrapyDeprecationWarning, stacklevel=1)

2019-08-19 23:05:55 [py.warnings] WARNING: /Users/kelvin/Downloads/stack-spider-2/stack/stack/pipelines.py:17: ScrapyDeprecationWarning: Module scrapy.log has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log

2019-08-19 23:05:55 [scrapy] INFO: Enabled item pipelines: MongoDBPipeline
2019-08-19 23:05:55 [scrapy] INFO: Spider opened
2019-08-19 23:05:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-19 23:05:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-08-19 23:05:55 [scrapy] DEBUG: Redirecting (301) to <GET https://stackoverflow.com/questions?pagesize=50&sort=newest> from <GET http://stackoverflow.com/questions?pagesize=50&sort=newest>
2019-08-19 23:06:02 [scrapy] DEBUG: Crawled (200) <GET https://stackoverflow.com/questions?pagesize=50&sort=newest> (referer: None)
2019-08-19 23:06:02 [scrapy] INFO: Closing spider (finished)
2019-08-19 23:06:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 617, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 55541, 'downloader/response_count': 2, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/301': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 8, 20, 3, 6, 2, 615429), 'log_count/DEBUG': 3, 'log_count/INFO': 7, 'log_count/WARNING': 2, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2019, 8, 20, 3, 5, 55, 67411)}
2019-08-19 23:06:02 [scrapy] INFO: Spider closed (finished)
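For reference, the log above shows the crawler fetching the start URL (the 200 response), following no further links, and closing with 0 items scraped. That is the typical behavior when a CrawlSpider's link-extraction rule or XPath selectors no longer match the live page markup. Below is a minimal sketch of what such a crawler looks like, assuming the project layout from the tutorial (a stack.items.StackItem with title and url fields); the allow= pattern and the class names in the XPaths are assumptions and may not match Stack Overflow's current HTML:

# stack/spiders/stack_crawler.py -- a minimal sketch, not the tutorial's exact code.
# The allow= pattern and the XPath class names are assumptions; if they no longer
# match the live markup, the spider crawls only the start page and closes with
# 0 items, exactly as in the log above.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from stack.items import StackItem


class StackCrawlerSpider(CrawlSpider):
    name = 'stack_crawler'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions?pagesize=50&sort=newest']

    rules = [
        # Follow pagination links and parse each listing page.
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=newest'),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # One summary block per question; the class names are assumptions.
        for question in response.xpath('//div[@class="question-summary"]'):
            item = StackItem()
            item['title'] = question.xpath(
                './/a[@class="question-hyperlink"]/text()').extract_first()
            item['url'] = question.xpath(
                './/a[@class="question-hyperlink"]/@href').extract_first()
            yield item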

ilovefood2 commented 5 years ago

scrapy crawl stack works fine, given that I added the correct header and DOWNLOAD_HANDLERS = {'s3': None}.
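For context, the settings tweak mentioned above would look roughly like this in the project's settings.py (a sketch; the user-agent and delay values are taken from the log above):

# stack/settings.py -- sketch of the settings mentioned in this comment.
BOT_NAME = 'stack'

SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'

# Send a browser-like User-Agent so Stack Overflow does not reject the requests.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64; rv:48.0) '
              'Gecko/20100101 Firefox/48.0')

# Be polite and avoid getting rate-limited.
DOWNLOAD_DELAY = 5

# Disable the S3 download handler so Scrapy does not try to import boto.
DOWNLOAD_HANDLERS = {'s3': None}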

Only stack_crawler doesn't work.
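One quick way to narrow this down is to test the crawler's selectors against the live page in scrapy shell; if the expressions come back empty, the markup has changed and the spider's XPaths (and possibly its link-extraction rule) need updating. The class names below are the same assumptions used in the sketch above:

# Run: scrapy shell "https://stackoverflow.com/questions?pagesize=50&sort=newest"
# Then check whether the spider's selectors still match anything on the page.
response.xpath('//div[@class="question-summary"]')          # question summary blocks
response.xpath('//a[@class="question-hyperlink"]/text()')   # question titles
# An empty SelectorList means the markup changed and the XPaths need updating.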

anhnguyenkim-agilityio commented 4 years ago

Oh, I got the same error and searched like crazy, but I can't resolve this issue. Can anyone help, please?