rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

The distributed scrapy crawler idles after requesting for a while, even though a list of requests remains in redis #152

Closed hewm closed 1 year ago

hewm commented 5 years ago

When I start the scrapy program, it runs normally. But after a few dozen minutes, the logs show that scrapy keeps going through the request middleware and getting a proxy, yet parse() is never called. In other words, scrapy may be fetching an empty url, or simply not getting a url at all; it just obtains a proxy and then nothing happens. I added a log line at the top of every parse method to print the url of the current request, and that log never appears: all the requests disappear after DOWNLOADER_MIDDLEWARES, and no response or error is ever reported. My log level is INFO.
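When requests enter DOWNLOADER_MIDDLEWARES but parse() is never reached, they are usually failing in the download stage (for example through a dead proxy) and the failure is being swallowed before it is logged at INFO. A minimal diagnostic sketch, not part of scrapy-redis (the class name DiagnosticMiddleware is hypothetical), can log every outcome so the missing step becomes visible:

```python
import logging

logger = logging.getLogger(__name__)

class DiagnosticMiddleware:
    """Hypothetical middleware: log each request's fate in the download stage."""

    def process_request(self, request, spider):
        logger.info("outgoing: %s (proxy=%s)", request.url, request.meta.get("proxy"))
        return None  # continue down the middleware chain

    def process_response(self, request, response, spider):
        logger.info("response %s for %s", response.status, request.url)
        return response

    def process_exception(self, request, exception, spider):
        # If requests die here (timeouts, refused proxy connections),
        # parse() is never called and no page is counted in logstats.
        logger.warning("failed: %s -> %r", request.url, exception)
        return None  # let the retry middleware handle it
```

Enabling it in DOWNLOADER_MIDDLEWARES with an order number after the proxy middleware would show whether the requests that never reach parse() are raising exceptions.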

I deployed several similar spiders (sharing much of the same crawler code) on one machine, and redis is on the same machine. After running for a while, even though the redis queue still contains requests, the scrapy crawler idles for long stretches.

The idling can last as long as five to six hours, after which it crawls normally for a while, then starts idling again.
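To confirm that requests really remain queued while the spider idles, the redis keys can be inspected directly. A sketch assuming the default scrapy-redis key names ("<spider>:requests" and "<spider>:dupefilter") and the default PriorityQueue, which stores requests in a sorted set (if you use the FIFO/LIFO list queues, use llen instead of zcard); the spider name is taken from the log below:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
spider = "dgk_update_detail"  # spider name as it appears in the log

# Pending requests in the scheduler queue (sorted set for PriorityQueue)
print("queued requests:", r.zcard(f"{spider}:requests"))
# Fingerprints already seen by the duplicate filter
print("seen fingerprints:", r.scard(f"{spider}:dupefilter"))
```

If the queued count stays high while the crawl rate sits at 0 pages/min, the scheduler is popping requests but the downloads never complete.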

The log is:

2019-08-16 14:50:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
2019-08-16 14:51:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
2019-08-16 14:52:09 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:12 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:16 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:19 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:23 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:26 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:29 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:33 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:52:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
2019-08-16 14:53:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
2019-08-16 14:54:16 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:19 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:23 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:26 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:30 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:33 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:36 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:40 [dgk_update_detail] INFO: [proxy] https://3.113.251.65:3128
2019-08-16 14:54:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
2019-08-16 14:55:57 [scrapy.extensions.logstats] INFO: Crawled 29795 pages (at 0 pages/min), scraped 28644 items (at 0 items/min)
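In this log the same proxy is fetched roughly every 3-4 seconds while the crawled/scraped counters stay flat, which is consistent with every attempt hanging or timing out through a dead proxy at a log level where the failures are invisible. As a hedged starting point (these are standard Scrapy settings; the values are illustrative, not a known fix for this issue):

```python
# settings.py sketch: surface hanging downloads instead of stalling silently
DOWNLOAD_TIMEOUT = 30   # fail fast on unresponsive proxies (Scrapy default is 180s)
RETRY_ENABLED = True
RETRY_TIMES = 2         # cap retries so a dead proxy does not loop indefinitely
LOG_LEVEL = "DEBUG"     # temporarily, to see retry/timeout messages hidden at INFO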