istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

How to get the number of crawled pages for a specific crawl request? #51

Closed: yrik closed this issue 8 years ago

yrik commented 8 years ago

Thank you for the hard work on this project. It seems the current stats per crawl are limited to the number of pending items only. Is there a way to get more advanced stats that include the number of successfully parsed pages and the number of failures? I need this per crawl request.

{u'server_time': 1458001694, u'crawlid': u'tc2', u'total_pending': 9160, u'total_domains': 1, u'spiderid': u'link', u'appid': u'apid2', u'domains': {u'dmoz.org': {u'low_priority': -29, u'high_priority': -19, u'total': 9160}}, u'uuid': u'd2afgh'}
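(For context, a response like the one above comes from feeding an info action to the Kafka monitor. The command below is an illustrative sketch using the appid/crawlid/uuid values shown above, not necessarily the exact request that produced this output.)

```
python kafka_monitor.py feed '{"action": "info", "appid": "apid2", "crawlid": "tc2", "spiderid": "link", "uuid": "d2afgh"}'
```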

madisonb commented 8 years ago

@yrik As of right now, no, there is no way to get statistics per crawl request. The reason is that if we stored statistics in Redis for each crawl request, like we do with the Stats API, we would generate a huge number of 200, 403, 404, and other response-code keys per request (especially if you are crawling with maxdepth >= 1). That just adds bloat, and your Redis db would fill up fairly quickly. The Stats API for the spiders covers your cluster overall, not individual crawl requests.

If you need this ability, and you need exact counts for every crawl request at a granular level, I would recommend using the generic Counter stats collector in your item pipeline and bumping the counter by one for each response code, for each crawlid. It will probably be the least amount of work to get what you need.
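For readers looking for a concrete starting point, here is a minimal sketch of such a pipeline. It assumes the scutils StatsCollector.get_counter API and that your items carry 'crawlid' and 'status_code' fields (as scrapy-cluster's RawResponseItem does); the pipeline class name, Redis key format, and settings lookups are illustrative only.

```python
# Illustrative item pipeline: bump a Counter per (crawlid, status_code).
# Key names and settings below are assumptions, not part of scrapy-cluster.
import redis
from scutils.stats_collector import StatsCollector


class CrawlidStatsPipeline(object):

    def __init__(self, redis_host, redis_port):
        self.redis_conn = redis.Redis(host=redis_host, port=redis_port)
        self.counters = {}  # cache: redis key -> Counter object

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.get('REDIS_HOST', 'localhost'),
                   settings.getint('REDIS_PORT', 6379))

    def process_item(self, item, spider):
        # one counter per crawlid per response code
        key = 'stats:crawlid:{c}:{s}'.format(c=item['crawlid'],
                                             s=item['status_code'])
        if key not in self.counters:
            self.counters[key] = StatsCollector.get_counter(
                redis_conn=self.redis_conn, key=key)
        self.counters[key].increment()
        return item
```

Reading a tally back is then a matter of calling value() on the counter (or inspecting the key in Redis directly), and the counter's window/roll behavior can be tuned per the scutils docs. Note that an item pipeline only sees responses that actually yield items, so failed requests would still need to be counted elsewhere, for example in a spider or downloader middleware.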

yrik commented 8 years ago

Thanks