istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Simple spider that sends keyword counts to some database. #96

Closed: py-in-the-sky closed this 7 years ago

py-in-the-sky commented 7 years ago

The database that the spider sends keyword counts to is assumed to be external to the Scrapy-Cluster crawling system.
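
As a rough sketch (not the PR's actual code), a keyword-counting spider in plain Scrapy might look like the following. In scrapy-cluster the spider would instead extend the project's Redis-based spider class, and the spider name, seed URL, and keyword list here are illustrative:

```python
import re

import scrapy


class KeywordCountSpider(scrapy.Spider):
    """Count keyword occurrences on each crawled page."""

    name = 'keyword_count'
    start_urls = ['http://example.com']      # illustrative seed
    keywords = ['redis', 'kafka', 'scrapy']  # illustrative keyword list

    def parse(self, response):
        text = response.text.lower()
        # Yield the counts as an item; an item pipeline would then
        # forward it to the external database.
        yield {
            'url': response.url,
            'counts': {kw: len(re.findall(re.escape(kw), text))
                       for kw in self.keywords},
        }
```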

madisonb commented 7 years ago

This is a good proof of concept for a custom project; however, this project aims to be as minimal as possible so that others can use and extend the framework to conduct their own crawling.

Do you intend the keyword spider to serve as an example? If so, can you suggest which parts of the docs need to be updated to include this new spider?

The unit tests failed; it appears that KeywordsItem does not exist. Did you forget to check it in?
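
For reference, a minimal `KeywordsItem` declaration would typically live in the project's items module, along these lines; the field names are guesses, since the actual item was never committed:

```python
import scrapy


class KeywordsItem(scrapy.Item):
    # Hypothetical fields; the PR's actual KeywordsItem was never committed.
    url = scrapy.Field()      # page the counts came from
    keyword = scrapy.Field()  # keyword that was counted
    count = scrapy.Field()    # number of occurrences on the page
```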

Overall, I think this still goes against the contributing guidelines here, which state "We are trying to build a generic framework for large scale, distributed web crawling." This appears to be a custom implementation for the specific task of counting keywords within pages.

If you think of this more as a tutorial, please add unit tests, documentation, and clarification for the above points.

py-in-the-sky commented 7 years ago

Sorry about this. This was supposed to go to my fork, and I'm not sure how this pull request ended up here.