istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Unclear documentation regarding outbound crawled data integration #155

Closed synergiator closed 6 years ago

synergiator commented 6 years ago

Maybe I am just overlooking what I am looking for in these otherwise excellent and comprehensive docs.

After working through the "your first crawl" / "typical usage" sections, I feel somewhat lost about how to integrate the outbound crawled data. It would be great to have a broader outline or a separate tutorial on that.

madisonb commented 6 years ago

1) Yes, you would rebuild the Docker container (or make your own) if you made a different crawler, deploy it or pull it down, and let it rejoin the cluster.
2) Given that all data goes out into Apache Kafka, you would use a stream processing framework like Flink or Storm, or even a script that can handle the volume of data you are reading (see the consumer sketch below). I would not recommend the rest component for this task.
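For item 2, here is a minimal sketch of the plain-script approach, assuming the kafka-python client, a broker at localhost:9092, and the demo.crawled_firehose output topic used in the docs; adjust all three to match your own settings:

```python
from kafka import KafkaConsumer
import json

# Connect to the cluster's output firehose. Broker address and topic
# name are the documented defaults, not required values.
consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    item = message.value
    # Each record is a JSON-serialized crawl result; hand it off to
    # your database, search index, or stream processor here.
    print(item.get("url"), item.get("status_code"))
```

A script like this is fine for modest volumes; once a single consumer can no longer keep up, that is the point to reach for Flink or Storm, which consume from the same topic.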

After your first crawl, it is really up to you what you would like to do with the project - this is just meant to be a starter for distributed crawling via Redis, Kafka, and Scrapy. Add your own pipelines, add your own crawlers, add your own middlewares, the sky is the limit!
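As a starting point for "add your own pipelines": a Scrapy item pipeline is just a class with a process_item method. A toy sketch (the class name and the processed_by field are hypothetical, added purely for illustration):

```python
# crawling/pipelines.py -- "crawling" is the crawler package in this repo
class TagItemPipeline(object):

    def process_item(self, item, spider):
        # Stamp each item before it continues down the pipeline chain
        # (and ultimately out to Kafka).
        item["processed_by"] = spider.name
        return item
```

Register it in the crawler's settings.py under ITEM_PIPELINES; Scrapy runs pipelines in ascending priority order, so give it a number lower than the existing Kafka output pipeline so your changes are applied before items are shipped out.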

If this answers your question please close this ticket.

madisonb commented 6 years ago

Closing due to inactivity. I think this project gives you a good foundation to go off and modify to fit your needs.