istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Unclear documentation regarding outbound crawled data integration #155

Closed synergiator closed 6 years ago

synergiator commented 6 years ago

Maybe I am just overlooking what I am looking for in these otherwise excellent and comprehensive docs.

After working through the "your first crawl" / "typical usage" sections, I feel somewhat lost about how to integrate the outbound crawled data. It would be great to have a broader outline or a separate tutorial on that.

madisonb commented 6 years ago

1) Yes, you would rebuild the Docker container (or make your own) if you made a different crawler, deploy it or pull it down, and let it rejoin the cluster.
2) Given that all data goes out into Apache Kafka, you would use a stream processing framework like Flink or Storm, or even a script that can handle the volume of data you are reading (see the consumer sketch below). I would not recommend the rest component for this task.
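For item 2, here is a minimal sketch of the plain-script approach, assuming the kafka-python client, a broker at localhost:9092, and the demo.crawled_firehose output topic used in the docs; adjust all three to match your own settings:

```python
from kafka import KafkaConsumer
import json

# Connect to the cluster's output firehose. Broker address and topic
# name are the documented defaults, not required values.
consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    item = message.value
    # Each record is a JSON-serialized crawl result; hand it off to
    # your database, search index, or stream processor here.
    print(item.get("url"), item.get("status_code"))
```

A script like this is fine for modest volumes; once a single consumer can no longer keep up, that is the point to reach for Flink or Storm, which consume from the same topic.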

After your first crawl, it is really up to you what you would like to do with the project - this is just meant to be a starter for distributed crawling via Redis, Kafka, and Scrapy. Add your own pipelines, add your own crawlers, add your own middlewares, the sky is the limit!
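As a starting point for "add your own pipelines": a Scrapy item pipeline is just a class with a process_item method. A toy sketch (the class name and the processed_by field are hypothetical, added purely for illustration):

```python
# crawling/pipelines.py -- "crawling" is the crawler package in this repo
class TagItemPipeline(object):

    def process_item(self, item, spider):
        # Stamp each item before it continues down the pipeline chain
        # (and ultimately out to Kafka).
        item["processed_by"] = spider.name
        return item
```

Register it in the crawler's settings.py under ITEM_PIPELINES; Scrapy runs pipelines in ascending priority order, so give it a number lower than the existing Kafka output pipeline so your changes are applied before items are shipped out.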

If this answers your question please close this ticket.

madisonb commented 6 years ago

Closing due to inactivity. I think this project gives you a good foundation to go off and modify to fit your needs.