istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars, 323 forks

Help processing pages returned by crawler #187

Closed (hellsingnorevy closed this issue 6 years ago)

hellsingnorevy commented 6 years ago

Hi there,

I've been using scrapy for a while to collect data so I can teach myself some big data analytics, and when I found out about scrapy-cluster I thought it would be great to scale up and get even more data.

I've been trying to get everything working for a couple of weeks now, but I can't seem to get the kafka-monitor plugins working.

The plugin is defined and loads up fine according to the logs. If I'm not mistaken, I should be able to use a plugin to consume the page returned by the crawler so I can store it in my database, but I haven't been able to process anything the crawler returns.

Do I have to use a scrapy pipeline to store the data in the database? Can't I use a plugin for that? If so, can anyone point me in the right direction? I can't seem to find any information on this in the documentation.

Thanks in advance!

madisonb commented 6 years ago

@hellsingnorevy the kafka monitor (in its default form) is used to ingest crawl requests only, as it listens on a specific kafka topic. If you wish to have crawl data processed as well, you should have a different instance of the kafka monitor and have it listen to the crawl firehose topic that the crawler generates.

You can use whatever you want to store data in a database; however, I would always recommend keeping your spiders lean and doing that kind of logic after you consume from the firehose topic.

Hopefully the architecture diagram here helps shed some light on what is going on. You might want to steal some code from the kafkadump.py file to read from a kafka topic, and then you can do whatever logic you please with the json object.
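As a rough illustration of that idea (not code taken from kafkadump.py), here is a minimal sketch of a standalone consumer that reads crawled pages from the firehose topic and writes them to a database. Everything named here is an assumption to adjust for your setup: the `kafka-python` client, the topic name `demo.crawled_firehose`, the broker address `localhost:9092`, the consumer group `page-store`, the SQLite file `pages.db`, and the `url`, `status_code`, and `body` fields of the json object. The helper names `init_db`, `store_page`, and `consume_firehose` are hypothetical.

```python
import json
import sqlite3


def init_db(conn):
    """Create the (hypothetical) pages table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT, status_code INTEGER, body TEXT)"
    )
    conn.commit()


def store_page(raw, conn):
    """Parse one raw firehose message and insert it as a row.

    Returns True if a row was stored, False if the message was skipped.
    The field names (url, status_code, body) are assumptions about the
    crawled item; check them against what your crawler actually emits.
    """
    try:
        item = json.loads(raw)
    except (ValueError, TypeError):
        return False  # not valid JSON, skip it
    if not isinstance(item, dict) or "url" not in item:
        return False  # not a crawled page as assumed here
    conn.execute(
        "INSERT INTO pages (url, status_code, body) VALUES (?, ?, ?)",
        (item["url"], item.get("status_code"), item.get("body")),
    )
    conn.commit()
    return True


def consume_firehose(topic="demo.crawled_firehose",
                     servers="localhost:9092",
                     db_path="pages.db"):
    """Read messages from the firehose topic forever and store each page.

    Topic, broker address, and group id are assumptions; use the values
    from your own cluster settings.
    """
    # Imported here so the storage helpers work without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic, bootstrap_servers=servers, group_id="page-store"
    )
    conn = sqlite3.connect(db_path)
    init_db(conn)
    try:
        for message in consumer:
            # message.value is the raw bytes published by the crawler
            store_page(message.value, conn)
    finally:
        conn.close()
        consumer.close()
```

Calling `consume_firehose()` in its own process keeps the spiders lean: the crawler only publishes to Kafka, and all database logic lives in this separate consumer, which you can restart or scale on its own.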

Given that this is a "personal setup" question, do you mind if I close this issue and direct you to our Gitter chat room? That place is great for custom setups, and I try to keep this issue log for features, bugs, PRs, etc.

hellsingnorevy commented 6 years ago

Thank you for the prompt response. I just joined the Gitter room and will have a look at the diagram and then ask around in the chat :)

You can close this issue and we will see each other around gitter!

Thanks a lot <3