istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

scrapy-cluster doesn't scroll the entire pages #258

Closed RochdiBoudokhane closed 3 years ago

RochdiBoudokhane commented 3 years ago

Hello there,

  1. I am using scrapy-cluster to scrape specific items and page through entire sites (~20 pages), but when I run the spider it only gets through 8, 9, or 10 pages. Do you have any suggestions on how to solve this?
  2. Where are the scraped items stored, and is the data persisted or not?
  3. How can I use scrapy-cluster to crawl a lot of websites in real time, crawling only the new links?

Thanks

madisonb commented 3 years ago

Hello,

This project provides very basic, bare-bones spiders and is mostly a tasking framework for enabling you to build your own spiders. I can't answer question 1 specifically because there is no scripted scrolling in the link spider; you would need to build that yourself.

Items are stored in Kafka per the documentation, and if you read through the docs I hope it becomes clear how you can conduct large-scale crawls via the Kafka Monitor API.
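
For reference, a crawl request is just a JSON message fed to the Kafka Monitor's incoming topic. Here is a minimal sketch with kafka-python, assuming the default demo.incoming topic and a broker on localhost:9092 (the kafka_monitor.py feed utility shown in the docs accomplishes the same thing from the command line):

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumes a broker on localhost:9092 and the default incoming topic
# ("demo.incoming"); adjust both to match your cluster settings.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A crawl request for the Kafka Monitor: url, appid, and crawlid are the
# core fields, and "link" is the bundled demo spider.
request = {
    "url": "http://example.com",
    "appid": "testapp",
    "crawlid": "abc1234",
    "spiderid": "link",
}

producer.send("demo.incoming", request)
producer.flush()
```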

Please close the ticket if this is not related to any actual bugs within the framework, as personal projects should use the Gitter chat for support.

RochdiBoudokhane commented 3 years ago

Thanks for your answers; honestly, this is still not clear to me:

  1. Can you explain in detail how I can view the items scraped from demo.outbound_firehose? This is not clear in the docs (is python kafkadump.py dump -t demo.crawled_firehose the command used to display the data?).
  2. Please answer this specific question: how can I run this project in real time to scrape new items?
  3. What is the role of Redis in this project, and what data is stored in it?

I'm very sorry for the inconvenience, and thank you for answering each question.

madisonb commented 3 years ago
  1. The data is stored in Kafka. You can certainly leverage the kafkadump utility, but I would recommend running an independent process to take the items out of Kafka and put them in a place you desire (see the sketch after this list).
  2. The docker or cluster quickstart can show you how to run your spiders. There is a sample docker compose project that you can run to get everything going.
  3. There is a detailed section in the documentation about what is stored in Redis. The short version is that it is a temporary orchestration cache that lets the spiders communicate with each other. There are various keys for the crawl queue, blacklists, domain throttles, health checks, etc. Redis is the backbone of the project.
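
As a rough illustration of point 1, here is a minimal consumer sketch with kafka-python, assuming the default demo.crawled_firehose topic and a local broker; the JSON-lines file is just a stand-in for whatever storage backend you actually want:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Assumes the default firehose topic ("demo.crawled_firehose") and a
# broker on localhost:9092; swap in your own topic and storage backend.
consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers="localhost:9092",
    group_id="item-archiver",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Append each crawled item to a JSON-lines file; replace this loop body
# with inserts into whatever database you prefer.
with open("crawled_items.jsonl", "a") as out:
    for message in consumer:
        out.write(json.dumps(message.value) + "\n")
```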

RochdiBoudokhane commented 3 years ago

Thanks for your answers. I want to store the scraped items in Redis; do you have any idea how to do that? I'm very sorry for the inconvenience, thank you.

madisonb commented 3 years ago

I personally would not store all the scraped data in redis, since it can be pretty large and redis is an in-memory database. But, I would figure out the data structure you want from here and leverage something like redis-py to build it.
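
If you do go that route, here is a minimal redis-py sketch, assuming a local Redis instance; the items:<crawlid> key pattern is made up for illustration and is not something Scrapy Cluster defines:

```python
import json

import redis  # pip install redis

# Assumes Redis on localhost:6379; the "items:<crawlid>" key pattern is
# purely illustrative, not part of Scrapy Cluster itself.
r = redis.Redis(host="localhost", port=6379, db=0)

def store_item(item):
    """Push one scraped item onto a per-crawl list and cap its length."""
    key = "items:{}".format(item.get("crawlid", "unknown"))
    r.rpush(key, json.dumps(item))
    # Keep only the most recent 10,000 items so memory stays bounded.
    r.ltrim(key, -10000, -1)

store_item({"crawlid": "abc1234", "url": "http://example.com"})
```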

Please close this ticket, since this is not a direct bug report or issue with the Scrapy Cluster project itself.

RochdiBoudokhane commented 3 years ago

Thanks for your answers. Just one last question (I will close the ticket afterwards): how can I access all of the scraped items? When I run "python kafkadump.py dump -t demo.crawled_firehose -p" I only see the newly scraped items. I'm very sorry for the inconvenience, thank you.

madisonb commented 3 years ago

Kafka is a streaming publish/subscribe system, not a true database. You can rewind the topic within its retention policy, or dump the data off of the topic into a database of your choosing. This project publishes data; it is up to you where you ultimately want to store it.
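
As a rough sketch of rewinding, here is how you could re-read the topic from the beginning with kafka-python, assuming the default demo.crawled_firehose topic with a single partition and a local broker; anything older than the retention window is already gone:

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

# Assumes a broker on localhost:9092 and that the topic has a single
# partition (0); a multi-partition topic needs one seek per partition.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="replay-crawled-items",
)

partition = TopicPartition("demo.crawled_firehose", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)

# Prints every message Kafka still retains, oldest first.
for message in consumer:
    print(message.value.decode("utf-8"))
```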