istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Is it necessary to override the spider's start_requests method #171

Closed · newthis closed this issue 6 years ago

newthis commented 6 years ago

Is it necessary to override the spider's start_requests method? I have read the code of RedisSpider and cannot find a start_requests method. I want to know how the initial seed URLs are obtained, and I would like to read the relevant source code.

madisonb commented 6 years ago

start_requests is eliminated on purpose: it ensures that spinning up multiple spiders does not create the same crawl jobs over and over again. In this project it is considered bad practice to have them, so there are no start_requests methods by design.

Please view the API docs here for more information about how to submit crawls to your cluster.
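For illustration, a crawl request is just a JSON message fed into the cluster's incoming Kafka topic, either through the kafka_monitor.py feed utility or any Kafka producer. Here is a minimal sketch using kafka-python, assuming the default demo.incoming topic and the field names shown in the Crawl API docs (adjust both to your deployment):

```python
import json

from kafka import KafkaProducer

# Broker address and topic name are assumptions taken from the
# example settings; change them to match your cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A hypothetical crawl request: seed URL, application id, and a
# unique crawl id, following the Crawl API documentation.
crawl_request = {
    "url": "http://example.com",
    "appid": "testapp",
    "crawlid": "abc1234",
}

producer.send("demo.incoming", crawl_request)
producer.flush()
```

Every spider attached to the cluster then pulls work for that crawlid out of Redis, which is why per-spider start_requests would only duplicate jobs.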

Closing as per community issue guidelines.

newthis commented 6 years ago

I know about that usage. What I wonder is how the Scrapy spider periodically gets crawl tasks from the message queue, and where the relevant code lives in the scrapy-cluster project. According to the Scrapy framework, every spider's start_requests is visited in the crawler's crawl method (https://github.com/scrapy/scrapy/blob/108f8c4fd20a47bb94e010e9c6296f7ed9fdb2bd/scrapy/crawler.py), so if you call "scrapy runspider RedisSpider", surely it follows that logic.

Do I need to modify the Scrapy code?
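No change to Scrapy itself is needed for this pattern. Scrapy Cluster swaps in its own scheduler via the SCHEDULER setting, and the engine's crawl loop polls the scheduler's next_request() continuously, so requests can come out of Redis instead of out of start_requests. A rough sketch of the idea, illustrative only and not the project's actual distributed scheduler (connection details and the queue key name are made up):

```python
import redis
from scrapy.http import Request


class RedisBackedScheduler(object):
    """Simplified stand-in for a Redis-backed scheduler.

    Scrapy's engine calls next_request() on every tick of its loop,
    so a scheduler that polls Redis here can feed the spider without
    any start_requests method on the spider itself.
    """

    def __init__(self, server, queue_key):
        self.server = server        # redis connection
        self.queue_key = queue_key  # Redis list holding queued URLs

    @classmethod
    def from_crawler(cls, crawler):
        # Hardcoded connection and key name for the sketch only.
        server = redis.StrictRedis(host="localhost", port=6379)
        return cls(server, "demo:queue")

    def open(self, spider):
        self.spider = spider

    def close(self, reason):
        pass

    def enqueue_request(self, request):
        # Requests discovered while crawling go back into Redis
        # instead of an in-memory queue, so any spider can pick them up.
        self.server.lpush(self.queue_key, request.url)
        return True

    def next_request(self):
        # Polled continuously by the engine; returning None just means
        # "nothing queued right now", it does not stop the crawl.
        url = self.server.lpop(self.queue_key)
        if url is None:
            return None
        return Request(url=url.decode("utf-8"))

    def has_pending_requests(self):
        # Always report pending work so the engine does not shut the
        # spider down while the Redis queue is momentarily empty; the
        # real project handles idling via the spider_idle signal instead.
        return True
```

Enabling such a scheduler is just a settings entry (e.g. SCHEDULER = "myproject.scheduler.RedisBackedScheduler", a hypothetical path for this sketch). scrapy-cluster ships its own scheduler configured the same way, so the stock crawl loop and start_requests logic never need to be touched.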

newthis commented 6 years ago

Thank you for your answer !