Check url already had to decrease crawling time

scrapinghub / portia

Visual scraping for Scrapy


Check url already had to decrease crawling time #642

Status: Closed (closed by vokhuongcse0604 7 years ago)

vokhuongcse0604 commented 7 years ago

I want to pick up only the newest articles on a news site. I set DEPTH_LIMIT = 2 and DEPTH_PRIORITY = 1, but crawling still takes a long time. I would like to check each URL against the ones already stored in a database (Redis). If a URL is already there, I do not want to follow its child links. I think this would shorten the crawl. How can I implement it?

ruairif commented 7 years ago

You can create a middleware that saves URLs to Redis and, before a request is scheduled, checks whether its URL is already in Redis. If it is, the request is not loaded. You can find more information about how to implement this here
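
For illustration, the suggestion above could be sketched as a Scrapy spider middleware along the following lines. This is only a minimal sketch under assumptions, not Portia's own implementation: the class name `RedisSeenUrlMiddleware`, the Redis key `portia:seen_urls`, and the setting name `SEEN_URLS_REDIS_URL` are hypothetical, and it assumes the `redis` Python client is installed. It relies on the Redis `SADD` command returning 0 when the member is already in the set.

```python
class RedisSeenUrlMiddleware:
    """Spider middleware sketch: drop requests whose URL is already recorded in Redis."""

    def __init__(self, client, request_cls, key="portia:seen_urls"):
        self.client = client            # any object exposing sadd(key, value)
        self.request_cls = request_cls  # normally scrapy.Request
        self.key = key                  # hypothetical key name for the set of seen URLs

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the class can be exercised without Scrapy or redis installed.
        import redis
        from scrapy import Request

        # SEEN_URLS_REDIS_URL is a made-up setting name used only in this sketch.
        url = crawler.settings.get("SEEN_URLS_REDIS_URL", "redis://localhost:6379/0")
        return cls(redis.Redis.from_url(url), Request)

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, self.request_cls):
                # SADD returns 1 when the URL is new and 0 when it was already stored,
                # so a single call both checks and records the URL.
                if not self.client.sadd(self.key, obj.url):
                    continue  # already seen: do not schedule it, so its children are never reached
            yield obj  # items and unseen requests pass through unchanged
```

A class like this would be enabled through the `SPIDER_MIDDLEWARES` setting. Because start requests do not pass through `process_spider_output`, the start pages are still fetched on every run, and only links that were already recorded in an earlier run are skipped.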