I want to pick up only the newest articles on a news site. I set DEPTH_LIMIT = 2 and DEPTH_PRIORITY = 1, but the crawl still takes a long time. I would like to check each URL against the URLs already stored in my database (Redis): if a URL is already there, the crawler should not follow its child links. I think that would shorten the crawl. How can I implement this?
You can write a middleware that saves each crawled URL to Redis and, before a request is loaded, checks whether its URL is already stored there. If it is, the request is dropped, so the page is never downloaded and its child links are never followed. You can find more information about how to implement this here
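As a rough illustration of that idea, here is a minimal sketch written as a Scrapy downloader middleware, so the check runs just before a page would be downloaded. The class name `SeenUrlMiddleware`, the Redis set key `seen_urls`, and the `REDIS_URL` setting are assumptions made up for this example, not something from the question. It also assumes that start pages (depth 0) should always be re-fetched, since otherwise a second run would skip the front page and never discover new articles.

```python
try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    class IgnoreRequest(Exception):
        """Stand-in so the sketch can be read and tried without Scrapy."""


class SeenUrlMiddleware:
    """Downloader middleware that skips URLs already recorded in a Redis set."""

    def __init__(self, client, key="seen_urls"):
        # `client` is anything with redis-py's sismember/sadd methods.
        # "seen_urls" is a hypothetical key name chosen for this sketch.
        self.client = client
        self.key = key

    @classmethod
    def from_crawler(cls, crawler):
        import redis  # redis-py, imported lazily so the class is easy to test

        # REDIS_URL is an assumed setting name; adjust to your project.
        url = crawler.settings.get("REDIS_URL", "redis://localhost:6379/0")
        return cls(redis.Redis.from_url(url))

    def process_request(self, request, spider):
        # Start requests carry no "depth" yet; always fetch them so that
        # newly published links on the entry pages can be discovered.
        if request.meta.get("depth", 0) == 0:
            return None
        if self.client.sismember(self.key, request.url):
            # Dropping the request means its child links are never followed.
            raise IgnoreRequest(f"already crawled: {request.url}")
        return None

    def process_response(self, request, response, spider):
        # Record the URL only after a successful download.
        if response.status == 200:
            self.client.sadd(self.key, request.url)
        return response
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` in settings.py, for example `{"myproject.middlewares.SeenUrlMiddleware": 543}` (the module path and priority are placeholders). Note that pages at depth 1 are also skipped once stored, so if new articles only appear on section pages at that depth, you may want to exempt those pages as well.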