rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

Is there a way to stop the spider from checking duplicates with Redis? #242

Open milkeasd opened 2 years ago

milkeasd commented 2 years ago

My spider is extremely slow when run with scrapy-redis because there is a large delay between the slave and the master. I want to reduce the communication to only fetching the start_urls, either periodically or once all current start_urls are done. Is there any way to do so?

Moreover, I want to stop the duplicate check to reduce the number of connections.

But I can't change DUPEFILTER_CLASS to the Scrapy default one; it raises an error.

Is there any other way to stop the duplicate check?

Or any ideas that could help speed up the process?

Thanks

LuckyPigeon commented 2 years ago

@Germey Any ideas?

LuckyPigeon commented 2 years ago

@milkeasd Could you provide related code files?

LuckyPigeon commented 2 years ago

The way I see it, letting developers customize their communication rules and adding an option to disable DUPEFILTER_CLASS could be two great features.

LuckyPigeon commented 2 years ago

@milkeasd To disable DUPEFILTER_CLASS, try this: https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url

Germey commented 2 years ago

@milkeasd Could you please provide your code or some sample code?

sify21 commented 4 months ago

@LuckyPigeon It doesn't work. Setting DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter" raises this error:

builtins.AttributeError: type object 'BaseDupeFilter' has no attribute 'from_spider'

Maybe there should be a custom BaseDupeFilter in scrapy-redis like RFPDupeFilter: https://github.com/rmax/scrapy-redis/blob/48a7a8921ae064fe7b4202b130f1054ede9103d6/src/scrapy_redis/dupefilter.py#L128

From Scrapy's docs: https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class

You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
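A minimal sketch of the custom filter suggested above, as a workaround until something like it exists in scrapy-redis. It assumes the scrapy-redis scheduler builds the filter through `from_spider(spider)` (the call named in the traceback) and otherwise only uses `request_seen`, `log`, `clear`, and `close`; check this against the scheduler in your installed version. The class name `NoDupeFilter` and the module path are hypothetical and not part of scrapy-redis.

```python
# dupefilters.py (hypothetical module in your own project)


class NoDupeFilter:
    """Duplicate filter that never reports a request as seen.

    Assumption: the scheduler duck-types the filter, so no Scrapy or
    scrapy-redis base class is required here.
    """

    @classmethod
    def from_spider(cls, spider):
        # No Redis connection is needed, so the spider is ignored.
        return cls()

    @classmethod
    def from_crawler(cls, crawler):
        # Provided in case the filter is built the way Scrapy itself does it.
        return cls()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # Always "not seen": nothing is dropped and no Redis call is made.
        return False

    def open(self):
        pass

    def close(self, reason=""):
        pass

    def clear(self):
        # Nothing is stored, so there is nothing to clear.
        pass

    def log(self, request, spider):
        # Never called in practice, because no request is ever filtered.
        pass
```

It would then be enabled with `DUPEFILTER_CLASS = "myproject.dupefilters.NoDupeFilter"` in settings.py, where `myproject` stands for your own package. The warning from the Scrapy docs still applies: with no filtering at all, the crawl can loop. The request queue also stays in Redis, so this only removes the duplicate-check round trips.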

HairlessVillager commented 3 months ago

Hi, everyone! I've made a small change in scrapy_redis.scheduler.Scheduler, which may be helpful for this issue. Feel free to use it and comment.🥰