milkeasd opened this issue 2 years ago
@Germey Any ideas?
@milkeasd Could you provide related code files?
The way I see it, letting developers customize their communication rules and adding a disable option for DUPEFILTER_CLASS could be two great features.
@milkeasd To disable DUPEFILTER_CLASS, try this: https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url
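For reference, the linked answer boils down to a one-line settings change (a sketch of that approach; whether it works under scrapy-redis is exactly what's reported below):

```python
# settings.py -- swap in Scrapy's no-op base filter so no request
# is ever marked as seen (plain Scrapy only; see the error below).
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
```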
@milkeasd could you please provide your code or make some sample code?
@LuckyPigeon It doesn't work. Setting DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter" reports this error:
builtins.AttributeError: type object 'BaseDupeFilter' has no attribute 'from_spider'
Maybe there should be a custom BaseDupeFilter in scrapy-redis, like RFPDupeFilter:
https://github.com/rmax/scrapy-redis/blob/48a7a8921ae064fe7b4202b130f1054ede9103d6/src/scrapy_redis/dupefilter.py#L128
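Until scrapy-redis ships one, a minimal sketch of such a pass-through filter (the NoopDupeFilter name and module path are hypothetical): it only adds the from_spider hook that scrapy-redis's Scheduler calls.

```python
# myproject/dupefilters.py (hypothetical path)
from scrapy.dupefilters import BaseDupeFilter


class NoopDupeFilter(BaseDupeFilter):
    """Pass-through filter: the inherited request_seen() always
    returns False, so no request is ever deduplicated."""

    @classmethod
    def from_spider(cls, spider):
        # scrapy-redis's Scheduler builds its dupefilter via
        # from_spider(); plain BaseDupeFilter lacks that method,
        # which is the AttributeError above.
        return cls()
```

Then point the setting at it, e.g. DUPEFILTER_CLASS = "myproject.dupefilters.NoopDupeFilter".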
From Scrapy's docs: https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class
You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
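A minimal sketch of that recommendation (the spider name and URL are placeholders): the first request goes through the dupefilter as usual, while the duplicate is let through with dont_filter=True.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder

    def start_requests(self):
        url = "https://example.com/page"  # placeholder
        yield scrapy.Request(url, callback=self.parse)
        # Same URL again: without dont_filter=True the scheduler
        # would drop it as a duplicate.
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```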
Hi, everyone! I've made a little change in scrapy_redis.scheduler.Scheduler, which may be helpful for this issue. Feel free to use and comment. 🥰
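The patch itself isn't quoted in the thread, so purely as an illustration of what a scheduler-level bypass could look like (NoDupeScheduler and the module path are hypothetical, and this may well differ from the commenter's actual change):

```python
# myproject/scheduler.py (hypothetical path)
from scrapy_redis.scheduler import Scheduler


class NoDupeScheduler(Scheduler):
    def enqueue_request(self, request):
        # Upstream enqueue_request consults self.df (the dupefilter)
        # before pushing; here every request goes straight to Redis.
        if self.stats:
            self.stats.inc_value(
                "scheduler/enqueued/redis", spider=self.spider
            )
        self.queue.push(request)
        return True
```

Enable it with SCHEDULER = "myproject.scheduler.NoDupeScheduler" in settings.py.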
My spider was extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to only fetching the start_urls periodically, or when all start_urls are done. Is there any way to do so?
Moreover, I want to stop the duplication check to reduce the number of connections.
But I can't change DUPEFILTER_CLASS to Scrapy's default one; it raises an error.
Is there any other way to stop the duplicate check?
Or any ideas that could help speed up the process?
Thanks