Open rmax opened 8 years ago
This could be a dupefilter class.
A dupefilter based on a bloom filter can be dangerous because some requests may be incorrectly dropped: a bloom filter can only be 100% trusted when it says the request is not seen.
@kmike, hi. Why isn't "100% trusted when it says the request is not seen" enough?
@rafaelcapucho
@kmike Thank you,
Do we need to use both "request is not seen" and "request is seen" to decide whether to process a request? Couldn't we process a request only when the filter says it is not seen?
Please, tell me if I'm wrong, thx
@rafaelcapucho I mean that Scrapy asks dupefilter a question: "is this request seen?". There are two possible answers:
When a Bloom filter says "request is not seen", the request is truly not seen. Because the request is new, the Scrapy spider goes and downloads the page; it can do this with confidence.
When a Bloom filter says "request is seen", Scrapy should drop the request and avoid downloading it. This is the main and only purpose of a dupefilter: detect seen requests and avoid processing them. The problem is that when a Bloom filter says "request is seen", there is some probability that the request was not actually seen before and the filter made a mistake (a false positive). It means Scrapy can drop innocent requests if a Bloom filter is used for duplicate checks.
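To make the two answers concrete, here is a minimal sketch of a toy Bloom filter. It is not Scrapy or scrapy-redis code: `TinyBloomFilter`, `count_wrongly_dropped`, and the example URLs are made up for illustration, and the filter is deliberately far too small so the mistakes are easy to see. It shows that URLs already added are always reported as seen, while some URLs that were never added are reported as seen too.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter with m bits and k hash positions per item (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # "Seen" only if every one of the k bits is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def count_wrongly_dropped(bloom, seen_urls, new_urls):
    """Add seen_urls, then count how many never-added URLs are reported as seen."""
    for url in seen_urls:
        bloom.add(url)
    # A "not seen" answer is always right: nothing that was added is ever missed.
    assert all(url in bloom for url in seen_urls)
    return sum(1 for url in new_urls if url in bloom)


seen = [f"http://example.com/page/{n}" for n in range(200)]
unseen = [f"http://example.com/new/{n}" for n in range(100)]

# 64 bits for 200 URLs is intentionally undersized to force false positives.
dropped = count_wrongly_dropped(TinyBloomFilter(64, 3), seen, unseen)
print(f"{dropped} of {len(unseen)} never-seen URLs would be dropped")
```

A properly sized filter makes such mistakes rare, but the rate only reaches zero with an exact structure such as a set.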
@kmike Thank you, now I understood the problem :)
@kmike good point!
Seems we don't need a Bloom filter in our case: SADD from Redis already gives us O(1) speed.
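A rough sketch of the set-based check, for comparison. SADD returns the number of members that were actually added, so a result of 0 means the fingerprint was already in the set. To keep the example runnable without a Redis server, `InMemorySetServer` below is a hypothetical stand-in that only mimics that return value, and `fingerprint` and `request_seen` are illustrative helpers, not necessarily how scrapy-redis implements them.

```python
import hashlib


class InMemorySetServer:
    """Hypothetical stand-in for a Redis client; only mimics SADD's return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present, nothing added
        members.add(member)
        return 1  # newly added


def fingerprint(url):
    # Illustrative fingerprint: a SHA-1 hex digest of the URL alone.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def request_seen(server, key, url):
    """Return True if the URL was seen before; record it otherwise."""
    added = server.sadd(key, fingerprint(url))
    return added == 0


server = InMemorySetServer()
print(request_seen(server, "dupefilter", "http://example.com/a"))  # first time
print(request_seen(server, "dupefilter", "http://example.com/a"))  # repeat
```

Unlike a Bloom filter, the set is exact, so a "seen" answer is never a false positive; the trade-off is that memory grows with the number of stored fingerprints.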
From https://github.com/rolando/scrapy-redis/issues/37#issuecomment-193811100