Add example integration with pyrebloom

rmax / scrapy-redis

Redis-based components for Scrapy.

http://scrapy-redis.readthedocs.io

MIT License


Add example integration with pyrebloom #56

Open rmax opened 8 years ago

rmax commented 8 years ago

From https://github.com/rolando/scrapy-redis/issues/37#issuecomment-193811100

rmax commented 8 years ago

This could be a dupefilter class.

kmike commented 8 years ago

A dupefilter based on a bloom filter can be dangerous because some requests may be incorrectly dropped: a bloom filter can only be 100% trusted when it says the request is not seen.

rafaelcapucho commented 8 years ago

@kmike, hi. Why isn't "100% trusted when it says the request is not seen" enough?

kmike commented 8 years ago

@rafaelcapucho

request is not seen -> process it and add it to the seen requests. This always works properly, so no request is processed more than once.
request is seen -> drop the request. Sometimes a Bloom filter may say that a request is seen when it is not, so some requests can be incorrectly dropped.

rafaelcapucho commented 8 years ago

@kmike Thank you,

Do we need to use both "request is not seen" and "request is seen" to decide whether to process a request? Then we could process a request only when it is reported as not seen.

Please tell me if I'm wrong, thx

kmike commented 8 years ago

@rafaelcapucho I mean that Scrapy asks the dupefilter a question: "is this request seen?". There are two possible answers:

yep, request is seen;
nope, request is not seen.

When a Bloom filter says "request is not seen", the request is truly not seen. Because the request is new, the Scrapy spider goes and downloads the page; it can do this with confidence.

When a Bloom filter says "request is seen", Scrapy should drop the request and avoid downloading it. This is the main and only purpose of a dupefilter: to detect seen requests and avoid processing them. The problem is that when a Bloom filter says "request is seen", there is some probability that the request was not seen before and the filter made a mistake. It means Scrapy can drop innocent requests if a Bloom filter is used for duplicate checks.
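
To make the asymmetry concrete, here is a minimal sketch of a toy Bloom filter in pure Python. It is not pyrebloom and not anything from scrapy-redis; the `TinyBloom` class, its deliberately tiny size (64 bits, 3 hash positions) and the example URLs are all assumptions chosen so that mistakes show up quickly.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, m_bits=64, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, item):
        # Derive k bit positions by salting SHA-1 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # "Seen" only if every one of the k bits is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bloom = TinyBloom()
seen_urls = [f"http://example.com/page/{n}" for n in range(50)]
for url in seen_urls:
    bloom.add(url)

# Added URLs that the filter reports as "not seen" (false negatives).
false_negatives = [u for u in seen_urls if u not in bloom]

# URLs that were never added but are still reported as "seen";
# a dupefilter built on this filter would drop them.
new_urls = [f"http://example.com/new/{n}" for n in range(1000)]
wrongly_dropped = [u for u in new_urls if u in bloom]

print(len(false_negatives), len(wrongly_dropped))
```

The first list is always empty, while the second is not: with a filter this overloaded, a large share of brand-new URLs would be dropped. A realistically sized filter makes this much rarer, but not impossible.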

rafaelcapucho commented 8 years ago

@kmike Thank you, now I understand the problem :)

rmax commented 8 years ago

@kmike good point!

LuckyPigeon commented 1 year ago

It seems we don't need a Bloom filter in our case; SADD from Redis already gives us O(1) speed.
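
As a rough sketch of the SADD approach: SADD returns the number of members that were actually added, so a return value of 0 means the fingerprint was already in the set. The `InMemorySetClient` below is a hypothetical stand-in that only mimics that return value so the example runs without a Redis server, and `SetDupeFilter` with its URL-only SHA-1 fingerprint is a simplification for illustration, not the project's actual class.

```python
import hashlib


class InMemorySetClient:
    """Hypothetical stand-in for a Redis client; only mimics SADD."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Like Redis SADD: return how many values were newly added.
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before


class SetDupeFilter:
    """Exact duplicate check backed by a set (illustrative sketch)."""

    def __init__(self, server, key="dupefilter:demo"):
        self.server = server
        self.key = key  # assumed key name, not a project default

    def request_seen(self, url):
        # Simplified fingerprint: hash of the URL only (an assumption).
        fp = hashlib.sha1(url.encode()).hexdigest()
        added = self.server.sadd(self.key, fp)
        return added == 0  # nothing added means it was already there


dupefilter = SetDupeFilter(InMemorySetClient())
first = dupefilter.request_seen("http://example.com/a")
second = dupefilter.request_seen("http://example.com/a")
other = dupefilter.request_seen("http://example.com/b")
print(first, second, other)
```

Because the set is exact, a new request is never reported as seen, which avoids the incorrectly dropped requests discussed above.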