Closed leopeng1995 closed 7 years ago
What exactly does this relate to? A Bloom filter, as I understand it, is a probabilistic data structure that doesn't always guarantee true positives. Why do we want to use this, and in what context?
I also note that Redis already has a HyperLogLog that may serve a similar purpose for whatever you are suggesting here.
Yes, it doesn't guarantee true positives. It saves space when storing URLs for large-scale websites and speeds up the duplicate-URL check. If the Bloom filter returns false, the URL definitely hasn't been crawled. If it returns true, that doesn't mean the URL has been crawled, so we then check that URL more thoroughly.
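To make the two-stage check described above concrete, here is a minimal sketch in plain Python. It is not this project's code: the names `BloomFilter`, `UrlDeduper`, `mark_crawled`, and `already_crawled` are hypothetical, the bit-array size and hash count are arbitrary, and the in-memory `set` is only a stand-in for whatever slower, authoritative store would do the "deeper" check.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter backed by a bytearray (sizes are arbitrary)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely never added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class UrlDeduper:
    """Hypothetical two-stage check: cheap Bloom filter, then an exact lookup."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.seen_exact = set()  # assumption: stand-in for a slower exact store

    def mark_crawled(self, url):
        self.bloom.add(url)
        self.seen_exact.add(url)

    def already_crawled(self, url):
        if not self.bloom.might_contain(url):
            return False  # definite answer, no exact lookup needed
        return url in self.seen_exact  # confirm a possible false positive
```

With this shape, most never-seen URLs are rejected by the filter alone, and the exact store is only consulted when the filter answers "possibly seen".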
Okay, I will learn about the concept of HyperLogLog in Redis. Thx.
@leopeng1995 I guess my question then is: why would I want a deduplication filter that doesn't guarantee it will always be correct? For really large crawls you risk crawling millions of URLs more than you need to, because the probabilistic data structure said the URL wasn't in there when in reality we had already seen it.
This puts extra load on the cluster, and can cause data duplication or processing problems. If you crawl a page at time t0, the Bloom filter then says we haven't crawled it yet, and the page changes before you crawl it again at time t1, you now have two different results because of a filtering error.
You then require extra processing, time, and validation of the data you receive from your crawlers. If that is worth it to you, I think the Bloom filter could fit right into the deduplication logic, but from what I understand that is a custom implementation or desired functionality, and not one this project supports. I don't see a reason to change the core "out of the box" filtering when I can't guarantee the crawls are always valid or completely controlled.
@madisonb Thanks for your reply. :-) I think this feature should be added as a custom plugin. I will consider implementing it.
We can use the "bit" concept of Redis (bitmaps) to construct a Bloom filter. For example (from another one):