istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Bloom Filter based on Redis #105

Closed leopeng1995 closed 7 years ago

leopeng1995 commented 7 years ago

We can use Redis bit operations (SETBIT/GETBIT) to construct a Bloom Filter. For example (adapted from another project):

# encoding=utf-8

import redis
from hashlib import md5

class SimpleHash(object):
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        # Simple multiplicative string hash, masked down to the bitmap size.
        ret = 0
        for char in value:
            ret += self.seed * ret + ord(char)
        return (self.cap - 1) & ret

class BloomFilter(object):
    def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilter'):
        """
        :param host: the host of Redis
        :param port: the port of Redis
        :param db: which db in Redis to use
        :param blockNum: each block holds roughly 90,000,000 entries; increase it if you have more strings to filter
        :param key: the key's name in Redis
        """
        self.server = redis.Redis(host=host, port=port, db=db)
        self.bit_size = 1 << 31  # 2**31 bits = 256 MB per block (Redis strings are capped at 512 MB)
        self.seeds = [5, 7, 11, 13, 31, 37, 61]  # one hash function per seed
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    def isContains(self, str_input):
        # Check whether str_input might have been inserted; False means definitely not seen.
        if not str_input:
            return False
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        str_input = m5.hexdigest()
        ret = True
        # Pick a block based on the first two hex chars of the digest.
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return bool(ret)

    def insert(self, str_input):
        # Set one bit per hash function in the block chosen for this digest.
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)

if __name__ == '__main__':
    bf = BloomFilter()
    if bf.isContains('http://www.google.com'):
        print('exists!')
    else:
        print('not exists!')
        bf.insert('http://www.google.com')
madisonb commented 7 years ago

What exactly does this relate to? A bloom filter, as I understand it, is a probabilistic data structure whose positive answers are not guaranteed to be correct. Why do we want to use this, and in what context?

I also note that Redis already has a HyperLogLog that may serve a similar purpose for whatever you are suggesting here.
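
For reference, a minimal sketch of the HyperLogLog commands exposed by redis-py (the key name crawled_urls:hll is only an illustration); note that a HyperLogLog gives an approximate count of distinct items rather than a membership test:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# PFADD returns 1 if the approximate cardinality changed, 0 otherwise.
r.pfadd('crawled_urls:hll', 'http://www.google.com')
r.pfadd('crawled_urls:hll', 'http://www.example.com')

# PFCOUNT returns the approximate number of distinct items added.
print(r.pfcount('crawled_urls:hll'))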

leopeng1995 commented 7 years ago

Yes, it can return false positives. It saves space when tracking large numbers of website URLs and speeds up duplicate-URL checking. If a URL has not been crawled, the Bloom Filter is guaranteed to report false; if it reports true, that does not necessarily mean the URL has been crawled, so we then check it more deeply.
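
As a rough sketch of that two-step check (the seen_urls Redis set below is an assumption for illustration, not something this project provides):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
bf = BloomFilter()  # the class posted above

def already_crawled(url):
    # A negative from the Bloom Filter is always correct: the url is new.
    if not bf.isContains(url):
        return False
    # A positive may be a false positive, so confirm against an exact set.
    return r.sismember('seen_urls', url)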

Okay, I will read up on the concept of HyperLogLog in Redis. Thanks.

madisonb commented 7 years ago

@leopeng1995 I guess my question then is why would I want a deduplication filter that doesn't guarantee it will always be correct? For really large crawls you risk crawling millions of urls more than you need to, because the probabilistic data structure said the url wasn't in there when in reality we have already seen it.

This puts extra load on the cluster, and can cause data duplication or processing problems. If you crawl a page at time t0, the bloom filter mistakenly says we haven't crawled it yet, and the page has changed by the time you crawl it again at t1, you now have two different results because of a filtering error.

You then require extra processing, time, and validation of the data you receive from your crawlers. If that is worth it to you, I think the bloom filter could fit right into the deduplication logic, but from what I understand that is a custom implementation or desired functionality, not something this project supports. I don't see a reason to change the core "out of the box" filtering when I can't guarantee the crawls are always valid or completely controlled.

leopeng1995 commented 7 years ago

@madisonb Thanks for your reply. :-) I think this feature should be added as a custom plugin. I will consider implementing it.
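
A minimal sketch of what such a plugin might look like as a standard Scrapy dupefilter built on the BloomFilter class above (the settings names and module path are assumptions, and this is not how scrapy-cluster's own filtering works):

from scrapy.dupefilters import BaseDupeFilter

class BloomDupeFilter(BaseDupeFilter):
    def __init__(self, host='localhost', port=6379, db=0, key='bloomfilter'):
        self.bf = BloomFilter(host=host, port=port, db=db, key=key)

    @classmethod
    def from_settings(cls, settings):
        # REDIS_HOST / REDIS_PORT are hypothetical setting names for this sketch.
        return cls(host=settings.get('REDIS_HOST', 'localhost'),
                   port=settings.getint('REDIS_PORT', 6379))

    def request_seen(self, request):
        # True means "skip this request"; otherwise record it and let it through.
        if self.bf.isContains(request.url):
            return True
        self.bf.insert(request.url)
        return False

It could then be enabled by pointing the built-in Scrapy setting DUPEFILTER_CLASS at this class.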