mozilla / filter-cascade

A Python filter cascade implementation
Mozilla Public License 2.0

Exploit Concurrency #14

Open jcjones opened 4 years ago

jcjones commented 4 years ago

Particularly when using SHA-256, this seems like it should be full of parallelism to exploit.

Some basic tests of adding concurrency to the filter's add method (via an add_all) show that, even on a powerful CPU, the overhead of dispatching SHA-256 jobs and collecting the results back eats up most of the savings from parallelizing.

E.g., adding these methods to the Bloomer object:

    def __produce_all_futures(self, keyIter, executor):
        # One hash job per (key, hash function) pair.
        for key in keyIter:
            for i in range(self.nHashFuncs):
                yield executor.submit(self.hash, hash_no=i, key=key)

    def add_all(self, keyIter, executor):
        all_futures = self.__produce_all_futures(keyIter, executor)
        # Collect single-threaded so the bitarray writes can't race.
        for future in concurrent.futures.as_completed(all_futures):
            self.bitarray[future.result()] = True

and then using it in initialize as filter.add_all(include, executor) provides no speedup on a 1M-key dataset with 64 threads. This code keeps the bitarray updates single-threaded, after all the hashes complete, to avoid contention on the writes, but clearly that's not a winning strategy.
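One batching idea that might cut the per-future overhead: submit a chunk of keys per task rather than one hash per task, so each future carries thousands of digests. This is only a sketch; `bit_indices`, `hash_chunk`, and the parameters below are illustrative stand-ins for `Bloomer.hash`, not the code in this repo:

```python
import concurrent.futures
import hashlib
from itertools import islice

# Illustrative parameters; a real Bloomer derives these from the dataset.
N_HASH_FUNCS = 4
N_BITS = 1 << 20

def bit_indices(key):
    """All N_HASH_FUNCS bit positions for one key (stand-in for Bloomer.hash)."""
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % N_BITS
        for i in range(N_HASH_FUNCS)
    ]

def hash_chunk(keys):
    """One task hashes a whole chunk, amortizing the dispatch overhead."""
    out = []
    for key in keys:
        out.extend(bit_indices(key))
    return out

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def add_all_batched(bitarray, keys, executor, chunk_size=10_000):
    futures = [executor.submit(hash_chunk, c) for c in chunked(keys, chunk_size)]
    # Bit writes stay single-threaded, as in the original add_all.
    for future in concurrent.futures.as_completed(futures):
        for idx in future.result():
            bitarray[idx] = True
```

With a ThreadPoolExecutor the digests are too small for hashlib to release the GIL meaningfully, so a ProcessPoolExecutor is more likely to show a win here, at the cost of pickling each chunk across the process boundary.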

A similar contains_all for the false-positive calculation is not a particularly useful addition to Bloomer either:

    def contains_all(self, keyIter, executor):
        # A key is only in the filter if all nHashFuncs bits are set, so
        # track a running AND per key and yield once its last job finishes.
        futures_to_key = {}
        remaining, all_set = {}, {}
        for key in keyIter:
            remaining[key] = self.nHashFuncs
            all_set[key] = True
            for i in range(self.nHashFuncs):
                futures_to_key[executor.submit(self.hash, hash_no=i, key=key)] = key
        for future in concurrent.futures.as_completed(futures_to_key):
            key = futures_to_key[future]
            all_set[key] = all_set[key] and self.bitarray[future.result()]
            remaining[key] -= 1
            if remaining[key] == 0:
                yield (key, all_set[key])

This gets called in initialize as:

    for elem, in_filter in filter.contains_all(exclude, executor):
        exclude_count += 1
        if in_filter:
            false_positives.add(elem)
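The same batching idea applies to the false-positive scan: hand each task a whole chunk of keys plus the bit vector, and have it return only the keys whose positions are all set. Again a sketch with hypothetical helper names, not the branch's code:

```python
import concurrent.futures
import hashlib
from itertools import islice

# Illustrative stand-ins for Bloomer internals.
N_HASH_FUNCS = 4
N_BITS = 1 << 20

def bit_indices(key):
    """All N_HASH_FUNCS bit positions for one key (stand-in for Bloomer.hash)."""
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % N_BITS
        for i in range(N_HASH_FUNCS)
    ]

def contains_chunk(bits, keys):
    """One task: keep only the keys whose every bit position is set."""
    return [k for k in keys if all(bits[i] for i in bit_indices(k))]

def false_positives(bits, exclude, executor, chunk_size=10_000):
    it = iter(exclude)
    futures = []
    while chunk := list(islice(it, chunk_size)):
        futures.append(executor.submit(contains_chunk, bits, chunk))
    hits = set()
    for future in concurrent.futures.as_completed(futures):
        hits.update(future.result())
    return hits
```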

The verify function has more obvious room for speedup, but it's more complex to write.

It's possible that parallelism would be best exploited at the Cascade layer instead of the Filter layer. The most obvious approach would be to start calculating deeper layers as soon as a false positive is caught, but that requires considerable structural updates, and might not itself be a big enough gain to be worth the engineering.
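That pipelining idea could look roughly like the toy below: a queue hands false positives to the next layer's builder the moment they're found, so the two layers overlap. The predicate standing in for a layer's filter and all names here are hypothetical, not the library's API:

```python
import queue
import threading

SENTINEL = object()

def scan_layer(candidates, is_false_positive, out_q):
    """Producer: scan this layer, forwarding false positives immediately."""
    for key in candidates:
        if is_false_positive(key):
            out_q.put(key)
    out_q.put(SENTINEL)  # signal that the scan is done

def build_next_layer(in_q):
    """Consumer: starts work as soon as the first false positive arrives."""
    collected = []
    while (key := in_q.get()) is not SENTINEL:
        collected.append(key)  # a real cascade would do next_filter.add(key)
    return collected

q = queue.Queue()
producer = threading.Thread(
    target=scan_layer, args=(range(1000), lambda k: k % 7 == 0, q)
)
producer.start()
next_layer_keys = build_next_layer(q)
producer.join()
```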

Probably I just need to think on this some more, but in the meantime, a 55M-entry dataset on a single thread takes about 10 minutes to crunch, assuming it all fits in memory. (Add I/O time if not!)
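Back-of-the-envelope, taking the quoted 10 minutes as wall time for the single-threaded pass:

```python
entries = 55_000_000
seconds = 10 * 60
rate = entries / seconds
print(f"{rate:,.0f} keys/s")  # roughly 91,667 keys/s on one core
```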

jcjones commented 4 years ago

There's a branch for this, https://github.com/jcjones/filter-cascade/tree/14-concurrency