Particularly when using sha256, it seems like this should be full of parallelism to exploit.
Some basic tests of adding concurrency to the filter's add method (turning it into an add_all) show that, on a powerful CPU, the overhead of handing out SHA256 jobs and collecting the results back eats up most of the savings from parallelizing.
E.g., adding these methods to the Bloomer object:
def __produce_all_futures(self, keyIter, executor):
    for key in keyIter:
        for i in range(self.nHashFuncs):
            yield executor.submit(self.hash, hash_no=i, key=key)

def add_all(self, keyIter, executor):
    all_futures = self.__produce_all_futures(keyIter, executor)
    for future in concurrent.futures.as_completed(all_futures):
        self.bitarray[future.result()] = True
and then, in initialize, calling it as filter.add_all(include, executor) provides no speedup with a 1M-key dataset and 64 threads. This code avoids the GIL by single-threading the bitarray updates after all the hashes are done, but clearly that's not a winning strategy.
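One pattern that usually helps here is batching: submit one future per chunk of keys instead of one per (key, hash) pair, so the executor's per-task overhead is amortized across the chunk. A sketch of the idea, using a toy stand-in class (MiniBloom, add_all_chunked, and the salt scheme are all hypothetical names for illustration, not the real Bloomer API):

```python
import concurrent.futures
import hashlib

class MiniBloom:
    """Toy stand-in for the post's Bloomer: k SHA256-derived hash
    functions over a fixed-size bit array (a plain list of bools)."""
    def __init__(self, size=1 << 16, nHashFuncs=4):
        self.size = size
        self.nHashFuncs = nHashFuncs
        self.bitarray = [False] * size

    def hash(self, hash_no, key):
        h = hashlib.sha256(b"%d:%s" % (hash_no, key.encode())).digest()
        return int.from_bytes(h[:8], "big") % self.size

    def add_all_chunked(self, keys, executor, chunk=1024):
        # One future per chunk of keys, not per (key, hash_no) pair:
        # this amortizes the submit/collect overhead over many hashes.
        def hash_chunk(chunk_keys):
            return [self.hash(i, k)
                    for k in chunk_keys
                    for i in range(self.nHashFuncs)]
        chunks = [keys[i:i + chunk] for i in range(0, len(keys), chunk)]
        for fut in concurrent.futures.as_completed(
                executor.submit(hash_chunk, c) for c in chunks):
            for idx in fut.result():  # bit flips stay single-threaded
                self.bitarray[idx] = True

    def contains(self, key):
        return all(self.bitarray[self.hash(i, key)]
                   for i in range(self.nHashFuncs))
```

One caveat worth noting: CPython's hashlib only releases the GIL when hashing inputs larger than a couple of KiB, so for many small keys a ProcessPoolExecutor (rather than threads) may be needed before this shape shows real gains.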
A similar contains_all for the false-positive calculation is not a particularly useful addition to Bloomer:
def contains_all(self, keyIter, executor):
    futures_to_key = {}
    for key in keyIter:
        for i in range(self.nHashFuncs):
            futures_to_key[executor.submit(self.hash, hash_no=i, key=key)] = key
    # A key is "in" the filter only if every one of its bits is set, so
    # aggregate the per-hash results and yield one answer per key.
    pending = {}
    for future in concurrent.futures.as_completed(futures_to_key):
        key = futures_to_key[future]
        seen, all_set = pending.get(key, (0, True))
        seen, all_set = seen + 1, all_set and self.bitarray[future.result()]
        if seen == self.nHashFuncs:
            del pending[key]
            yield (key, all_set)
        else:
            pending[key] = (seen, all_set)
This gets called in initialize as:
for elem, in_filter in filter.contains_all(exclude, executor):
    exclude_count += 1
    if in_filter:
        false_positives.add(elem)
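For the lookup side, executor.map avoids the futures_to_key bookkeeping entirely, since it returns results in input order; and because the hashing here can be a pure function of the key, the same code would run unchanged on a ProcessPoolExecutor to sidestep the GIL. A hedged sketch (hash_positions, SIZE, and K are illustrative names, not the real Bloomer):

```python
import concurrent.futures
import hashlib

SIZE = 1 << 16   # illustrative bit-array size
K = 4            # illustrative number of hash functions

def hash_positions(key):
    # Pure function of the key (no shared state), so it also works
    # unchanged with a ProcessPoolExecutor.
    return [int.from_bytes(
                hashlib.sha256(b"%d:%s" % (i, key.encode())).digest()[:8],
                "big") % SIZE
            for i in range(K)]

def contains_all(keys, bitarray, executor, chunksize=512):
    # executor.map preserves input order, so results pair with keys via
    # zip; chunksize batches the work to keep per-task overhead low
    # (it only takes effect with a ProcessPoolExecutor).
    for key, positions in zip(
            keys, executor.map(hash_positions, keys, chunksize=chunksize)):
        yield key, all(bitarray[p] for p in positions)
```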
More speedup is obvious in the verify function, but it's more complex to write.
It's possible that parallelism would be best exploited at the Cascade object layer, instead of the Filter. The most obvious would be to start calculating deeper layers as soon as a false-positive is caught, but that requires considerable structural updates, and might not itself be a big enough gain to be worth the engineering.
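The "start deeper layers early" idea would roughly take the shape of a producer/consumer pipeline: the exclude-scan of one layer feeds false positives to the next layer's builder as they are found, rather than after the whole scan completes. A purely hypothetical sketch, with sets standing in for the real filters (the first-character "filter" is deliberately lossy just so the toy example produces false positives):

```python
import queue
import threading

def build_two_layers(include, exclude):
    # Hypothetical pipeline: layer 2 starts absorbing false positives
    # while layer 1's exclude-scan is still running.
    layer1 = {k[0] for k in include}   # deliberately lossy stand-in filter
    layer2 = set()
    fp_queue = queue.Queue()
    DONE = object()

    def scan():
        for key in exclude:
            if key[0] in layer1:       # a "false positive" in the toy filter
                fp_queue.put(key)
        fp_queue.put(DONE)

    def absorb():
        while (key := fp_queue.get()) is not DONE:
            layer2.add(key)            # next layer builds concurrently

    threads = [threading.Thread(target=scan), threading.Thread(target=absorb)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return layer1, layer2
```

Whether this wins in practice depends on how large the false-positive stream is relative to the scan itself, which is exactly the "big enough gain to be worth the engineering" question above.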
Probably I just need to think on this some more, but in the meantime, a 55M-entry dataset on a single thread takes about 10 minutes to crunch, assuming it all fits in memory. (Add I/O time if not!)
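For a rough budget, that single-threaded figure works out to:

```python
entries = 55_000_000
seconds = 10 * 60
per_second = entries / seconds           # roughly 92k entries per second
per_entry_us = seconds / entries * 1e6   # roughly 11 microseconds per entry
```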