radanalyticsio / silex

something to help you spark
Apache License 2.0

Approximate whitelist functionality for text processing #40

Closed willb closed 8 years ago

willb commented 8 years ago

An ApproximateWhitelist is a basic Bloom filter intended for holding natural-language vocabularies. It deals with String values natively and can be trained from a sequence or from an RDD of any element type T, as long as there is an implicit conversion in scope from T to String.
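To make the idea concrete, here is a minimal sketch of a string-oriented Bloom filter with the kind of implicit-conversion training described above. It is not the silex implementation: the names `SketchWhitelist`, `maybeContains`, `train`, and `Token`, the double-hashing scheme, and the sizing parameters are all assumptions chosen for illustration, and only the in-memory sequence case is shown, not the RDD one.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical sketch of a Bloom filter over Strings; NOT the silex ApproximateWhitelist API.
final class SketchWhitelist(numBits: Int, numHashes: Int) {
  require(numBits > 0 && numHashes > 0, "numBits and numHashes must be positive")

  private val bits = new BitSet(numBits)

  // Derive numHashes bit positions from two base hashes (double hashing: h1 + i * h2).
  private def indices(s: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(s, 0)
    val h2 = MurmurHash3.stringHash(s, h1)
    (0 until numHashes).map(i => java.lang.Math.floorMod(h1 + i * h2, numBits))
  }

  // Record a word by setting all of its bit positions.
  def add(s: String): Unit = indices(s).foreach(i => bits.set(i))

  // True for every added word; may also be true for a word never added (false positive).
  def maybeContains(s: String): Boolean = indices(s).forall(i => bits.get(i))

  // Train from any element type T, given an implicit T => String conversion in scope.
  def train[T](items: Iterable[T])(implicit toStr: T => String): this.type = {
    items.foreach(item => add(toStr(item)))
    this
  }
}

object SketchWhitelistDemo {
  // Hypothetical element type, used only to exercise the implicit conversion.
  final case class Token(text: String)
  implicit val tokenToString: Token => String = _.text

  val vocabulary: SketchWhitelist =
    new SketchWhitelist(1 << 16, 4).train(Seq(Token("spark"), Token("silex")))
}
```

With this sketch, `SketchWhitelistDemo.vocabulary.maybeContains("spark")` is guaranteed to be true, while a lookup for a word that was never added is only probably false. That trade is what makes the structure an approximate whitelist: memory stays fixed regardless of vocabulary size, at the cost of occasional false positives.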