tdebatty / java-LSH

A Java implementation of Locality Sensitive Hashing (LSH)
Other
294 stars 84 forks source link

Use Bitset instead of Array<boolean> #25

Open rocketraman opened 5 years ago

rocketraman commented 5 years ago

The current implementation uses a boolean[] as an input. Use of a BitSet (https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) would be a lot more efficient.

For example, if dictionary size is Integer.MAX_INT, as it would be with the "hashing shingles" approach given in 3.2.3 of Ullman et al, I need to allocate 2GB of memory to store an array of booleans. With BitSet, I can store that in approximately 8 times less space.

tdebatty commented 5 years ago

Hi,

Would be interesting, but only for very large dictionaries (probably 100 million entries or more): https://stackoverflow.com/a/605451/4770918