Add binary serialization for speed and ease

jasondavies / bloomfilter.js

JavaScript bloom filter using FNV for fast hashing

http://www.jasondavies.com/bloomfilter/

BSD 3-Clause "New" or "Revised" License

759 stars 79 forks source link

Add binary serialization for speed and ease #19

Closed connor4312 closed 5 years ago

connor4312 commented 8 years ago

This adds an official way to serialize the bloom filter (in Node) into a Buffer. This is faster than the recommended array method, and somewhat more idiomatic. It's also inherently smaller and compresses better :smile:

Benchmark:

var bloom = new BloomFilter(1000, 4);
bloom.add(1);

suite
    .add('json serialization', function() {
        var json = JSON.stringify(Array.prototype.slice.call(bloom.buckets));
        new BloomFilter(JSON.parse(json), 3);
    }).add('binary serialization', function() {
        var blob = bloom.serialize();
        BloomFilter.deserialize(blob);
    }).on('cycle', function(event) {
        console.log(String(event.target));
    }).run();

// Result:
// json serialization x 72,121 ops/sec ±2.03% (91 runs sampled)
// binary serialization x 204,417 ops/sec ±2.76% (63 runs sampled)

michaelficarra commented 8 years ago

The unnecessary [].slice.call call is likely strongly affecting the benchmark results.

connor4312 commented 8 years ago

I was following the currently recommended method of serialization from the readme. The slice call is necessary on platforms that support typed arrays, which includes Node.js and most modern browsers.

> var a = new Int32Array(3);
> a[0] = 1, a[1] = 2, a[2] = 3;
> JSON.stringify(a);
'{"0":1,"1":2,"2":3}'
> JSON.stringify(Array.prototype.slice.call(a));
'[1,2,3]'

Stringifying and then parsing the buckets without the slice call also does not allow for correct storage in bloomfilter's current implementation. Replacing the unit test I added to use JSON on the buckets rather than de/serialize (new BloomFilter(JSON.parse(JSON.stringify(start.buckets)), 4)), you'll get the following failure:

    bloom filter
      ✗ serialization
        » expected 1024,
    got  NaN (==) // bloomfilter-test.js:64
  ✗ Broken » 6 honored ∙ 1 broken (0.021s)

connor4312 commented 8 years ago

We're now using this in production and it's working well. :smile:

serv-inc commented 8 years ago

Thank you for this approach. The array version could be stored to local storage. How about this?

connor4312 commented 8 years ago

Hey, sorry, managed to miss your message before. This cannot directly be stored in localstorage, as localstorage only supports strings. However, DOMStrings in JavaScript are UTF-16 encoded, meaning that you can easily treat the array as a string and convert back and forth, like this:

function bytesToString(arr) {
  var out = '';
  for (let i = 0; i < arr.length; i += 2) {
    out += String.fromCharCode((arr[i] << 8) | arr[i + 1]);
  }
  return out;
}

function stringToBytes(str) {
  const arr = new Uint16Array(str.length * 2);
  for (let i = 0; i < str.length; i++) {
    let code = str.fromCharCode(str[i]);
    arr[i * 2 + 0] = code >> 8;
    arr[i * 2 + 1] = code;
  }
  return arr;
}

connor4312 commented 8 years ago

Any updates on this?

papandreou commented 7 years ago

@jasondavies, ping?

In the mean time I've published a public npm package with this patch: bloomfilter-papandreou@0.0.16-patch1