libminisketch
is an optimized standalone MIT-licensed library with C API for constructing and decoding set sketches, which can be used for compact set reconciliation and other applications.
It is an implementation of the PinSketch[1] algorithm. An explanation of the algorithm can be found here.
Sketches, as produced by this library, can be seen as "set checksums" with two peculiar properties:
libminisketch
will always recover the entire set from the sketch. A sketch of b-bit elements with capacity c can be stored in bc bits.This makes them appropriate for a very bandwidth-efficient set reconciliation protocol. If Alice and Bob each have a set of elements, and they suspect that the sets largely but not entirely overlap, they can use the following protocol to let both parties learn all the elements:
This will always succeed when the size of the difference (elements that Alice has but Bob doesn't plus elements that Bob has but Alice doesn't) does not exceed the capacity of the sketch that Alice sent. The interesting part is that this works regardless of the actual set sizes—only the difference matters.
If the elements are large, it may be preferable to compute the sketches over hashes of the set elements. In that case an additional step is added to the protocol, where Bob also sends the hash of every element he does not have to Alice, who responds with the requested elements.
The doc/ directory has additional tips for designing reconciliation protocols using libminisketch.
The first graph above shows a benchmark of libminisketch
against three other set reconciliation algorithms/implementations. The benchmarks were performed using a single core on a system with an Intel Core i7-7820HQ CPU with clock speed locked at 2.4 GHz. The diagram shows the time needed for merging of two sketches and decoding the result. The creation of a sketch on the same machine takes around 5 ns per capacity and per set element. The other implementations are:
pinsketch
, the original PinSketch implementation.cpisync
, a software project which implements a number of set reconciliation algorithms and protocols. The included benchmark analyzes the non-probabilistic version of the original CPISync algorithm[5] only.For the largest sizes currently of interest to the authors, such as a set of capacity 4096 with 1024 differences, libminisketch
is forty-nine times faster than pinsketch
and over eight thousand times faster than cpisync
. libminisketch
is fast enough on realistic set sizes for use on high-traffic network servers where computational resources are limited.
Even where performance is latency-limited, small minisketches can be fast enough to improve performance. On the above i7-7820HQ, a set of 2500 30-bit entries with a difference of 20 elements can be communicated in less time with a minisketch than sending the raw set so long as the communications bandwidth is 1 gigabit per second or less; an eight-element difference can be communicated in better than one-fifth the time on a gigabit link.
The second graph above shows the performance of the same algorithms on the same system, but this time keeping the capacity constant at 128, while varying the number of differences to reconcile between 1 and 128. It shows how cpisync
's reconciliation speed is mostly dependent on capacity, while pinsketch
/libminisketch
are more dependent on number of differences.
The third graph above shows the size overhead of a typical IBLT scheme over the other algorithms (which are near-optimal bandwidth), for various levels of failure probability. IBLT takes tens of times the bandwidth of libminisketch
sketches when the set difference size is small and the required failure rate is low.
The fourth graph above shows the effect of the field size on speed in libminisketch
. The three lines correspond to:
It shows how CLMUL implementations are faster for certain fields (specifically, field sizes for which an irreducible polynomial of the form xb + x + 1 over GF(2) exists, and to a lesser extent, fields which are a multiple of 8 bits). It also shows how (for now) a significant performance drop exists for fields larger than 32 bits on 32-bit platforms. Note that the three lines are not at the same scale (the Raspberry Pi 3B is around 10x slower for 32-bit fields than the Core i7; the POWER9 is around 1.3x slower).
Below we compare the PinSketch algorithm (which libminisketch
is an implementation of) with other set reconciliation algorithms:
Algorithm | Sketch size | Decode success | Decoding complexity | Difference type | Secure sketch |
---|---|---|---|---|---|
CPISync[2] | (b+1)c | Always | O(n3) | Both | Yes |
PinSketch[1] | bc | Always | O(n2) | Symmetric only | Yes |
IBLT[6][7] | αbc (see graph 3) | Probabilistic | O(n) | Depends | No |
The build system is very rudimentary for now, and improvements are welcome.
The following may work and produce a libminisketch.a
file you can link against:
git clone https://github.com/sipa/minisketch
cd minisketch
./autogen.sh && ./configure && make
In this section Alice and Bob are trying to find the difference between their sets. Alice has the set [3000 ... 3009], while Bob has [3002 ... 3011].
First, Alice creates a sketch:
#include <stdio.h>
#include <assert.h>
#include "../include/minisketch.h"
int main(void) {
minisketch *sketch_a = minisketch_create(12, 0, 4);
The arguments are:
Then Alice adds her elements to her sketch. Note that adding the same element a second time removes it again, as sketches have set semantics, not multiset semantics.
for (int i = 3000; i < 3010; ++i) {
minisketch_add_uint64(sketch_a, i);
}
The next step is serializing the sketch into a byte array:
size_t sersize = minisketch_serialized_size(sketch_a);
assert(sersize == 12 * 4 / 8); // 4 12-bit values is 6 bytes.
unsigned char *buffer_a = malloc(sersize);
minisketch_serialize(sketch_a, buffer_a);
minisketch_destroy(sketch_a);
The contents of the buffer can then be submitted to Bob, who can create his own sketch:
minisketch *sketch_b = minisketch_create(12, 0, 4); // Bob's own sketch
for (int i = 3002; i < 3012; ++i) {
minisketch_add_uint64(sketch_b, i);
}
After Bob receives Alice's serialized sketch, he can reconcile:
sketch_a = minisketch_create(12, 0, 4); // Alice's sketch
minisketch_deserialize(sketch_a, buffer_a); // Load Alice's sketch
free(buffer_a);
// Merge the elements from sketch_a into sketch_b. The result is a sketch_b
// which contains all elements that occurred in Alice's or Bob's sets, but not
// in both.
minisketch_merge(sketch_b, sketch_a);
uint64_t differences[4];
ssize_t num_differences = minisketch_decode(sketch_b, 4, differences);
minisketch_destroy(sketch_a);
minisketch_destroy(sketch_b);
if (num_differences < 0) {
printf("More than 4 differences!\n");
} else {
ssize_t i;
for (i = 0; i < num_differences; ++i) {
printf("%u is in only one of the two sets\n", (unsigned)differences[i]);
}
}
}
In this example Bob would see output such as:
$ gcc -std=c99 -Wall -Wextra -o example ./doc/example.c -Lsrc/ -lminisketch -lstdc++ && ./example
3000 is in only one of the two sets
3011 is in only one of the two sets
3001 is in only one of the two sets
3010 is in only one of the two sets
The order of the output is arbitrary and will differ on different runs of minisketch_decode().
Communications efficient set reconciliation has been proposed to optimize Bitcoin transaction distribution[8], which would allow Bitcoin nodes to have many more peers while reducing bandwidth usage. It could also be used for Bitcoin block distribution[9], particularly for very low bandwidth links such as satellite. A similar approach (CPISync) is used by PGP SKS keyservers to synchronize their databases efficiently. Secure sketches can also be used as helper data to reliably extract a consistent cryptographic key from fuzzy biometric data while leaking minimal information[1]. They can be combined with dcnets to create cryptographic multiparty anonymous communication[10].
libminisketch
is written in C++11, but has a C API for compatibility reasons.
Specific algorithms and optimizations used:
Some improvements that are still TODO: