lemire / FastPFor

The FastPFOR C++ library: Fast integer compression
Apache License 2.0
849 stars 124 forks source link

Synthetic data #47

Closed amallia closed 5 years ago

amallia commented 5 years ago

https://github.com/lemire/FastPFor/blob/71d54a9793245ae90e69c86a425d4ee1ee6543d8/headers/synthetic.h#L102-L128

The above is the synthetic data generator for a clustered series. The reference is "Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words".

The original paper says the following:

The second family, ClusterData, has a structure similar to UniformData. However, the sequence is generated in a way that creates a clustered rather than uniform distribution, to more closely model the way that terms in an information retrieval system tend to be clustered across clumps of documents. A recursive process is used to set, in a clustered manner, f bits of the array A[l ...r] of bits, where f ≤r −l +1. If f is small ( f <10) then f locations in A[l ...r] are selected, and the corresponding bits are turned on. When f ≥10, the array A is randomly divided into two sub-arrays A[l ...m] and A[m+1...r] for some choice of m, and the task becomes that of turning on f/2 bits in each of the sub-arrays, with care taken so that the number of 1-bits does not exceed either of the two sub-array lengths.

In the source code the cut (which corresponds to m in the paper) is never used to split the vector, but just to adapt min/max of the recursive calls.

lemire commented 5 years ago

They describe a bit vector whereas we implement their algorithm to generate an array of sorted integers. So when they split A[l...r] into A[l ...m] and A[m+1...r], they mean to divide the universe (1...r) into two subuniverses (l...m) and (m+1...r).

It is always possible, of course, that my implementation can be improved. If so, a pull request with an analysis would be welcome.