Synthetic data - Githubissues

https://github.com/lemire/FastPFor/blob/71d54a9793245ae90e69c86a425d4ee1ee6543d8/headers/synthetic.h#L102-L128

The above is the synthetic data generator for a clustered series. The reference is "Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words".

The original paper says the following:

The second family, ClusterData, has a structure similar to UniformData. However, the sequence is generated in a way that creates a clustered rather than uniform distribution, to more closely model the way that terms in an information retrieval system tend to be clustered across clumps of documents. A recursive process is used to set, in a clustered manner, f bits of the array A[l ...r] of bits, where f ≤r −l +1. If f is small ( f <10) then f locations in A[l ...r] are selected, and the corresponding bits are turned on. When f ≥10, the array A is randomly divided into two sub-arrays A[l ...m] and A[m+1...r] for some choice of m, and the task becomes that of turning on f/2 bits in each of the sub-arrays, with care taken so that the number of 1-bits does not exceed either of the two sub-array lengths.

In the source code the cut (which corresponds to m in the paper) is never used to split the vector, but just to adapt min/max of the recursive calls.

lemire / FastPFor

Synthetic data #47