This is the work for my CS master's degree thesis, carried out under the supervision of professors Paolo Boldi and Sebastiano Vigna.
This repository contains an extension to WebGraph that implements the Partitioned Elias-Fano algorithm as presented by Ottaviano and Venturini in their paper. It reuses classes, interfaces and data structures from WebGraph, adding the Partitioned Elias-Fano (PEF) compression scheme to the ones already available in the library. The implementation of PEFGraph is here.
Some performance tests have been run to compare Partitioned Elias-Fano with the Elias-Fano and BVGraph implementations from WebGraph and with the k2-tree algorithm as presented here and implemented here. The code used to run the tests is WebGraphTester. Some other utilities can be found here.
Note that much of the code in PEFGraph is boilerplate copied over from the BVGraph class in order to implement some helper methods.
The idea behind Partitioned Elias-Fano, as highlighted by Ottaviano and Venturini, starts from the observation that the Elias-Fano representation of monotone sequences has been used very successfully to compress inverted indexes. However, even though the space it uses is comparable to that of other methods (such as γ-δ-Golomb or PForDelta), compressing a monotone sequence with Elias-Fano does not exploit the fact that the adjacency lists of a web graph are heavily clustered, i.e. it is very common in a web graph to find runs of consecutive successors. The Partitioned Elias-Fano algorithm splits the successor list into sublists to obtain better compression.
Consider the Elias-Fano approach, where the space needed is 2 + ⌈log(u/m)⌉ bits per element, with u being the upper bound of the sequence and m the number of elements to compress. If you imagine having to compress a monotone sequence of m consecutive numbers, all you would need to store is the leftmost (or rightmost) element and the length of the sequence. Elias-Fano, on the other hand, requires exactly the same space needed for a list of random values, ignoring the fact that they are all consecutive. The idea of PEF is then to divide each sequence into chunks and compress each chunk with the most efficient algorithm.
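As a back-of-the-envelope illustration (not code from the repository), the following Java sketch compares the approximate Elias-Fano cost of a run of consecutive integers with the two integers that would suffice to describe it:

```java
public class EliasFanoCostExample {
    // Approximate Elias-Fano cost in bits for m elements from a universe of size u:
    // about 2 + ceil(log2(u / m)) bits per element.
    static long efBits(long u, long m) {
        long ratio = Math.max(1, u / m);
        int logCeil = 64 - Long.numberOfLeadingZeros(ratio - 1); // ceil(log2(ratio)), 0 when ratio == 1
        return m * (2 + logCeil);
    }

    public static void main(String[] args) {
        long m = 1_000;               // a run of 1000 consecutive successors...
        long u = 1_000_000 + m;       // ...whose values start around 1,000,000
        System.out.println("Elias-Fano: ~" + efBits(u, m) + " bits");
        // Describing the same run explicitly only needs its first element and its length.
        System.out.println("Run encoding: 2 integers (~128 bits)");
    }
}
```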
A key point of the algorithm is its two-layer approach to compressing the graph. The first level contains the metadata needed to understand how the division into chunks was done and how to decompress each chunk; the second level contains the finer data of each specific chunk. Finding the optimal partition is not trivial: on one hand one would like to make the chunks as large as possible, to minimise the overhead of writing extra elements on the first level; on the other hand one would like to keep the chunks small, because this creates smaller universes and hence a better Elias-Fano compression of the second level.
In their paper, the authors suggest an exhaustive dynamic programming algorithm to find the optimal partition, but it takes Θ(n²) time, which is not feasible for graphs as large as web graphs. The authors also propose an approximation algorithm, which is the one used in this implementation.
The cost of each chunk is the sum of two values: the cost of its entry on the first level, which is fixed and does not depend on the compression algorithm used for the second level, and the cost of the second level, which does depend on the algorithm. The first level contains three integers per chunk, one of which indexes the bit vector of the second level, telling where the second-level representation of the items of that chunk begins. For a chunk S[i,j], where i and j are the indexes of the beginning and end of the chunk, u' is the size of the universe of the chunk and m' = j − i + 1 is the number of its elements, the cost of the second level is:

- 0 bits if m' = u', since the chunk is a run of consecutive integers and the first-level metadata describes it completely;
- u' bits if compressing S[i,j] with Elias-Fano would use more than u' bits, in which case a plain bitmap is cheaper;
- m'(⌈log(u'/m')⌉ + 2) bits if compressed with Elias-Fano.

An important property needed for the approximation algorithm to work is that the cost function C is monotone: for each i, j, k such that 0 ≤ i < j < k ≤ m, C(S[i,j]) ≤ C(S[i,k]). In other words, adding elements to the end of a chunk can never make its cost lower; at best it leaves it unchanged.
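A minimal sketch of this case analysis, with hypothetical names and a placeholder for the fixed first-level cost (this is not the code used in PEFGraph), could look like this:

```java
public final class ChunkCost {
    /** Fixed first-level overhead per chunk, in bits (placeholder value; the real
     *  cost depends on how the first level is encoded). */
    static final long FIRST_LEVEL_BITS = 3 * 32;

    /** Cost in bits of the second level for a chunk with universe size u1 and m1 elements. */
    static long secondLevelBits(long u1, long m1) {
        if (m1 == u1) return 0;                       // run of consecutive integers: the first level suffices
        long ratio = Math.max(1, u1 / m1);
        int logCeil = 64 - Long.numberOfLeadingZeros(ratio - 1); // ceil(log2(u1 / m1))
        long eliasFano = m1 * (logCeil + 2);
        return Math.min(eliasFano, u1);               // fall back to a plain bitmap when cheaper
    }

    /** Total cost of a chunk: fixed first-level part plus encoding-dependent second level. */
    static long cost(long u1, long m1) {
        return FIRST_LEVEL_BITS + secondLevelBits(u1, m1);
    }
}
```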
The problem of finding the optimal partition is reduced to finding a minimum-weight path in a specific directed acyclic graph G. Given the sequence S of m integers, G is the graph having a vertex v0, v1, ..., vm-1 for each position of S, plus a final dummy vertex vm that represents the end of the sequence. Intervals are always expressed including the first index and excluding the last one, so the size of an interval is the difference between its two indexes. G is complete in the sense that for each i, j with i < j ≤ m there is an edge from vi to vj: each vertex has an edge to every following vertex and an incoming edge from every preceding vertex.
Note that there is a one-to-one correspondence between the paths from v0 to vm and the partitions of S. A path π = (v0, vi1)(vi1, vi2) ... (vik-1, vm) through k edges corresponds to the partition S[0, i1 - 1][i1, i2 - 1]...[ik-1, m - 1] made of k blocks. If each edge is assigned the weight w(vi, vj) = C(S[i, j-1]), with C() being the cost function, the weight of a path is the cost, in bits, of compressing the corresponding partition. The problem of finding the best partition thus becomes the problem of finding the cheapest path in G. However, the time complexity of such a shortest-path computation is linear in the number of edges, which in a complete graph is quadratic in the number of vertices, so this is not a viable solution. The authors therefore suggest an approximation algorithm to find the partition, and it is the one implemented and used in this project.
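For concreteness, this is what the exact quadratic shortest-path formulation looks like as a Java sketch; it is only illustrative, since the project uses the approximation algorithm instead, and the cost function is passed in as a generic callback:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongBinaryOperator;

public class OptimalPartition {
    /**
     * Exact shortest path on the complete DAG over positions 0..m:
     * best[j] is the cheapest cost of partitioning the prefix S[0, j-1].
     * cost.applyAsLong(i, j) must return the cost in bits of the chunk S[i, j-1].
     * Runs in O(m^2) evaluations of the cost function.
     */
    public static List<int[]> partition(int m, LongBinaryOperator cost) {
        long[] best = new long[m + 1];
        int[] prev = new int[m + 1];
        for (int j = 1; j <= m; j++) {
            best[j] = Long.MAX_VALUE;
            for (int i = 0; i < j; i++) {
                long c = best[i] + cost.applyAsLong(i, j);
                if (c < best[j]) { best[j] = c; prev[j] = i; }
            }
        }
        // Walk the predecessors back to recover the chunks [i, j) of the best partition.
        List<int[]> chunks = new ArrayList<>();
        for (int j = m; j > 0; j = prev[j]) chunks.add(0, new int[] { prev[j], j });
        return chunks;
    }
}
```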
The implementation in this repository writes and reads three vectors:
- offsets: n values, n being the number of nodes in the graph. The element in position i contains the position of the firstLevel vector from which the data for node i begins.
- firstLevel: for each node, its out-degree, the lower bound of its posting list and the per-chunk metadata described below.
- the second-level bitstream, containing the compressed representation of the elements of each chunk.

The only element that consistently contains the same kind of information is the one pointed to by the offsets array: if you read position x of offsets (O[x]), the value in the corresponding position of firstLevel (F[O[x]]) is the out-degree of node x. The number of elements stored for x in firstLevel depends on the length of its posting list and on the distribution of the values in that list, since this determines its partition. If the out-degree is 0, no other values are written for that node, since the out-degree alone is enough to answer any query; the following value is then the out-degree of the next node.
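Assuming, for illustration only, that offsets and firstLevel have already been decoded into plain arrays (the real implementation reads them from compressed bit streams), the out-degree lookup described above reduces to:

```java
public class OutdegreeLookup {
    // O is the decoded offsets array, F the decoded firstLevel array (illustrative names).
    static long outdegree(long[] O, long[] F, int x) {
        return F[(int) O[x]];   // by construction, F[O[x]] is the out-degree of node x
    }
}
```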
If the posting list has at least one element, the next value is the lower bound of the posting list of x, i.e. its first successor. Storing this value can save the space of a whole first-level block in some edge cases (e.g. a list of n consecutive values where the first is much larger than 0: without it, the algorithm would probably create a first block containing only the first element, to reduce the size of the universe, and then put the rest of the list in a second, cheap, block). This makes sense because we do not expect multigraphs (i.e. the successor list contains distinct elements) and because we know that in the web, runs of consecutive successors are frequent.
Each subsequence identified by the approximation algorithm contributes 2 or 3 elements to the firstLevel vector, among: the maximum of the subsequence, stored relative to its lower bound (max - lowerbound); the length of the subsequence; and the index in the bitstream from which its second-level representation can be read.

The algorithm visits the nodes of the graph starting from node 0 and compresses the posting list of each. It starts by writing the out-degree in firstLevel and its index in offsets. If the out-degree is not 0, it then writes the first successor and moves on to analyse the successor list. The lower bound of a subsequence is the first element of the list for the first block, and the maximum of the previous subsequence + 1 for every other subsequence. The method that analyses the list to find the partition returns a list of Partition instances, each containing three values: the leftmost item (included in the sequence), the rightmost item (excluded) and the algorithm to use to compress that specific subsequence.
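The Partition instances can be pictured as a small value class along these lines (field and enum names are illustrative, not necessarily those of the actual code):

```java
public final class Partition {
    /** Second-level encoding chosen for this chunk. */
    public enum Algorithm { NONE, BITMAP, ELIAS_FANO }

    public final int from;          // first element of the chunk (included)
    public final int to;            // last element of the chunk (excluded)
    public final Algorithm algorithm;

    public Partition(int from, int to, Algorithm algorithm) {
        this.from = from;
        this.to = to;
        this.algorithm = algorithm;
    }
}
```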
The bitstream contains the best compression we can achieve for the elements of the sequence, and the offsets vector is a cost we cannot avoid. The latter is also already compressed, since each element is stored as the difference from the previous one, written in Elias-γ coding.
The firstLevel vector is the one that can be compressed further. The chosen approach is γ coding of each value, trying to minimise the magnitude of the values to be stored and hence the number of bits needed for each of them. As analysed above, five kinds of values are stored for each posting list: the node's out-degree, the list's lower bound (i.e. its first element) and, for each chunk, the largest element, the length of the subsequence and the index from which to read in the bitstream. The following describes how these values have been compressed further:
A node i often has neighbours close to i itself. Hence, we can store i's lower bound as its gap from i. Since this gap can also be negative, the method int2nat is used, so that negative numbers are mapped to positive values. This changes the distribution of the lower-bound values, as the following pictures show. The maximum of each chunk is similarly stored relative to the chunk's lower bound (max - lowerbound), which changes its distribution as well.

| Absolute lowerbound distribution | Relative lowerbound distribution |
|---|---|

| Absolute max distribution | Relative max distribution |
|---|---|
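The mapping performed by int2nat is the usual bijection between integers and natural numbers; a self-contained sketch of a mapping with this behaviour (illustrative, not copied from the library) is:

```java
public class Int2NatExample {
    // Maps a nonnegative integer x to 2x and a negative integer x to -2x - 1,
    // so that signed gaps can be written with codes defined on natural numbers.
    static long int2nat(long x) {
        return x >= 0 ? x << 1 : -((x << 1) + 1);
    }

    public static void main(String[] args) {
        // Example: node 42 whose first successor (lowerbound) is 40.
        long gap = 40 - 42;                  // relative lowerbound, here negative
        System.out.println(int2nat(gap));    // prints 3, a small natural number, cheap in gamma coding
    }
}
```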
The graph is usually accessed in one of three ways:
The ImmutableGraph interface expects a lazy iterator over the successors of x, but in this work the posting list is decompressed completely and a non-lazy iterator is returned instead. Selection and ranking are therefore performed on the decompressed sequence.
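Since the list is fully decompressed, selection and ranking become plain array operations; a minimal sketch, assuming the decompressed successors are kept in a sorted int array, is:

```java
import java.util.Arrays;

public class DecompressedAccess {
    /** select: the k-th successor (0-based) of the decompressed, sorted successor list. */
    static int select(int[] successors, int k) {
        return successors[k];
    }

    /** rank: number of successors strictly smaller than value (successors are distinct and sorted). */
    static int rank(int[] successors, int value) {
        int pos = Arrays.binarySearch(successors, value);
        return pos >= 0 ? pos : -pos - 1;   // insertion point when the value is absent
    }
}
```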