mpetri / recursive_graph_bisection

This program implements the following graph reordering technique: Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, Alon Shalita: Compressing Graphs and Indexes with Recursive Graph Bisection. KDD 2016: 1535-1544
BSD 3-Clause "New" or "Revised" License
10 stars 5 forks source link

Recursive graph bisection

This program implements the following graph reordering technique:

Requirements

The program requires cilkplus which should be available in all newish compilers (e.g. gcc7)

Installing and Compiling

To install and compile the code run the following commands

git clone https://github.com/mpetri/recursive_graph_bisection.git

And compile using make:

make

which produces the rec_graph_bisect.x binary.

Input format

We use the ds2i input format (description taken from the repo):

A binary sequence is a sequence of integers prefixed by its length, where both the sequence integers and the length are written as 32-bit little-endian unsigned integers.

A collection consists of 3 files, <basename>.docs, <basename>.freqs, <basename>.sizes.

Running command

To reorder the index the following options are provided:

rec_graph_bisect.x <ds2i_prefix> <ds2i_out_prefix> <min_list_len> <num threads>

where

Example

Say you have stored gov2 in ds2i format described above in a directory:

[10:56:28 mpetri]$ ls -l /storage/gov2-d2si/
total 43084248
-rw-r--r-- 1 mpetri mpetri 21765004632 Jul 18 14:37 gov2.docs
-rw-r--r-- 1 mpetri mpetri 21765004624 Jul 18 14:37 gov2.freqs
-rw-r--r-- 1 mpetri mpetri    98831272 Jul 18 14:37 gov2.sizes

so the ds2i_prefix would be /storage/gov2-ds2i/gov2

and we execute the bisection command with the parameters

rec_graph_bisect.x /storage/gov2-ds2i/gov2 /storage/gov2-ds2i/gov2-bisected 256 32

which uses 32 threads, a minimum list length of 256 and stores the result as:

[10:56:28 mpetri]$ ls -l /storage/gov2-d2si/
total 43084248
-rw-r--r-- 1 mpetri mpetri 21765004632 Jul 18 14:37 gov2.docs
-rw-r--r-- 1 mpetri mpetri 21765004624 Jul 18 14:37 gov2.freqs
-rw-r--r-- 1 mpetri mpetri    98831272 Jul 18 14:37 gov2.sizes
-rw-r--r-- 1 mpetri mpetri 21765004632 Jul 18 18:32 gov2-bisected.docs
-rw-r--r-- 1 mpetri mpetri 21765004624 Jul 18 18:32 gov2-bisected.freqs
-rw-r--r-- 1 mpetri mpetri    98831272 Jul 18 18:32 gov2-bisected.sizes
-rw-r--r-- 1 mpetri mpetri    98831272 Jul 18 18:32 gov2-bisected.mapping

where gov2-bisected.mapping specifies how the document identifiers were remapped in the following format:

<initial id> <new id>

Runtime

The code is not as optimized as in the paper but finishes in reasonable time frame. For example, gov2 can be reordered in less than two hours.

However, memory consumption is quite high. It requires at least O(size of input files) RAM.

Authors

License

This project is licensed under the BSD 3-Clause License - see the LICENSE.md file for details