Each community is comprised of several key players which have an impact on its overall behavior and play critical roles. Detecting what these players are is critical in various domains such as identifying key individuals or places when tracking criminal and terrorist activities. Our algorithm is able to quickly uncover the top n influential nodes located at major communication cross points inside a given community. It is designed to work on distributed environments making it suited for processing geographically spread data found in cloud systems.
The objective of the algorithm is to enable fast identification of top n nodes in a distributed sparse graph.
It is beneficial to data scientists seeking fast identification of top central players in large geographically distributed graph data.
The algorithm has been tested using MPI on an HPC cluster with a varying number of nodes ranging from 4 to 64 with 2 workers per node, each worker with 8 GB memory. Each node consists of two Quad-core AMD Opteron 2376 2.3 GHz processors. Speed-ups of up to 12x have been observed for synthetic and real-life sparse graphs up to 1M vertices and 3M edges.
To use the algorithm on a given data set:
Required
Familiarity with Linux
Manipulating graph data (potentially to convert the given data to the Metis graph format)
Ability to run MPI programs on a single node or cluster depending on graph size
Good to have
To use the quick start guide, familiarity with Virtual Box and virtualization
C++
MPI Library
To setup the environment on a cluster
Cluster administration knowledge
Setting up MPI environment on cluster
The algorithm is hosted at the git hub repository at https://github.com/usc-cloud/parallel-high-betweenness-centrality
Clone the repository using
git clone https://github.com/usc-cloud/parallel-high-betweenness-centrality.git
Note: You may need to install a git client to download the repository.
A quick start guide can be found here together with a precompiled VM to help you get started.
A detailed guide on how to install the software on a distributed setup can be found here.
Given the increasing popularity of cloud oriented graph analytics frameworks a GoFFish version of the proposed algorithm will be delivered.