thunderain-project / examples

A SimRank algorithm implementation using Spark

Spark SimRank Algorithm Implementation

This package includes five SimRank implementations: DFS (depth-first search) MapReduce, naive MapReduce, delta MapReduce, matrix multiplication, and PageRank-like Random Walk with Restart. You select the implementation through configuration.
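For readers unfamiliar with what these implementations compute, here is a minimal single-machine sketch of the standard iterative SimRank recurrence in plain Python. It is not taken from this package and does not use Spark; the function name `simrank`, the `decay` and `iterations` parameters, and the dictionary-based graph format are all assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors maps each node to the list of nodes that point to it.
    Returns a dict mapping (a, b) pairs to similarity scores.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            # A node without in-neighbors has zero similarity to any other node.
            if not ia or not ib:
                new_sim[(a, b)] = 0.0
                continue
            # Average the previous-round similarity over all in-neighbor pairs.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Tiny example: "x" and "y" are both pointed to by "p" and by nothing else.
graph = {"p": [], "x": ["p"], "y": ["p"]}
scores = simrank(graph)
```

Each round costs time proportional to the number of node pairs times the in-neighbor pairs, which is why distributed and incremental variants such as those listed above are of interest for larger graphs.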

This implementation is compatible with Spark 0.8.1 and later. Compile it with sbt assembly; before doing so, set the correct Hadoop version in build.sbt.

How to Run

  1. Run graph_generate.py to generate a random adjacency matrix. Set GRAPH_SIZE (number of vertices) and EDGE_SIZE (number of edges) to control the size and density of the matrix. The script serializes the matrix to a file.
  2. Generate the initial similarity matrix by running ./run simrank.SimRankDataPrepare. Note that the two parameters graphASize and graphBSize, which specify the number of vertices in the two sub-graphs of the bipartite graph, must match the graph generated in step 1.
  3. Configure config/config.properties, then run ./run simrank.SimRankImpl.
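
To make step 1 concrete, the following is a rough sketch of how a random adjacency matrix with a fixed number of vertices and edges could be generated and written to disk. It is not the actual graph_generate.py: only the names GRAPH_SIZE and EDGE_SIZE come from this README, while the helper functions, the exclusion of self-loops, and the whitespace-separated text format are assumptions. The real script may use a different layout, in particular a bipartite one.

```python
import random

GRAPH_SIZE = 6  # number of vertices (name from the README)
EDGE_SIZE = 8   # number of edges (name from the README)


def generate_adjacency(graph_size, edge_size, seed=None):
    """Return a graph_size x graph_size 0/1 matrix with exactly edge_size edges."""
    max_edges = graph_size * (graph_size - 1)
    if edge_size > max_edges:
        raise ValueError("edge_size exceeds the number of possible edges")
    rng = random.Random(seed)
    # Every ordered (source, destination) pair except self-loops (an assumption).
    candidates = [
        (src, dst)
        for src in range(graph_size)
        for dst in range(graph_size)
        if src != dst
    ]
    matrix = [[0] * graph_size for _ in range(graph_size)]
    # Sampling without replacement guarantees the edges are distinct.
    for src, dst in rng.sample(candidates, edge_size):
        matrix[src][dst] = 1
    return matrix


def save_matrix(matrix, path):
    """Write the matrix as one whitespace-separated row per line (hypothetical format)."""
    with open(path, "w") as handle:
        for row in matrix:
            handle.write(" ".join(str(value) for value in row) + "\n")


matrix = generate_adjacency(GRAPH_SIZE, EDGE_SIZE, seed=42)
```

Calling `save_matrix(matrix, "graph.txt")` would then write the matrix to a file; the file name here is only a placeholder.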

Notes


This implementation is open sourced under Apache License Version 2.0.