holderlb / graph-stream-generator

MIT License

Graph Stream Generator

The Graph Stream Generator (GSG) generates multiple streams of graph vertices and edges according to subgraph patterns that can be partitioned across different streams.

Authors: Dr. Larry Holder, School of Electrical Engineering and Computer Science, Washington State University, email: holder@wsu.edu. Sumit Purohit, Pacific Northwest National Laboratory, email: sumit.purohit@pnnl.gov.

Support: This material is based upon work supported by the National Science Foundation under Grant No. 1646640.

Running

To run GSG and generate output files:

python3 gsg.py <inputFile.json>

To convert GSG output graphs to GraphML:

python3 gExportGraphML.py <graphFile.json>

Input File

The input file is in JSON format. An example is in the file input.json. There are a few required global parameters to the Graph Stream Generator (GSG).

Pattern

A pattern describes a subgraph (a set of vertices and edges) that is probabilistically added to the graph streams. Vertices can be new or drawn from earlier in the stream. Each edge is assigned to a specific stream and scheduled at an offset, drawn from a uniform range, from the initial time unit when the pattern is chosen. Each vertex and edge can have a set of attribute-value pairs and an optional type. Specifically, a pattern consists of the following properties (all required):

Vertex

A vertex appearing in the vertices array of a pattern consists of the following properties in a JSON object.

A new vertex is written to a stream just before the earliest edge that involves this vertex is written to the stream. If edges assigned to different streams connect to the same vertex, then that same vertex is written to each stream.
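The ordering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GSG's implementation: the tuple layout `(time, stream, source, target)` and the function name `emission_order` are hypothetical, and every vertex is treated as new.

```python
from collections import defaultdict

def emission_order(scheduled_edges):
    """Return, per stream, the order in which vertices and edges are written.

    scheduled_edges: iterable of (time, stream, source, target) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    per_stream = defaultdict(list)
    written = defaultdict(set)  # vertex ids already written to each stream
    # Process edges in time order; sorted() is stable, so ties keep input order.
    for time, stream, source, target in sorted(scheduled_edges, key=lambda e: e[0]):
        for vertex in (source, target):
            if vertex not in written[stream]:
                # First edge on this stream that touches the vertex:
                # write the vertex just before that edge.
                written[stream].add(vertex)
                per_stream[stream].append(("vertex", vertex, time))
        per_stream[stream].append(("edge", (source, target), time))
    return dict(per_stream)

# Vertex "b" is used by edges on two different streams.
edges = [(5, 1, "a", "b"), (3, 2, "b", "c")]
order = emission_order(edges)
```

In this example vertex "b" ends up in both streams, each time immediately ahead of the first edge on that stream that uses it.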

Edge

An edge appearing in the edges array of a pattern consists of the following properties in a JSON object.

Each edge in a pattern can be assigned to a different stream, except that edges connected to a non-new vertex must all be assigned to the same stream. In this way, a pattern can be divided across multiple streams. This is one of the main goals of GSG: to provide test data for checking whether a graph mining system can find the full pattern by analyzing (or fusing) the individual streams. The streams can easily be fused into one large graph, using the vertex ids as anchors: two vertices from different streams that have the same id represent the same vertex (or entity).
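Fusing streams on vertex ids might look like the following sketch. It assumes each stream has already been loaded as a list of instances shaped like `{"vertex": {...}}` or `{"edge": {...}}`, and that a vertex carries its id under a property named `"id"`; that property name, the `"source"` and `"target"` names in the sample data, and the function name `fuse_streams` are assumptions for illustration.

```python
def fuse_streams(streams):
    """Fuse several streams into one graph, using vertex ids as anchors.

    streams: list of streams, each a list of instance dicts. The vertex
    property name "id" is an assumption of this sketch.
    """
    vertices = {}  # vertex id -> vertex properties (one entry per entity)
    edges = []     # all edges from all streams
    for stream in streams:
        for instance in stream:
            if "vertex" in instance:
                vertex = instance["vertex"]
                # The same id in two streams means the same entity: keep one copy.
                vertices.setdefault(vertex["id"], vertex)
            elif "edge" in instance:
                edges.append(instance["edge"])
    return vertices, edges

# Two hypothetical streams that share vertex "2".
s1 = [{"vertex": {"id": "1"}}, {"vertex": {"id": "2"}},
      {"edge": {"id": "e1", "source": "1", "target": "2"}}]
s2 = [{"vertex": {"id": "2"}}, {"vertex": {"id": "3"}},
      {"edge": {"id": "e2", "source": "2", "target": "3"}}]
fused_vertices, fused_edges = fuse_streams([s1, s2])
```

The fused graph here has three vertices and two edges, so a path from "1" to "3" that neither stream contains on its own becomes visible.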

In the event that vertices and edges are scheduled to appear beyond the duration of the stream generation, stream generation will continue until all scheduled vertices and edges are written to streams. No new patterns are triggered beyond the duration of the stream generation.
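One plausible reading of this behavior is the loop sketched below. It is not taken from `gsg.py`: the dictionary keys (`"id"`, `"probability"`, `"edges"`, `"minOffset"`, `"maxOffset"`), the function name `generate`, starting time at 0, and testing every pattern once per time unit are all assumptions made for the example. Offsets are assumed to be non-negative.

```python
import random

def generate(patterns, duration, seed=0):
    """Sketch of a generation loop; returns (time, pattern id, edge id) tuples.

    patterns: list of dicts with hypothetical keys "id", "probability" and
    "edges"; each edge is a dict with "id", "minOffset" and "maxOffset"
    (non-negative integers).
    """
    rng = random.Random(seed)
    pending = {}  # time unit -> list of (pattern id, edge id) scheduled for it
    written = []
    t = 0
    # Keep going past `duration` while scheduled items remain.
    while t < duration or pending:
        if t < duration:  # new patterns are triggered only within the duration
            for pattern in patterns:
                if rng.random() < pattern["probability"]:
                    for edge in pattern["edges"]:
                        offset = rng.randint(edge["minOffset"], edge["maxOffset"])
                        pending.setdefault(t + offset, []).append(
                            (pattern["id"], edge["id"]))
        # Write everything that is due at this time unit.
        for item in pending.pop(t, []):
            written.append((t, *item))
        t += 1
    return written

# A pattern that always fires and schedules its edge 3 time units later.
demo = [{"id": "p", "probability": 1.0,
         "edges": [{"id": "e", "minOffset": 3, "maxOffset": 3}]}]
result = generate(demo, duration=2)
```

With a duration of 2, the pattern is triggered at time units 0 and 1 only, yet its edges are still written at time units 3 and 4.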

Output Stream Files

A file named outputFilePrefix-sN is created for each stream 1 to N. Each stream file contains a JSON array of vertex and edge instances, as described below.
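A stream file can be read with Python's standard `json` module. The sketch below relies only on the structure described in this document (a JSON array whose elements are objects named "vertex" or "edge"); the function name `split_instances` and the property names in the sample text are assumptions.

```python
import json

def split_instances(stream_text):
    """Split a stream file's JSON array into vertex and edge instances."""
    vertices, edges = [], []
    for instance in json.loads(stream_text):
        if "vertex" in instance:
            vertices.append(instance["vertex"])
        elif "edge" in instance:
            edges.append(instance["edge"])
    return vertices, edges

# Hypothetical stream content; the real property names may differ.
sample = ('[{"vertex": {"id": "1"}}, {"vertex": {"id": "2"}}, '
          '{"edge": {"id": "e1", "source": "1", "target": "2"}}]')
vertices, edges = split_instances(sample)
```

To process a generated file, read its contents first (for example with `open(path).read()`, where `path` is the stream file produced for your `outputFilePrefix`) and pass the text to `split_instances`.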

Vertex Instance

A vertex instance is a JSON object with the name "vertex" whose value is a JSON object with the following properties.

Edge Instance

An edge instance is a JSON object with the name "edge" whose value is a JSON object with the following properties.

Output Instances File

A single file named outputFilePrefix-insts is created that contains a JSON array of pattern instances for all tracked patterns. Each pattern instance is a JSON object with the following properties.

The instances file provides the ground truth of all the full patterns that appear across all the graph streams.
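As a rough illustration of how this ground truth could be used, the sketch below scores the pattern instances reported by a graph mining system against the true instances. Representing each instance as a frozenset of edge ids is an assumption of the example, not the format of the instances file, and the function name `score` is hypothetical.

```python
def score(found, truth):
    """Precision and recall of found pattern instances against the ground truth.

    Each instance is a frozenset of edge ids (an assumption of this sketch).
    """
    found, truth = set(found), set(truth)
    hits = len(found & truth)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

truth = [frozenset({"e1", "e2"}), frozenset({"e3", "e4"})]
found = [frozenset({"e1", "e2"}), frozenset({"e5", "e6"})]
precision, recall = score(found, truth)
```

Here one of the two reported instances is correct and one of the two true instances is recovered, so both precision and recall are 0.5.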

Questions?

Contact: Dr. Larry Holder, School of Electrical Engineering and Computer Science, Washington State University, email: holder@wsu.edu.