ldbc / ldbc_snb_datagen_spark

Synthetic graph generator for the LDBC Social Network Benchmark, running on Spark
https://ldbcouncil.org/benchmarks/snb
Apache License 2.0

LDBC SNB Datagen (Spark-based)

The LDBC SNB Data Generator (Datagen) produces the datasets for the LDBC Social Network Benchmark's workloads. The generator is designed to produce directed labelled graphs that mimic the characteristics of graphs found in real data. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official LDBC SNB specification document.

If you wish to cite the LDBC SNB, please refer to the documentation repository.

Warning: there are two different versions of the Datagen:

For each commit on the main branch, the CI deploys freshly generated small data sets.

Quick start

Build the JAR

To assemble the JAR file with SBT, run:

sbt assembly

Install Python tools

Some of the build utilities are written in Python. To use them, you have to create a Python virtual environment and install the dependencies.

E.g. with pyenv and pyenv-virtualenv:

pyenv install 3.7.13
pyenv virtualenv 3.7.13 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
pip install -U pip
pip install ./tools

If the environment already exists, activate it with:

pyenv activate

Running locally

The ./tools/run.py script is intended for local runs. To use it, download and extract Spark as follows.

Spark 3.2.x

Spark 3.2.x is the recommended runtime. The rest of the instructions assume Spark 3.2.x.

To place Spark under /opt/:

scripts/get-spark-to-opt.sh
export SPARK_HOME="/opt/spark-3.2.2-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin":"${PATH}"

To place it under ${HOME}/:

scripts/get-spark-to-home.sh
export SPARK_HOME="${HOME}/spark-3.2.2-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin":"${PATH}"
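
Before moving on, it can help to confirm that SPARK_HOME points at an extracted Spark distribution. The sketch below is not part of the repository: check_spark_home is a hypothetical helper that only tests for an executable bin/spark-submit, which Spark binary distributions ship.

```shell
# Hypothetical helper (not part of the Datagen repository): succeed only if
# the given directory contains an executable bin/spark-submit.
check_spark_home() {
  if [ -x "$1/bin/spark-submit" ]; then
    printf 'Spark found in %s\n' "$1"
  else
    printf 'No executable bin/spark-submit under %s\n' "$1" >&2
    return 1
  fi
}
```

After exporting the variables, check_spark_home "${SPARK_HOME}" should report the directory. A non-zero exit status suggests that the download script did not run or that the path is wrong.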

Both Java 8 and Java 11 are supported, but Java 17 is not (Spark 3.2.2 will fail, since it uses internal Java APIs and does not set the permissions appropriately).
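
Since only some Java versions are supported, checking the major version up front can save a failed run. The following is a minimal sketch with hypothetical helper names (java_major, java_supported); it assumes version strings in the usual 1.8.0_342 or 11.0.16 style and is not part of the repository.

```shell
# Print the major Java version for a version string.
java_major() {
  case "$1" in
    1.*)
      # Legacy scheme: 1.8.0_342 -> 8
      rest=${1#1.}
      printf '%s\n' "${rest%%.*}"
      ;;
    *)
      # Modern scheme: 11.0.16 -> 11 (keep the leading digits only)
      printf '%s\n' "${1%%[!0-9]*}"
      ;;
  esac
}

# Succeed only for the versions this README lists as supported (8 and 11).
java_supported() {
  major=$(java_major "$1")
  [ "$major" = 8 ] || [ "$major" = 11 ]
}

java_major 1.8.0_342   # prints 8
java_major 11.0.16     # prints 11
java_major 17.0.3      # prints 17
```

The version string itself can be read from the first line of java -version, which the JVM writes to standard error.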

Building the project

Run:

scripts/build.sh

Running the generator

Once Spark is in place and the JAR file is built, run the generator as follows:

export PLATFORM_VERSION=$(sbt -batch -error 'print platformVersion')
export DATAGEN_VERSION=$(sbt -batch -error 'print version')
export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')
./tools/run.py <runtime configuration arguments> -- <generator configuration arguments>
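
If one of the sbt calls fails, the corresponding variable ends up empty, which can lead to confusing errors later. The sketch below is a hypothetical pre-flight check, not part of the repository: check_datagen_env verifies that the three variables exported above are non-empty and that LDBC_SNB_DATAGEN_JAR names an existing file.

```shell
# Hypothetical pre-flight check for the variables exported above.
check_datagen_env() {
  status=0
  for name in PLATFORM_VERSION DATAGEN_VERSION LDBC_SNB_DATAGEN_JAR; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s is empty or unset\n' "$name" >&2
      status=1
    fi
  done
  # The JAR path must also point at a regular file.
  if [ -n "${LDBC_SNB_DATAGEN_JAR:-}" ] && [ ! -f "$LDBC_SNB_DATAGEN_JAR" ]; then
    printf 'JAR not found: %s\n' "$LDBC_SNB_DATAGEN_JAR" >&2
    status=1
  fi
  return "$status"
}
```

Calling check_datagen_env before ./tools/run.py returns a non-zero status and names the offending variable when something is missing.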

Runtime configuration arguments

The runtime configuration arguments determine the amount of memory, the number of threads, and the degree of parallelism. For a list of arguments, see:

./tools/run.py --help

To generate a single part-* file, reduce the parallelism (number of Spark partitions) to 1.

./tools/run.py --parallelism 1 -- --format csv --scale-factor 0.003 --mode bi
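
To confirm the effect of --parallelism 1, the part files in one output directory can be counted. This is a minimal sketch: count_part_files is a hypothetical helper, and it assumes you pass the directory that directly holds the part-* files (the output layout is not described here, so the path is yours to fill in).

```shell
# Hypothetical helper: count regular files named part-* directly inside a directory.
count_part_files() {
  count=0
  for f in "$1"/part-*; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$f" ]; then
      count=$((count + 1))
    fi
  done
  printf '%s\n' "$count"
}
```

With a parallelism of 1, the helper should print 1 for such a directory; a larger number indicates that the data was written from several partitions.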

Generator configuration arguments

The generator configuration arguments set the output directory, output format, layout, etc.

To get a complete list of the arguments, pass --help to the JAR file:

./tools/run.py -- --help
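
The -- separator decides which program sees an option: --help before it is handled by run.py, while --help after it is passed to the JAR. The following sketch only illustrates that convention; it is not how run.py is implemented, and show_arg_split is a hypothetical name.

```shell
# Illustration only: label each argument with the program that would receive it.
show_arg_split() {
  target=run.py
  for arg in "$@"; do
    if [ "$target" = run.py ] && [ "$arg" = "--" ]; then
      target=generator   # everything after the first "--" goes to the JAR
      continue
    fi
    printf '%s <- %s\n' "$target" "$arg"
  done
}

show_arg_split --parallelism 1 -- --format csv --mode bi
```

For the call above, the first two arguments are labelled run.py and the remaining four are labelled generator.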

To change the Spark configuration directory, adjust the SPARK_CONF_DIR environment variable.

A complex example:

export SPARK_CONF_DIR=./conf
./tools/run.py --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode bi --scale-factor 0.003
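
The backslash in the command above keeps the space inside the timestamp pattern from splitting the option list into two shell arguments. The sketch below demonstrates this with two hypothetical helpers, count_args and list_options. The second one assumes that the option list is a comma-separated series of key=value pairs, as the example suggests; it does not reproduce the generator's own parsing.

```shell
# Print how many arguments the shell actually passed.
count_args() { printf '%s\n' "$#"; }

# Escaped space: the whole option list stays one argument (prints 2).
count_args --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y

# Unescaped space: the option list is split in two (prints 3).
count_args --format-options timestampFormat=MM/dd/y HH:mm:ss,dateFormat=MM/dd/y

# Quoting is an equivalent way to keep the space (prints 2).
count_args --format-options "timestampFormat=MM/dd/y HH:mm:ss,dateFormat=MM/dd/y"

# Assumed structure: one key=value pair per comma-separated entry.
list_options() { printf '%s\n' "$1" | tr ',' '\n'; }
list_options "timestampFormat=MM/dd/y HH:mm:ss,dateFormat=MM/dd/y"
```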

It is also possible to pass a parameter file:

./tools/run.py -- --format csv --param-file params.ini

Docker images

SNB Datagen images are available via Docker Hub. The image tags follow the pattern ${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}, e.g. ldbc/datagen-standalone:0.5.0-2.12_spark3.2.
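
The ${DATAGEN_VERSION/+/-} expansion in this pattern is Bash syntax that replaces the first + in the version with -, which matters because Docker tags may not contain +. As a sketch, the hypothetical helper image_tag below builds the same tag with sed, so it also works in shells without that expansion. The version 0.5.0+23-1d60a657 is a made-up example, not a real release.

```shell
# Hypothetical helper: build the image tag from a Datagen version and a platform version.
image_tag() {
  # Same effect as the Bash expansion ${1/+/-}: replace the first '+' with '-'.
  version=$(printf '%s' "$1" | sed 's/+/-/')
  printf '%s-%s\n' "$version" "$2"
}

image_tag "0.5.0" "2.12_spark3.2"              # prints 0.5.0-2.12_spark3.2
image_tag "0.5.0+23-1d60a657" "2.12_spark3.2"  # prints 0.5.0-23-1d60a657-2.12_spark3.2
```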

When building images, ensure that you use BuildKit.

Standalone Docker image

The standalone image bundles Spark with the JAR and Python helpers, so you can run a workload in a container much like a local run, for example:

export SF=0.003
mkdir -p out_sf${SF}_bi   # create output directory
docker run \
    --mount type=bind,source="$(pwd)"/out_sf${SF}_bi,target=/out \
    --mount type=bind,source="$(pwd)"/conf,target=/conf,readonly \
    -e SPARK_CONF_DIR=/conf \
    ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} \
    --parallelism 1 \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --generate-factors

The standalone Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

export PLATFORM_VERSION=$(sbt -batch -error 'print platformVersion')
export DATAGEN_VERSION=$(sbt -batch -error 'print version')
export DOCKER_BUILDKIT=1
docker build . --target=standalone -t ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}

JAR-only image

The ldbc/datagen-jar image contains the assembly JAR, so it can be bundled in your custom container:

FROM my-spark-image
ARG VERSION
COPY --from=ldbc/datagen-jar:${VERSION} /jar /lib/ldbc-datagen.jar

The JAR-only Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

docker build . --target=jar -t ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}

Pushing to Docker Hub

To release a new snapshot version on Docker Hub, run:

docker tag ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} ldbc/datagen-jar:latest
docker push ldbc/datagen-jar:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}
docker push ldbc/datagen-jar:latest
docker tag ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION} ldbc/datagen-standalone:latest
docker push ldbc/datagen-standalone:${DATAGEN_VERSION/+/-}-${PLATFORM_VERSION}
docker push ldbc/datagen-standalone:latest

To release a new stable version, create a new Git tag (e.g. by creating a new release on GitHub), then build the Docker image and push it.

Elastic MapReduce

We provide scripts to run Datagen on AWS EMR. See the README in the ./tools/emr directory for details.

Graph schema

The graph schema is as follows:

Troubleshooting