Open szarnyasg opened 3 years ago
@szarnyasg I have a question - the main part of the datagen itself would still scale to SF1000+, correct? Other than param generation and associated breaking changes described above?
@arvindshmicrosoft unfortunately, it doesn't – I tried running it for SF3000 (with a numPersons value that would yield ~3 TB of data), but it crashed with an NPE.
Hello, I would like to know if it is currently possible to generate the SF1000+ Interactive v1 mode dataset. I noticed that the Spark version no longer supports Interactive mode. Could you please provide guidance on how to proceed with generating this dataset?
Thank you!
The SNB Interactive benchmark is currently limited to:
- Data sets up to SF1000
- Append-only workloads without deletions
These could be amended by backporting the improvements made for the BI workload.
Larger data sets
Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE) and the old parameter generator has scalability issues (it's a single-threaded Python 2 script – for SF1000, it already requires about a day to finish). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:
- The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
- The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (Factor generation for Interactive ldbc_snb_datagen_spark#219).
- The inserts generated by the new data generator (e.g. inserts/dynamic/Person/part-*.csv) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". (The latter seems simpler and mostly doable in SQL; a sketch follows below.)
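To illustrate the converter idea, here is a minimal sketch of how such a conversion could be prototyped with the DuckDB CLI from the shell. The column names (creationDate, id, firstName, ...), the millisecond-epoch conversion, and the event-type code 1 for "add person" are illustrative assumptions, not the actual schemas of the Spark Datagen insert files or of the driver's update streams.
duckdb <<'SQL'
-- Hypothetical sketch of an "insert file to update stream converter".
-- Column names and the event-type code are assumptions, not the real schemas.
COPY (
    SELECT
        epoch_ms(creationDate) AS scheduled_time,   -- when the driver should issue the insert
        epoch_ms(creationDate) AS dependency_time,  -- placeholder; real streams track dependent events
        1                      AS event_type,       -- assumed code for "add person"
        id, firstName, lastName, gender
    FROM read_csv_auto('inserts/dynamic/Person/part-*.csv', delim='|', header=true)
    ORDER BY scheduled_time
) TO 'updateStream_0_0_person.csv' (DELIMITER '|', HEADER false);
SQL
The same pattern would have to be repeated per entity type, and the real dependency times would have to be derived from the dependent entities rather than taken from the row itself.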
Introducing deletions
Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are (1) figuring out the format – maybe the deletes/dynamic/Person/part-*.csv files work well, maybe an updateStream-like delete stream would work better, (2) integrating them into the driver, (3) tuning their ratio, and (4) determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput).
Timeline
These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.
Hi @chinyajie,
Thanks for reaching out. Indeed, the Hadoop-based generator is limited to SF1000, and Interactive v2 is still under development. We can attempt to increase the range of supported data sets for Interactive v1 to SF3000 if there is interest in getting audits for these data sets. Are you interested in obtaining audited results for Interactive v1? If so, please reach out to @.***
Gabor
I looked into this in more detail. The top comment in this issue states that the Hadoop Datagen throws a NullPointerException (NPE) for data sets larger than SF1000. While this is true, a NullPointerException in the Hadoop Datagen can be a symptom of running out of memory, so using a machine/cluster with more memory resolves the problem.
sudo apt install zip unzip maven silversearcher-ag python2 fzf wget
curl -s "https://get.sdkman.io" | bash
sdk install java 8.0.422.fx-zulu
cd ldbc_snb_datagen_hadoop/
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
export JAVA_HOME=${SDKMAN_CANDIDATES_DIR}/java/${CURRENT}  # point JAVA_HOME at the SDKMAN-installed JDK 8
export HADOOP_HOME=`pwd`/hadoop-3.2.1
export HADOOP_CLIENT_OPTS="-Xmx1530G"  # give the Hadoop client JVM almost all of the machine's 1.5 TB RAM
/usr/bin/time -v ./run.sh  # time -v also reports peak memory use (maximum resident set size)
I generated a data set with the following settings:
- Instance: r6a.48xlarge (1.5 TB RAM)
- Serializer: CsvBasic
- numPersons value: 9800000
- This setup used EBS storage (with no instance-attached storage), therefore changing the location of the Hadoop temporary directory was not required (it is required when instance-attached storage is available).
Results:
- The generation took ~80 hours (!).
- According to /usr/bin/time -v, the maximum memory used was 1186 GB.
- The runtime does not include parameter generation, which crashed and needs to be performed separately (likely with a portion of it rewritten in DuckDB).
- The peak disk usage was about 6.5 TB (!), more than twice the scale factor's size.
The generated initial data set was 2.8 TB, which is too small (especially because scale factors are determined using the CsvMergeForeign serializer, which results in more compact files).
$ du -hd0 social_network/updateStream*.csv
745G social_network/updateStream_0_0_forum.csv
302M social_network/updateStream_0_0_person.csv
$ du -hd0 social_network/{static,dynamic}
2.2M social_network/static
2.8T social_network/dynamic
In any case, this experiment proves that the Datagen can generate SF3000 data sets; it just needs a lot of memory to do so, and some fine-tuning to get the size right.
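As a rough starting point for that fine-tuning, one could scale numPersons linearly with the size shortfall. This is only a back-of-the-envelope guess: it assumes the output size grows roughly linearly with numPersons and ignores the difference between the CsvBasic and CsvMergeForeign serializers.
# Hypothetical next guess for numPersons, targeting ~3.0 TB instead of the measured 2.8 TB.
awk 'BEGIN { printf "%.0f\n", 9800000 * 3.0 / 2.8 }'   # prints 10500000
The resulting data set size would still have to be validated with the CsvMergeForeign serializer, since that is the format used to define the scale factors.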
I apologize for the delayed response. Thank you for sharing your detailed experience and configuration for generating SF3000 datasets. I conducted a similar test and successfully generated SF3000 scale datasets by increasing the swap memory. This has been very helpful for my research.
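For anyone reproducing this on a machine with less RAM, a common way to add swap space on Linux looks like the sketch below; the 512 GiB size and the /swapfile path are arbitrary examples, not values tuned for Datagen.
# Hypothetical example: create and enable a 512 GiB swap file (size and path are arbitrary).
sudo fallocate -l 512G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify that the new swap space is active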