Open szarnyasg opened 3 years ago
@szarnyasg I have a question - the main part of the datagen itself would still scale to SF1000+, correct? Other than param generation and associated breaking changes described above?
@arvindshmicrosoft unfortunately, it doesn't – I tried running it for SF3000 (with a numPersons value that would yield ~3 TB of data), but it crashed with an NPE.
Hello, I would like to know if it is currently possible to generate the SF1000+ Interactive v1 mode dataset. I noticed that the Spark version no longer supports Interactive mode. Could you please provide guidance on how to proceed with generating this dataset?
Thank you!
The SNB Interactive benchmark is currently limited to:
- Data sets up to SF1000
- Append-only workloads without deletions
These could be amended by backporting the improvements made for the BI workload.
Larger data sets
Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE) and the old parameter generator has scalability issues (it's a single-threaded Python 2 script – for SF1000, it already requires about a day to finish). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:
- The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
- The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (Factor generation for Interactive ldbc_snb_datagen_spark#219).
- The inserts generated by the new data generator (e.g. inserts/dynamic/Person/part-*.csv) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". (The latter seems simpler and mostly doable in SQL; a sketch follows below.)
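To illustrate the converter idea, here is a minimal sketch of how such a conversion could be prototyped with the DuckDB CLI from the shell. The column names (creationDate, id, firstName, ...), the millisecond-epoch conversion, and the event-type code 1 for "add person" are illustrative assumptions, not the actual schemas of the Spark Datagen insert files or of the driver's update streams.
duckdb <<'SQL'
-- Hypothetical sketch of an "insert file to update stream converter".
-- Column names and the event-type code are assumptions, not the real schemas.
COPY (
    SELECT
        epoch_ms(creationDate) AS scheduled_time,   -- when the driver should issue the insert
        epoch_ms(creationDate) AS dependency_time,  -- placeholder; real streams track dependent events
        1                      AS event_type,       -- assumed code for "add person"
        id, firstName, lastName, gender
    FROM read_csv_auto('inserts/dynamic/Person/part-*.csv', delim='|', header=true)
    ORDER BY scheduled_time
) TO 'updateStream_0_0_person.csv' (DELIMITER '|', HEADER false);
SQL
The same pattern would have to be repeated per entity type, and the real dependency times would have to be derived from the dependent entities rather than taken from the row itself.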
Introducing deletions
Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are (1) figuring out the format – maybe the deletes/dynamic/Person/part-*.csv files work well, maybe an updateStream-like delete stream would work better, (2) integrating them into the driver, (3) tuning their ratio, and (4) determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput).
Timeline
These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.
Hi @chinyajie,
Thanks for reaching out. Indeed, the Hadoop-based generator is limited to SF1000, and Interactive v2 is still under development. We can attempt to increase the range of supported data sets for Interactive v1 to SF3000 if there is interest in getting audits for these data sets. Are you interested in obtaining audited results for Interactive v1? If so, please reach out to @.***
Gabor
I looked into this in more detail. The top comment in this issue states that the Hadoop Datagen throws a NullPointerException (NPE) for data sets larger than SF1000. While this is true, a NullPointerException in the Hadoop Datagen can be a symptom of running out of memory, so using a machine/cluster with more memory resolves the problem.
sudo apt install zip unzip maven silversearcher-ag python2 fzf wget
curl -s "https://get.sdkman.io" | bash
sdk install java 8.0.422.fx-zulu
cd ldbc_snb_datagen_hadoop/
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
export JAVA_HOME=${SDKMAN_CANDIDATES_DIR}/java/${CURRENT}  # point JAVA_HOME at the SDKMAN-installed JDK 8
export HADOOP_HOME=`pwd`/hadoop-3.2.1
export HADOOP_CLIENT_OPTS="-Xmx1530G"  # give the Hadoop client JVM almost all of the machine's 1.5 TB RAM
/usr/bin/time -v ./run.sh  # time -v also reports peak memory use (maximum resident set size)
I generated a data set with the following settings:
- Instance: r6a.48xlarge (1.5 TB RAM)
- Serializer: CsvBasic
- numPersons value: 9800000
- This setup used EBS storage (with no instance-attached storage), therefore changing the location of the Hadoop temporary directory was not required (it is required when instance-attached storage is available).
Results:
- The generation took ~80 hours (!).
- According to /usr/bin/time -v, the maximum memory used was 1186 GB.
- The runtime does not include parameter generation, which crashed and needs to be performed separately (likely with a portion of it rewritten in DuckDB).
- The peak disk usage was about 6.5 TB (!), more than twice the scale factor's size.
The generated initial data set was 2.8 TB, which is too small (especially because scale factors are determined using the CsvMergeForeign serializer, which results in more compact files).
$ du -hd0 social_network/updateStream*.csv
745G social_network/updateStream_0_0_forum.csv
302M social_network/updateStream_0_0_person.csv
$ du -hd0 social_network/{static,dynamic}
2.2M social_network/static
2.8T social_network/dynamic
In any case, this experiment proves that the Datagen can generate SF3000 data sets; it just needs a lot of memory to do so, and some fine-tuning to get the size right.
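As a rough starting point for that fine-tuning, one could scale numPersons linearly with the size shortfall. This is only a back-of-the-envelope guess: it assumes the output size grows roughly linearly with numPersons and ignores the difference between the CsvBasic and CsvMergeForeign serializers.
# Hypothetical next guess for numPersons, targeting ~3.0 TB instead of the measured 2.8 TB.
awk 'BEGIN { printf "%.0f\n", 9800000 * 3.0 / 2.8 }'   # prints 10500000
The resulting data set size would still have to be validated with the CsvMergeForeign serializer, since that is the format used to define the scale factors.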
I apologize for the delayed response. Thank you for sharing your detailed experience and configuration for generating SF3000 datasets. I conducted a similar test and successfully generated SF3000 scale datasets by increasing the swap memory. This has been very helpful for my research.
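For anyone reproducing this on a machine with less RAM, a common way to add swap space on Linux looks like the sketch below; the 512 GiB size and the /swapfile path are arbitrary examples, not values tuned for Datagen.
# Hypothetical example: create and enable a 512 GiB swap file (size and path are arbitrary).
sudo fallocate -l 512G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify that the new swap space is active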