Closed: cpuwar closed this issue 1 year ago.
Hi @cpuwar, thanks for opening this issue!
Datagen's interactive mode is abandoned; we are moving to a model where we take the generated BI data sets and post-process them. This is done by this script: https://github.com/ldbc/ldbc_snb_interactive_impls/blob/main/scripts/generate-all.sh
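For reference, a minimal sketch of invoking it (paths are placeholders; LDBC_SNB_DATA_ROOT_DIRECTORY is the variable the script reports in its output, and any other required variables should be checked in the script itself):
export LDBC_SNB_DATA_ROOT_DIRECTORY=/path/to/datagen/out-sf1/   # the Datagen output directory
cd ldbc_snb_interactive_impls                                    # run from the repository root (assumed)
scripts/generate-all.sh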
Later this spring we'll 1) deprecate the interactive mode in Datagen and 2) publish the final Interactive data sets. These are high on our roadmap, but we have not yet gotten to them.
Regarding your experiments, it may be worth considering SNB Interactive v1 (the v1-dev branch). This workload is limited to scale factors <= 1000 and lacks delete operations (its updates consist only of inserts). However, it is a mature codebase, and pre-generated data sets are available for Cypher, Postgres, and other systems. Moreover, audits can be commissioned for v1 implementations. If you're interested in an audit, please send me an email (gabor.szarnyas ldbcouncil org).
Gabor
Hi @szarnyasg, thank you for the kind explanation. Now I understand the reason clearly.
Blessings, Dongho.
Later this spring we'll 1) deprecate the interactive mode in Datagen and 2) publish the final Interactive data sets. These are high on our roadmap, but we have not yet gotten to them.
I performed 1) via this PR: https://github.com/ldbc/ldbc_snb_datagen_spark/pull/427
Thank you for the update reminder.
One more question: currently, I am just testing each query in the queries folder. When I ran interactive-complex-1.cypher on the graph in the Neo4j container (created by https://github.com/ldbc/ldbc_snb_interactive_impls/tree/main/cypher), it failed because there is no Person node with an id of 4398046511333, which is the parameter set in the query.
I searched for that id in several scale factor 1 outputs (the Hadoop output, the Datagen Docker output) and even in several pre-generated data sets that I downloaded.
I could change the id to one that exists in the graph on the container to make the query work.
I am just wondering whether I made a mistake in getting the data set, or whether some queries in the repo have not yet been updated to work on the container, since the current repo is still under development as you mentioned.
The IDs in the query headers are placeholders -- they change from scale factor to scale factor. You need to generate them for each scale factor using the parameter generator (which is invoked by the generate-all.sh script).
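As a quick sanity check (a sketch; the address and authentication flags depend on how your Neo4j container is configured), you can verify whether a given Person id exists in the loaded graph:
echo 'MATCH (p:Person {id: 4398046511333}) RETURN p.id;' | cypher-shell -a bolt://localhost:7687
# run it inside the container via docker exec if cypher-shell is not installed on the host;
# add -u/-p if authentication is enabled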
I finished running generate-all.sh. It returned an error at the end, but it looks like it generated the data set. Can I now start the 2nd step (loading the data) in the following link? https://github.com/ldbc/ldbc_snb_interactive_impls/tree/main/cypher
##### Generate Update Streams #####
${LDBC_SNB_DATA_ROOT_DIRECTORY}: /mnt/nvme/ldbc_snb_datagen_spark/out-sf1/
Traceback (most recent call last):
File "/mnt/nvme/ldbc_snb_interactive_driver/scripts/./create_update_streams.py", line 9, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
The error prevented the updates from being correctly converted.
Please install the required dependencies, including pandas, using the driver's install-dependencies.sh script.
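A sketch of both options (the exact location of install-dependencies.sh inside the driver repository is an assumption; see the driver's README):
cd /mnt/nvme/ldbc_snb_interactive_driver
scripts/install-dependencies.sh        # assumed path; installs the Python dependencies used by the driver scripts
# or, minimally, install pandas for the Python interpreter you run the script with:
python3 -m pip install pandas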
This time it worked well. Can I now start the 2nd step (loading the data) in the following link? https://github.com/ldbc/ldbc_snb_interactive_impls/tree/main/cypher
============ Done ============
/mnt/nvme/ldbc_snb_datagen_spark/out-sf1//factors/parquet/raw/composite-merged-fk//people4Hops/part-00000-ce51ba15-ad35-48d7-b920-ab7fd81b634f-c000.snappy.parquet
/mnt/nvme/ldbc_snb_datagen_spark/out-sf1//factors/parquet/raw/composite-merged-fk//people4Hops/curated_paths.parquet
/mnt/nvme/ldbc_snb_datagen_spark/out-sf1//factors/parquet/raw/composite-merged-fk//people4Hops/_SUCCESS
Loading factor tables from path /mnt/nvme/ldbc_snb_datagen_spark/out-sf1//factors/parquet/raw/composite-merged-fk/*
============ Loading the factor tables ============
Loading personNumFriendTags
Loading personNumFriends
Loading personFirstNames
Loading creationDayAndLengthCategoryNumMessages
Loading companyNumEmployees
Loading personNumFriendComments
Loading cityNumPersons
Loading lengthNumMessages
Loading countryPairsNumFriends
Loading creationDayNumMessages
Loading personDays
Loading sameUniversityConnected
Loading personStudyAtUniversityDays
Loading cityPairsNumFriends
Loading creationDayAndTagNumMessages
Loading tagClassNumTags
Loading tagNumPersons
Loading personKnowsPersonDays
Loading personNumFriendOfFriendPosts
Loading countryNumMessages
Loading tagClassNumMessages
Loading personLikesNumMessages
Loading people2Hops
Loading people4Hops
Loading personDisjointEmployerPairs
Loading personNumFriendsOfFriendsOfFriends
Loading personKnowsPersonConnected
Loading tagNumMessages
Loading languageNumPosts
Loading personWorkAtCompanyDays
Loading messageIds
Loading creationDayAndTagClassNumMessages
Loading personNumFriendOfFriendCompanies
Loading countryNumPersons
Loading personNumFriendOfFriendForums
============ Factor Tables loaded ============
Threshold updated from 2 to 2 for table personNumFriendsOfFriendsOfFriends
Threshold updated from 5 to 5 for table personNumFriendsOfFriendsOfFriends
Threshold updated from 1000 to 1000 for table personNumFriendsOfFriendsOfFriends
Threshold updated from 500 to 500 for table creationDayNumMessages
Threshold updated from 5 to 5 for table personFirstNames
============ Generate 13b and 14b parameters ============
============ Done ============
============ Generate parameters Q1 - Q13 ============
Start time of initial_snapshot: 2012-11-29 02:52:48+00:00
End time of initial_snapshot: 2013-01-01 00:00:00+00:00
Time bucket size: 1 day, 0:00:00
============ Done ============
============ Export parameters to parquet files ============
============ Output parameters ============
- Q1 TO ../parameters/interactive-1.parquet
- Q2 TO ../parameters/interactive-2.parquet
- Q3a TO ../parameters/interactive-3a.parquet
- Q3b TO ../parameters/interactive-3b.parquet
- Q4 TO ../parameters/interactive-4.parquet
- Q5 TO ../parameters/interactive-5.parquet
- Q6 TO ../parameters/interactive-6.parquet
- Q7 TO ../parameters/interactive-7.parquet
- Q8 TO ../parameters/interactive-8.parquet
- Q9 TO ../parameters/interactive-9.parquet
- Q10 TO ../parameters/interactive-10.parquet
- Q11 TO ../parameters/interactive-11.parquet
- Q12 TO ../parameters/interactive-12.parquet
- Q13a TO ../parameters/interactive-13a.parquet
- Q13b TO ../parameters/interactive-13b.parquet
- Q14a TO ../parameters/interactive-14a.parquet
- Q14b TO ../parameters/interactive-14b.parquet
============ Parameters exported ============
============ Done ============
Total Parameter Generation Duration: 7.5540 seconds
============ Generate short read debug parameters ============
============ Generate Short Query Parameters ============
- QpersonId TO ../parameters/interactive-personId.parquet
- QmessageId TO ../parameters/interactive-messageId.parquet
============ Short Query Parameters exported ============
============ Done ============
Yes! You have all the necessary input data now.
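The next step would then be (a sketch; check the cypher/ README for any environment variables the loading scripts expect, e.g. the location of the generated CSVs):
cd /mnt/nvme/ldbc_snb_interactive_impls/cypher
scripts/load-in-one-step.sh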
Thanks a lot. You look like an angel God sent.
God bless you. Have a wonderful weekend! Dongho.
Sorry, but one more thing: I finished running scripts/load-in-one-step.sh successfully. However, when I ran driver/create-validation-parameters.sh after updating driver/create-validation-parameters.properties, I got an error like this:
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
21:55:15.676 [main] INFO Client - Client terminated unexpectedly
org.ldbcouncil.snb.driver.ClientException: Error loading Workload class: org.ldbcouncil.snb.driver.workloads.interactive.LdbcSnbInteractiveWorkload
at org.ldbcouncil.snb.driver.client.CreateValidationParamsMode.init(CreateValidationParamsMode.java:69)
at org.ldbcouncil.snb.driver.Client.main(Client.java:63)
Caused by: org.ldbcouncil.snb.driver.WorkloadException: Read operation parameters file does not exist: /mnt/nvme/ldbc_snb_interactive_impls/cypher/../parameters/interactive-3b.parquet
at org.ldbcouncil.snb.driver.workloads.interactive.LdbcSnbInteractiveWorkload.onInit(LdbcSnbInteractiveWorkload.java:148)
at org.ldbcouncil.snb.driver.Workload.init(Workload.java:56)
at org.ldbcouncil.snb.driver.client.CreateValidationParamsMode.init(CreateValidationParamsMode.java:65)
@cpuwar Interesting. The output in your previous comment shows that the script produced the file interactive-3b.parquet:
============ Output parameters ============
- Q1 TO ../parameters/interactive-1.parquet
- Q2 TO ../parameters/interactive-2.parquet
- Q3a TO ../parameters/interactive-3a.parquet
- Q3b TO ../parameters/interactive-3b.parquet
So... is this file indeed there under /mnt/nvme/ldbc_snb_interactive_impls/cypher/../parameters/?
I got nothing but README.md
dh@bigroom:/mnt/nvme/ldbc_snb_interactive_impls/cypher$ ls /mnt/nvme/ldbc_snb_interactive_impls/cypher/../parameters/
README.md
Locate the .parquet files generated by the generate-all.sh script, then copy them to the parameters/ directory.
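For example (a sketch; the source directory is an assumption -- adjust it to wherever the script actually wrote the .parquet files):
cd /mnt/nvme/ldbc_snb_interactive_impls/cypher
cp ../parameters-sf1/interactive-*.parquet ../parameters/   # source path assumed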
They are here:
dh@bigroom:/mnt/nvme/ldbc_snb_interactive_impls$ find . | grep parquet
./parameters-sf1/interactive-13b.parquet
./parameters-sf1/interactive-4.parquet
./parameters-sf1/interactive-11.parquet
./parameters-sf1/interactive-10.parquet
./parameters-sf1/interactive-9.parquet
./parameters-sf1/interactive-7.parquet
./parameters-sf1/interactive-3a.parquet
./parameters-sf1/interactive-2.parquet
./parameters-sf1/interactive-6.parquet
./parameters-sf1/interactive-14a.parquet
./parameters-sf1/interactive-14b.parquet
./parameters-sf1/interactive-12.parquet
./parameters-sf1/interactive-messageId.parquet
./parameters-sf1/interactive-5.parquet
./parameters-sf1/interactive-personId.parquet
./parameters-sf1/interactive-13a.parquet
./parameters-sf1/interactive-8.parquet
./parameters-sf1/interactive-1.parquet
./parameters-sf1/interactive-3b.parquet
./cypher/social-network-sf0.003-bi-composite-projected-fk-neo4j-compressed-epoch-millis/graphs/parquet
./cypher/social-network-sf0.003-bi-composite-projected-fk-neo4j-compressed-epoch-millis/graphs/parquet/raw
./cypher/social-network-sf0.003-bi-composite-projected-fk-neo4j-compressed-epoch-millis/graphs/parquet/raw/composite-merged-fk
./cypher/social-network-sf0.003-bi-composite-projected-fk-neo4j-compressed-epoch-millis/graphs/parquet/raw/composite-merged-fk/static
./cypher/social-network-sf0.003-bi-composite-projected-fk-neo
And here:
./update-streams-sf1/deletes/Forum.parquet
./update-streams-sf1/deletes/Person_knows_Person.parquet
./update-streams-sf1/deletes/Post.parquet
./update-streams-sf1/deletes/Forum_hasMember_Person.parquet
./update-streams-sf1/deletes/Person_likes_Post.parquet
./update-streams-sf1/deletes/Person_likes_Comment.parquet
./update-streams-sf1/deletes/Comment.parquet
./update-streams-sf1/deletes/Person.parquet
./update-streams-sf1/inserts/Forum.parquet
./update-streams-sf1/inserts/Person_knows_Person.parquet
./update-streams-sf1/inserts/Post.parquet
./update-streams-sf1/inserts/Forum_hasMember_Person.parquet
./update-streams-sf1/inserts/Person_likes_Post.parquet
./update-streams-sf1/inserts/Person_likes_Comment.parquet
./update-streams-sf1/inserts/Comment.parquet
./update-streams-sf1/inserts/Person.parquet
Ah, okay. So, in order to run the driver in "create validation parameters" mode on scale factor 1, it's recommended to base your properties file on driver/create-validation-parameters-sf1.properties, which has the correct paths.
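For example (a sketch; whether the wrapper script picks up this specific file name is an assumption about how you invoke it):
cd /mnt/nvme/ldbc_snb_interactive_impls/cypher
cp driver/create-validation-parameters-sf1.properties driver/create-validation-parameters.properties   # base the config on the SF1 variant
driver/create-validation-parameters.sh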
So, for SF1, I should use these: create-validation-parameters-sf1.properties, validate.properties, benchmark-sf1.properties?
Right. What is your goal? If you want to generate the expected output, use create-validation-parameters-sf1.properties. If you want to benchmark a system with the SF1 data set, use benchmark-sf1.properties. The latter will require fine-tuning the settings.
Another error after using create-validation-parameters-sf1.properties:
dh@bigroom:/mnt/nvme/ldbc_snb_interactive_impls/cypher$ driver/create-validation-parameters.sh
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
22:34:56.875 [main] INFO CreateValidationParamsMode - Loaded Workload: org.ldbcouncil.snb.driver.workloads.interactive.LdbcSnbInteractiveWorkload
Unable to load query from file: queries//interactive-complex-3-duration-as-function.cypher
Unable to load query from file: queries//interactive-complex-4-duration-as-function.cypher
Unable to load query from file: queries//interactive-complex-7-with-second.cypher
Unable to load query from file: queries//interactive-update-1-add-person.cypher
Unable to load query from file: queries//interactive-update-1-add-person-companies.cypher
Unable to load query from file: queries//interactive-update-1-add-person-emails.cypher
Unable to load query from file: queries//interactive-update-1-add-person-languages.cypher
Unable to load query from file: queries//interactive-update-1-add-person-tags.cypher
Unable to load query from file: queries//interactive-update-1-add-person-universities.cypher
Unable to load query from file: queries//interactive-update-4-add-forum.cypher
Unable to load query from file: queries//interactive-update-4-add-forum-tags.cypher
Unable to load query from file: queries//interactive-update-6-add-post.cypher
Unable to load query from file: queries//interactive-update-6-add-post-tags.cypher
Unable to load query from file: queries//interactive-update-7-add-comment.cypher
Unable to load query from file: queries//interactive-update-7-add-comment-tags.cypher
Unable to load query from file: queries//interactive-update-7-add-comment-weight.cypher
Unable to load query from file: queries//interactive-update-7-add-comment-edge.cypher
Unable to load query from file: queries//interactive-update-6-add-post-content.cypher
Unable to load query from file: queries//interactive-update-6-add-post-imagefile.cypher
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/neo4j/driver/AuthTokens has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.ldbcouncil.snb.impls.workloads.cypher.CypherDbConnectionState.<init>(CypherDbConnectionState.java:26)
at org.ldbcouncil.snb.impls.workloads.cypher.CypherDb.onInit(CypherDb.java:30)
at org.ldbcouncil.snb.impls.workloads.cypher.interactive.CypherInteractiveDb.onInit(CypherInteractiveDb.java:16)
at org.ldbcouncil.snb.driver.Db.init(Db.java:34)
at org.ldbcouncil.snb.driver.client.CreateValidationParamsMode.init(CreateValidationParamsMode.java:77)
at org.ldbcouncil.snb.driver.Client.main(Client.java:63)
This is a standard Java error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/neo4j/driver/AuthTokens has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0
You should upgrade to Java 17 (Java SE 17 = class file version 61 (0x3D hex), see https://en.wikipedia.org/wiki/Java_class_file).
It was the same after I did: export JAVA_HOME=/mnt/nvme/amazon-corretto-17.0.5.8.1-linux-x64/
Make sure this is also on the PATH: the java command should resolve to a binary whose java -version output reports version 17.
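For example (a sketch using the paths from your message):
which java        # should resolve to a binary under $JAVA_HOME/bin
java -version     # should report 17.x
# If an older Java is found first, the JDK 17 bin directory must come *before* it on the PATH,
# e.g. export PATH="${JAVA_HOME}/bin:${PATH}"  (prepended, not appended)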
It still returned the same error after I did:
export JAVA_HOME=/mnt/nvme/amazon-corretto-17.0.5.8.1-linux-x64/
export PATH=${PATH}:${JAVA_HOME}/bin
Do I need to set the Java path somewhere in your setup? I also suspect the query paths in the error log: they have "//" instead of "/" (see the log above).
I changed "queryDir=queries" instead of "queryDir=queries\". But still getting Unable to load query errors since the file names are difference in "queries" folder:
Unable to load query from file: queries/interactive-complex-3-duration-as-function.cypher
Unable to load query from file: queries/interactive-complex-4-duration-as-function.cypher
Unable to load query from file: queries/interactive-complex-7-with-second.cypher
Unable to load query from file: queries/interactive-update-1-add-person.cypher
Unable to load query from file: queries/interactive-update-1-add-person-companies.cypher
Unable to load query from file: queries/interactive-update-1-add-person-emails.cypher
Unable to load query from file: queries/interactive-update-1-add-person-languages.cypher
Unable to load query from file: queries/interactive-update-1-add-person-tags.cypher
Unable to load query from file: queries/interactive-update-1-add-person-universities.cypher
Unable to load query from file: queries/interactive-update-4-add-forum.cypher
Unable to load query from file: queries/interactive-update-4-add-forum-tags.cypher
Unable to load query from file: queries/interactive-update-6-add-post.cypher
Unable to load query from file: queries/interactive-update-6-add-post-tags.cypher
Unable to load query from file: queries/interactive-update-7-add-comment.cypher
Unable to load query from file: queries/interactive-update-7-add-comment-tags.cypher
Unable to load query from file: queries/interactive-update-7-add-comment-weight.cypher
Unable to load query from file: queries/interactive-update-7-add-comment-edge.cypher
Unable to load query from file: queries/interactive-update-6-add-post-content.cypher
Unable to load query from file: queries/interactive-update-6-add-post-imagefile.cypher
Unable to load query from file:
This is just a warning; you can ignore it.
I also suspect the query paths in the error log: they have "//" instead of "/".
The // in the path makes no difference; operating systems just interpret it as /.
Do I need to set the Java path somewhere in your setup?
No. But my recommendation would be to use SDKMAN to help switch between Java installations.
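A minimal sketch with SDKMAN (the exact candidate identifier is only an example; sdk list java shows what is available):
curl -s "https://get.sdkman.io" | bash    # install SDKMAN, then open a new shell or source ~/.sdkman/bin/sdkman-init.sh
sdk list java                             # pick a Java 17 candidate, e.g. a Corretto build
sdk install java 17.0.5-amzn              # example identifier
sdk use java 17.0.5-amzn                  # switch the current shell to it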
It worked well after installing JDK 17 with SDKMAN. Thanks a lot.
########### LdbcQuery4
########### LdbcQuery1
########### LdbcQuery4
########### LdbcInsert4AddForum
########### LdbcShortQuery1PersonProfile
########### LdbcInsert4AddForum
########### LdbcShortQuery1PersonProfile
########### LdbcInsert4AddForum
########### LdbcShortQuery1PersonProfile
23:32:27.747 [main] INFO CreateValidationParamsMode - Successfully generated 172 database validation parameters
Great! Have a good weekend, you too.
I am trying to test the queries at this link: https://github.com/ldbc/ldbc_snb_interactive_impls/tree/main/cypher/queries
As I understand it, there are two modes in the ldbc_snb_datagen_spark repo: bi and interactive. I believe the interactive mode generates the data set for the queries at the link above. So, I generated a data set by modifying the command described at https://github.com/ldbc/ldbc_snb_interactive_impls/tree/main/cypher like this: tools/run.py --cores 24 --memory 100g -- --mode interactive --format csv --scale-factor 1 --output-dir out-sf1/ --explode-edges --epoch-millis --format-options header=false,quoteAll=true
Q1. Am I correct in thinking that I need to generate the data set with interactive mode to test the queries above?
Then, I tried to import it using scripts/load-in-one-step.sh, but got errors like:
find: ‘/mnt/nvme/ldbc_snb_datagen_spark/out-sf1/graphs/csv/interactive/composite-projected-fk/initial_snapshot/static/Place’: No such file or directory
find: ‘/mnt/nvme/ldbc_snb_datagen_spark/out-sf1/graphs/csv/interactive/composite-projected-fk/initial_snapshot/static/Organisation’: No such file or directory
...
I discovered that there was no initial_snapshot folder in the output of interactive mode.
Q2. Is this normal? Do I just need to add an initial_snapshot folder to the output folder, or remove all occurrences of "initial_snapshot" from import.sh? (https://github.com/ldbc/ldbc_snb_interactive_impls/blob/fb1bf3b79d9aca5a4dd3262e2622bd730615e78f/cypher/scripts/import.sh)
When I tried running scripts/load-in-one-step.sh after adding an "initial_snapshot" folder and moving the dynamic and static folders into it, I encountered the following errors: