Hi @aMahanna,
Two things:
1. The datagen is deterministic, so the graphs (including the IDs) should be the same between different generators. Therefore, combining files from different data sets should be possible without introducing inconsistencies.
2. For pre-processing the files, two approaches work well (see the sketch below):
   i. the usual UNIX tools like grep, cat, cut, etc., which are well suited to splitting files;
   ii. DuckDB, which additionally allows joins, aggregation (string_agg), and unwinding (unnest). I have a couple of example scripts for SNB BI: https://github.com/ldbc/ldbc_snb_example_data/tree/main/export
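For instance, a minimal sketch of both approaches (the file and column names below are illustrative, not the exact SNB CSV headers):

# (i) UNIX tools: project the first two columns of a pipe-separated CSV
cut -d'|' -f1,2 post_0_0.csv > post_projected.csv

# (ii) DuckDB: unwind a composite (semicolon-separated) attribute into one row per value
duckdb <<'EOF'
COPY (
    SELECT id, unnest(string_split(email, ';')) AS email
    FROM read_csv_auto('person_0_0.csv', delim='|', header=true)
) TO 'person_email.csv' (HEADER, DELIMITER '|');
EOF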
Gabor
Can you please clarify how the results are inconsistent?
Apologies for the delay and for the confusion; after further investigation, we discovered a formatting mistake on our part when combining the files.
I will follow up shortly regarding your second point, but for now I just want to say thank you for all the help so far
Hi Gabor,
We've been evaluating the various SNB datasets available in an attempt to support our database's multi-model functionality.
We found that using a combination of the Basic & MergeForeign datasets substantially increases our query performance and better suits our data model. Our request would be to have the datagen natively support the data model outlined below, or to suggest a way to achieve it if one already exists. As it stands now, modelling the data in this way requires a lot of pre-/post-processing (as suggested above), which we believe will count against us if we were to have the benchmark audited.
In particular, we have a query that benefits from the Basic dataset (IC8), a query that benefits from the MergeForeign dataset (IC3 Sub-Query A), and a query that benefits from a combination of both (IC3 Sub-Query B).
Understanding that you may not be familiar with AQL (Arango Query Language), this query relies on the edge relationships only available in the Basic dataset (e.g. post_hasCreator_person, comment_hasCreator_person, etc.).
// two hops backwards from the person: incoming hasCreator edges lead to the
// person's messages, incoming replyOf edges lead to the replies to them
FOR commentReply IN 2..2 INBOUND @personId post_hasCreator_person, comment_hasCreator_person, comment_replyOf_post, comment_replyOf_comment
  SORT commentReply.creationDate DESC, commentReply._id
  LIMIT 20
  // resolve each reply's author via the hasCreator edge collection
  FOR creator IN 1..1 OUTBOUND commentReply comment_hasCreator_person
    RETURN {
      id: creator._id,
      firstName: creator.firstName,
      lastName: creator.lastName,
      commentId: commentReply._id,
      commentCreationDate: commentReply.creationDate,
      commentContent: commentReply.content
    }
The alternative approach is to rely solely on the MergeForeign attributes (i.e. creator, replyOfPost, replyOfComment). Since none of the edge relationships mentioned above are included in MergeForeign, switching to these attributes results in query performance that is 6x slower than the current implementation. On the other hand, sticking to a Basic-only data model poses its own challenges, as seen below.
We've noticed peak performance in IC3 when a combination of Basic SNB edge relationships and MergeForeign SNB attributes is used within the same query. A portion of IC3 relies on the person.place MergeForeign attribute for efficient query performance.
// friends and friends-of-friends, traversed breadth-first with global
// vertex deduplication
FOR friend IN 1..2 ANY @personId person_knows_person OPTIONS {bfs: true, uniqueVertices: "global"}
  // the place attribute comes from the MergeForeign layout
  FILTER friend.place NOT IN [countryXKey, countryYKey]
  RETURN {id: friend.id, place: friend.place}
Attempting to do this using the Basic SNB person_isLocatedIn_place edge relationship results in query performance that is 70x slower.
Another portion of IC3 relies on the post.place and comment.place MergeForeign attributes, while also benefitting from the post_hasCreator_person and comment_hasCreator_person relationships (found only in the Basic SNB dataset).
// messages (posts and comments) created by the friend, via incoming hasCreator edges
FOR message IN 1..1 INBOUND friend post_hasCreator_person, comment_hasCreator_person
  // the place attribute comes from the MergeForeign layout
  FILTER message.place IN [countryXKey, countryYKey]
  RETURN message
Attempting to do this using the Basic SNB post_isLocatedIn_place and comment_isLocatedIn_place edge relationships results in query performance that is 30x slower.
As far as we can tell, the current datagen utility doesn't support this, and we feel this leaves out the multi-model graph capabilities offered by our database. We are not looking to manipulate the data in a way that specifically favours us; rather, we are looking for the LDBC datagen to better support the functionality of multi-model graph databases.
Would it be possible to have the datagen support this data model out of the box (assuming it doesn't already)?
@aMahanna I transferred the issue to the (new, Spark-based) Datagen's repository. I skimmed your suggestion and it seems doable in the Datagen, although it will not have a high priority in our development plans.
This week I'm travelling/have other duties -- I will take a look next week.
Hello again,
The bad news: this functionality is unlikely to be supported in the Datagen.
The good news: I have generated the data sets and uploaded them to Cloudflare R2 (an egress-free object storage):
Gabor
Hi @szarnyasg
Sorry to hear that this functionality won't be supported in the utility, as it fits multi-model graph databases quite well. Was there some issue with implementing it or would you still be open to having it added if we were able to?
Apologies if I am missing something, but the data sets you just provided seem to have the same schema as before; was that the intention? I'm just trying to determine whether there is a difference between these and the SURF data sets.
Thank you again for all the help so far!
Chris
Hi,
It’s the same schema as before, but R2 is (slightly) faster than SURF.
Sure, we are open to reviewing PRs in the Datagen.
Gabor
By the way, here is maybe an important piece of information that's missing from the discussion above: systems can pre-process the data set before loading. So you can take e.g. the composite merge foreign CSV files, run them through a script (which can use anything: cut, Perl scripts, a DuckDB SQL script, etc.), create a new set of CSV files, and then load those into the system under test. We try to avoid this in the reference implementations, but it is definitely a possibility.
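For example, a hypothetical pre-processing step (the column names are illustrative and should be checked against the actual headers) that derives a post_hasCreator_person edge file from the merged Post files:

# derive an edge file from the creator foreign-key column of the merged Post CSVs
duckdb <<'EOF'
COPY (
    SELECT id AS PostId, creator AS PersonId
    FROM read_csv_auto('Post/part-*.csv', delim='|', header=true)
) TO 'post_hasCreator_person.csv' (HEADER, DELIMITER '|');
EOF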
Hi Gabor @szarnyasg
Sorry to keep this thread going so long, but I downloaded and attempted to decompress the files above: SF1 worked fine, but SF1000 reports the following error:
/*stdin*\ : Read error (39) : premature end
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
The command I ran was the following:
tar --use-compress-program=unzstd -xvf bi-sf1000-composite-projected-fk.tar.zst.000
I attempted this with both the merge and projected files and receive the same error for the SF1000 files. Do you have any suggestions?
Hi Chris,
Use cat + tar + unzstd: https://github.com/ldbc/auditing-tools/blob/main/cloudflare-r2.md#recombining-and-decompressing-data-sets
For this, you'll need the 000, 001, etc. files in the same location.
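For example, using the projected SF1000 archive from above (the shell glob concatenates the parts in the correct lexicographic order):

cat bi-sf1000-composite-projected-fk.tar.zst.* | unzstd | tar -xvf -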
Gabor
Hi Gabor,
I am unable to access that link; it shows a 404.
Chris
Oops, I linked to a private repo :). This is its public counterpart:
Thank you, that did the trick!
Hi Gabor,
Do you, by chance, have the IC substitution parameters used for the data sets you shared above? I found this: https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#parameters but that only has the BI parameters.
To confirm: while this is tagged with bi, I assumed the initial_snapshot would work for the IC queries as well. Is this true?
Hi Chris,
We are working on tuning the parameter generator for the new Interactive workload.
Currently your best bet would be to download the factor tables from [1] and run them through the Interactive v2 driver’s paramgen at [2]. These will give valid parameters for queries on the initial snapshot.
The final version of the paramgen for Interactive v2 will produce parameters bucketed by days (in the network's simulation time) and will be better calibrated to ensure stable runtimes (i.e. the runtimes will follow a Gaussian distribution more closely). This is being worked on but is still a few weeks away at the moment.
Best,
Gabor
PS: Most of the SNB task force is currently busy with other tasks or on holiday, so there will be some delay in answering issues in the coming week.
[1] https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#factor-tables
[2] https://github.com/ldbc/ldbc_snb_interactive_driver/tree/main/paramgen
I attempted to run the paramgen, but I must be missing something. I copied the factor folders into a factors folder I created within the paramgen folder, so the resulting structure looks like the following:
ls /data/ldbc_snb_interactive_driver/paramgen/factors/parquet/raw/composite-merged-fk/
cityNumPersons/ countryPairsNumFriends/ languageNumPosts/ personDays/ personLikesNumMessages/ personNumFriendTags/ sameUniversityConnected/
cityPairsNumFriends/ creationDayAndLengthCategoryNumMessages/ lengthNumMessages/ personDisjointEmployerPairs/ personNumFriendComments/ personNumFriends/ tagClassNumMessages/
companyNumEmployees/ creationDayAndTagClassNumMessages/ messageIds/ personFirstNames/ personNumFriendOfFriendCompanies/ personNumFriendsOfFriendsOfFriends/ tagClassNumTags/
countryNumMessages/ creationDayAndTagNumMessages/ people2Hops/ personKnowsPersonConnected/ personNumFriendOfFriendForums/ personStudyAtUniversityDays/ tagNumMessages/
countryNumPersons/ creationDayNumMessages/ people4Hops/ personKnowsPersonDays/ personNumFriendOfFriendPosts/ personWorkAtCompanyDays/ tagNumPersons/
After that, I export the LDBC_SNB_DATA_ROOT_DIRECTORY variable to point at the data directory:
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/110822/merged/bi-sf1000-composite-merged-fk
And then I attempt to run the script from within the ldbc_snb_interactive_driver/paramgen directory:
./scripts/paramgen.sh
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 110, in run
path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
self.create_views()
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 52, in create_views
self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/110822/merged/bi-sf1000-composite-merged-fk/graphs/parquet/raw/composite-merged-fk/dynamic/Person/*.parquet"
Do you have any suggestions for how I can resolve this?
Oops, I forgot that the paramgen has undergone some changes recently and now needs the raw data sets for parameter selection. You can find them at the following links:
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf30-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf100-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf300-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.002
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.003
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.002
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.003
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.004
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.005
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.006
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.007
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.008
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.009
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.010
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.011
Thank you! To confirm, these generated parameters will be compatible with the Cloudflare data sets you linked above?
Yes, they should be compatible
After downloading and unpacking I now receive the following:
root@dataloader-0:/data/ldbc_snb_interactive_driver/paramgen# scripts/paramgen.sh
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 110, in run
path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
self.create_views()
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 66, in create_views
self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/ldbc_snb_interactive_driver/paramgen/scratch/factors/people4Hops/*.parquet"
The folders in dynamic are the following:
Comment/ Forum/ Forum_hasTag_Tag/ Person_hasInterest_Tag/ Person_likes_Comment/ Person_studyAt_University/ Post/ _SUCCESS
Comment_hasTag_Tag/ Forum_hasMember_Person/ Person/ Person_knows_Person/ Person_likes_Post/ Person_workAt_Company/ Post_hasTag_Tag/
You need both the factors and the raw data set. See the CI commands for an example of where to place these directories:
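As a quick sanity check before running the script (the paths below are taken from the tracebacks earlier in this thread; the CI workflow remains the authoritative reference for the exact layout):

# the raw data set is resolved relative to LDBC_SNB_DATA_ROOT_DIRECTORY
ls "$LDBC_SNB_DATA_ROOT_DIRECTORY"/graphs/parquet/raw/composite-merged-fk/dynamic/Person/ | head

# the factor tables (including people4Hops) must also be present
ls factors/parquet/raw/composite-merged-fk/people4Hops/ | head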
Thank you, the directory I was supplying was too deep in the tree. It seems to have run for quite a while, but then I am given the following error:
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 122, in run
self.generate_parameter_for_query_type(self.start_date, self.start_date, "13b")
File "paramgen.py", line 200, in generate_parameter_for_query_type
self.cursor.execute(f"INSERT INTO 'Q_{query_variant}' SELECT * FROM ({parameter_query});")
duckdb.BinderException: Binder Error: Referenced column "useFrom" not found in FROM clause!
Candidate bindings: "personIds.diff"
@cw00dw0rd I added a sample script to the driver's CI that shows how to use the paramgen:
Let me know if this fails for any of the larger data sets -- if so, there is a problem with the data sets.
(Note that the ${LDBC_SNB_DATA_ROOT_DIRECTORY} env var is currently used inconsistently between the conversion and the paramgen scripts. We'll fix this eventually (https://github.com/ldbc/ldbc_snb_interactive_driver/issues/219); in the meantime, it's easy to work around.)
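A minimal sketch of such a workaround, assuming the inconsistency is only in which root directory each script expects (the paths below are hypothetical):

# re-export the variable before each step, pointing at the root that step expects
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/root/expected/by/conversion   # hypothetical path
# ... run the conversion step ...
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/root/expected/by/paramgen     # hypothetical path
./scripts/paramgen.sh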
@szarnyasg thank you, I will give this a try today and report back.
Hi again 😄
In an experiment to support our database's multi-model functionality, we are trying to combine the edges generated for the SNB Basic dataset with the files generated for the SNB CompositeMergeForeign dataset.
~We are getting inconsistent results~, and we wondered whether there is any consideration of supporting this in the datagen, or whether by any chance it is already possible.
For example, we want a data model where both the post_hasCreator_person relationship and the creator attribute in the Post document exist; a sketch of how we currently combine the outputs is shown below. Happy to move this conversation to the datagen repo if that makes more sense.
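The sketch (the directory layout and file names are illustrative, not the exact datagen output):

# generate the same scale factor with both serializers, then copy the Basic
# edge files into the MergeForeign output before loading
cp basic/social_network/dynamic/post_hasCreator_person_*.csv mergeforeign/social_network/dynamic/
cp basic/social_network/dynamic/comment_hasCreator_person_*.csv mergeforeign/social_network/dynamic/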