Hi @aMahanna,
Two things:
1. The datagen is deterministic, so the graphs (including the IDs) should be the same between different generators. Therefore, combining files from different data sets should be possible without introducing inconsistencies.
2. For pre-processing the files, two approaches work well (see the sketch below):
   i. the usual UNIX tools like grep, cat, cut, etc., which are well suited to splitting files;
   ii. DuckDB, which additionally allows joins, aggregation (string_agg), and unwinding (unnest). I have a couple of example scripts for SNB BI: https://github.com/ldbc/ldbc_snb_example_data/tree/main/export
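For instance, a minimal sketch of both approaches (the file and column names below are illustrative, not the exact SNB CSV headers):

# (i) UNIX tools: project the first two columns of a pipe-separated CSV
cut -d'|' -f1,2 post_0_0.csv > post_projected.csv

# (ii) DuckDB: unwind a composite (semicolon-separated) attribute into one row per value
duckdb <<'EOF'
COPY (
    SELECT id, unnest(string_split(email, ';')) AS email
    FROM read_csv_auto('person_0_0.csv', delim='|', header=true)
) TO 'person_email.csv' (HEADER, DELIMITER '|');
EOF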
Gabor
Can you please clarify how the results are inconsistent?
Apologies for the delay and for the confusion; after further investigation, we discovered a formatting mistake on our part when combining the files.
I will follow up shortly regarding your second point, but for now I just want to say thank you for all the help so far
Hi Gabor,
We've been evaluating the various SNB datasets available in an attempt to support our database's multi-model functionality.
We found that using a combination of the Basic & MergeForeign datasets substantially increases our query performance and better suits our data model. Our request would be to have the datagen natively support the data model outlined below, or to suggest a way to achieve it if one already exists. As it stands now, modelling the data in this way requires a lot of pre-/post-processing (as suggested above), which we believe will count against us if we were to have the benchmark audited.
In particular, we have a query that benefits from the Basic dataset (IC8), a query that benefits from the MergeForeign dataset (IC3 Sub-Query A), and a query that benefits from a combination of both (IC3 Sub-Query B).
Understanding that you may not be familiar with AQL (Arango Query Language), this query relies on the edge relationships only available in the Basic dataset (e.g. post_hasCreator_person, comment_hasCreator_person, etc.).
// two hops backwards from the person: incoming hasCreator edges lead to the
// person's messages, incoming replyOf edges lead to the replies to them
FOR commentReply IN 2..2 INBOUND @personId post_hasCreator_person, comment_hasCreator_person, comment_replyOf_post, comment_replyOf_comment
  SORT commentReply.creationDate DESC, commentReply._id
  LIMIT 20
  // resolve each reply's author via the hasCreator edge collection
  FOR creator IN 1..1 OUTBOUND commentReply comment_hasCreator_person
    RETURN {
      id: creator._id,
      firstName: creator.firstName,
      lastName: creator.lastName,
      commentId: commentReply._id,
      commentCreationDate: commentReply.creationDate,
      commentContent: commentReply.content
    }
The alternative approach is to rely solely on the MergeForeign attributes (i.e. creator, replyOfPost, replyOfComment). Since none of the edge relationships mentioned above are included in MergeForeign, switching to these attributes results in query performance that is 6x slower than the current implementation. On the other hand, sticking to a Basic-only data model poses its own challenges, as seen below.
We've noticed peak performance in IC3 when a combination of Basic SNB edge relationships and MergeForeign SNB attributes is used within the same query. A portion of IC3 relies on the person.place MergeForeign attribute for efficient query performance.
// friends and friends-of-friends, traversed breadth-first with global
// vertex deduplication
FOR friend IN 1..2 ANY @personId person_knows_person OPTIONS {bfs: true, uniqueVertices: "global"}
  // the place attribute comes from the MergeForeign layout
  FILTER friend.place NOT IN [countryXKey, countryYKey]
  RETURN {id: friend.id, place: friend.place}
Attempting to do this using the Basic SNB person_isLocatedIn_place edge relationship results in query performance that is 70x slower.
Another portion of IC3 relies on the post.place and comment.place MergeForeign attributes, while also benefitting from the post_hasCreator_person and comment_hasCreator_person relationships (found only in the Basic SNB dataset).
// messages (posts and comments) created by the friend, via incoming hasCreator edges
FOR message IN 1..1 INBOUND friend post_hasCreator_person, comment_hasCreator_person
  // the place attribute comes from the MergeForeign layout
  FILTER message.place IN [countryXKey, countryYKey]
  RETURN message
Attempting to do this using the Basic SNB post_isLocatedIn_place and comment_isLocatedIn_place edge relationships results in query performance that is 30x slower.
As far as we can tell, the current datagen utility doesn't support this, and we feel this leaves out the multi-model graph capabilities offered by our database. We are not looking to manipulate the data in a way that specifically favours us; rather, we are looking for the LDBC datagen to better support the functionality of multi-model graph databases.
Would it be possible to have the datagen support this data model out of the box (assuming it doesn't already)?
@aMahanna I transferred the issue to the (new, Spark-based) Datagen's repository. I skimmed your suggestion and it seems doable in the Datagen, although it will not have a high priority in our development plans.
This week I'm travelling/have other duties -- I will take a look next week.
Hello again,
The bad news: this functionality is unlikely to be supported in the Datagen.
The good news: I have generated the data sets and uploaded them to Cloudflare R2 (an egress-free object storage):
Gabor
Hi @szarnyasg
Sorry to hear that this functionality won't be supported in the utility, as it fits multi-model graph databases quite well. Was there some issue with implementing it or would you still be open to having it added if we were able to?
Apologies if I am missing something, but the data sets you just provided seem to have the same schema as before; was that the intention? I'm just trying to determine whether there is a difference between these and the SURF data sets.
Thank you again for all the help so far!
Chris
Hi,
It’s the same schema as before, but R2 is (slightly) faster than SURF.
Sure, we are open to reviewing PRs in the Datagen.
Gabor
By the way, here is maybe an important piece of information that's missing from the discussion above: systems can pre-process the data set before loading. So you can take e.g. the composite merge foreign CSV files, run them through a script (which can use anything: cut, Perl scripts, a DuckDB SQL script, etc.), create a new set of CSV files, and then load those into the system under test. We try to avoid this in the reference implementations, but it is definitely a possibility.
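For example, a hypothetical pre-processing step (the column names are illustrative and should be checked against the actual headers) that derives a post_hasCreator_person edge file from the merged Post files:

# derive an edge file from the creator foreign-key column of the merged Post CSVs
duckdb <<'EOF'
COPY (
    SELECT id AS PostId, creator AS PersonId
    FROM read_csv_auto('Post/part-*.csv', delim='|', header=true)
) TO 'post_hasCreator_person.csv' (HEADER, DELIMITER '|');
EOF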
Hi Gabor @szarnyasg
Sorry to keep this thread going so long, but I downloaded and attempted to decompress the files above: SF1 worked fine, but SF1000 reports the following error:
/*stdin*\ : Read error (39) : premature end
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
The command I ran was the following:
tar --use-compress-program=unzstd -xvf bi-sf1000-composite-projected-fk.tar.zst.000
I attempted this with both the merge and projected files and receive the same error for the SF1000 files. Do you have any suggestions?
Hi Chris,
Use cat + tar + unzstd: https://github.com/ldbc/auditing-tools/blob/main/cloudflare-r2.md#recombining-and-decompressing-data-sets
For this, you'll need the 000, 001, etc. files in the same location.
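For example, using the projected SF1000 archive from above (the shell glob concatenates the parts in the correct lexicographic order):

cat bi-sf1000-composite-projected-fk.tar.zst.* | unzstd | tar -xvf -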
Gabor
Hi Gabor,
I am unable to access that link; it shows a 404.
Chris
Oops, I linked to a private repo :). This is its public counterpart:
Thank you, that did the trick!
Hi Gabor,
Do you, by chance, have the IC substitution parameters used for the data sets you shared above? I found this: https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#parameters but that only has the BI parameters.
To confirm: while this is tagged with bi, I assumed the initial_snapshot would work for the IC queries as well. Is this true?
Hi Chris,
We are working on tuning the parameter generator for the new Interactive workload.
Currently your best bet would be to download the factor tables from [1] and run them through the Interactive v2 driver’s paramgen at [2]. These will give valid parameters for queries on the initial snapshot.
The final version of the paramgen for Interactive v2 will produce parameters bucketed by days (in the network's simulation time) and will be better calibrated to ensure stable runtimes (i.e. the runtimes will follow a Gaussian distribution more closely). This is being worked on but is still a few weeks away at the moment.
Best,
Gabor
PS: Most of the SNB task force is currently busy with other tasks or on holiday, so there will be some delay in answering issues in the coming week.
[1] https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#factor-tables
[2] https://github.com/ldbc/ldbc_snb_interactive_driver/tree/main/paramgen
I attempted to run the paramgen, but I must be missing something. I copied the factor folders into a factors folder I created within the paramgen folder, so the resulting structure looks like the following:
ls /data/ldbc_snb_interactive_driver/paramgen/factors/parquet/raw/composite-merged-fk/
cityNumPersons/ countryPairsNumFriends/ languageNumPosts/ personDays/ personLikesNumMessages/ personNumFriendTags/ sameUniversityConnected/
cityPairsNumFriends/ creationDayAndLengthCategoryNumMessages/ lengthNumMessages/ personDisjointEmployerPairs/ personNumFriendComments/ personNumFriends/ tagClassNumMessages/
companyNumEmployees/ creationDayAndTagClassNumMessages/ messageIds/ personFirstNames/ personNumFriendOfFriendCompanies/ personNumFriendsOfFriendsOfFriends/ tagClassNumTags/
countryNumMessages/ creationDayAndTagNumMessages/ people2Hops/ personKnowsPersonConnected/ personNumFriendOfFriendForums/ personStudyAtUniversityDays/ tagNumMessages/
countryNumPersons/ creationDayNumMessages/ people4Hops/ personKnowsPersonDays/ personNumFriendOfFriendPosts/ personWorkAtCompanyDays/ tagNumPersons/
After that, I export the LDBC_SNB_DATA_ROOT_DIRECTORY variable to point at the data directory:
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/110822/merged/bi-sf1000-composite-merged-fk
And then I attempt to run the script from within the ldbc_snb_interactive_driver/paramgen directory:
./scripts/paramgen.sh
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 110, in run
path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
self.create_views()
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 52, in create_views
self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/110822/merged/bi-sf1000-composite-merged-fk/graphs/parquet/raw/composite-merged-fk/dynamic/Person/*.parquet"
Do you have any suggestions for how I can resolve this?
Oops, I forgot that the paramgen has undergone some changes recently and now needs the raw data sets for parameter selection. You can find them at the following links:
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf30-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf100-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf300-raw.tar.zst
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf1000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.002
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf3000-raw.tar.zst.003
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.000
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.001
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.002
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.003
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.004
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.005
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.006
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.007
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.008
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.009
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.010
https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/bi-raw/bi-sf10000-raw.tar.zst.011
Thank you! To confirm, these generated parameters will be compatible with the Cloudflare data sets you linked above?
Yes, they should be compatible
After downloading and unpacking I now receive the following:
root@dataloader-0:/data/ldbc_snb_interactive_driver/paramgen# scripts/paramgen.sh
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 110, in run
path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
self.create_views()
File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 66, in create_views
self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/ldbc_snb_interactive_driver/paramgen/scratch/factors/people4Hops/*.parquet"
The folders in dynamic are the following:
Comment/ Forum/ Forum_hasTag_Tag/ Person_hasInterest_Tag/ Person_likes_Comment/ Person_studyAt_University/ Post/ _SUCCESS
Comment_hasTag_Tag/ Forum_hasMember_Person/ Person/ Person_knows_Person/ Person_likes_Post/ Person_workAt_Company/ Post_hasTag_Tag/
You need both the factors and the raw data set. See the CI commands for an example of where to place these directories:
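As a quick sanity check before running the script (the paths below are taken from the tracebacks earlier in this thread; the CI workflow remains the authoritative reference for the exact layout):

# the raw data set is resolved relative to LDBC_SNB_DATA_ROOT_DIRECTORY
ls "$LDBC_SNB_DATA_ROOT_DIRECTORY"/graphs/parquet/raw/composite-merged-fk/dynamic/Person/ | head

# the factor tables (including people4Hops) must also be present
ls factors/parquet/raw/composite-merged-fk/people4Hops/ | head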
Thank you, the directory I was supplying was too deep in the tree. It seems to have run for quite a while, but then I am given the following error:
Traceback (most recent call last):
File "paramgen.py", line 273, in <module>
PG.run()
File "paramgen.py", line 122, in run
self.generate_parameter_for_query_type(self.start_date, self.start_date, "13b")
File "paramgen.py", line 200, in generate_parameter_for_query_type
self.cursor.execute(f"INSERT INTO 'Q_{query_variant}' SELECT * FROM ({parameter_query});")
duckdb.BinderException: Binder Error: Referenced column "useFrom" not found in FROM clause!
Candidate bindings: "personIds.diff"
@cw00dw0rd I added a sample script to the driver's CI that shows how to use the paramgen:
Let me know if this fails for any of the larger data sets -- if so, there is a problem with the data sets.
(Note that the ${LDBC_SNB_DATA_ROOT_DIRECTORY} env var is currently used inconsistently between the conversion and the paramgen scripts. We'll fix this eventually (https://github.com/ldbc/ldbc_snb_interactive_driver/issues/219); in the meantime, it's easy to work around.)
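A minimal sketch of such a workaround, assuming the inconsistency is only in which root directory each script expects (the paths below are hypothetical):

# re-export the variable before each step, pointing at the root that step expects
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/root/expected/by/conversion   # hypothetical path
# ... run the conversion step ...
export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/root/expected/by/paramgen     # hypothetical path
./scripts/paramgen.sh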
@szarnyasg thank you, I will give this a try today and report back.
Hi again 😄
In an experiment to support our database's multi-model functionality, we are trying to combine the edges generated for the SNB Basic dataset with the files generated for the SNB CompositeMergeForeign dataset.
~We are getting inconsistent results~, and we wondered whether there is any consideration of supporting this in the datagen, or whether by any chance it is already possible.
For example, we want a data model where both the post_hasCreator_person relationship and the creator attribute in the Post document exist; a sketch of how we currently combine the outputs is shown below. Happy to move this conversation to the datagen repo if that makes more sense.
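The sketch (the directory layout and file names are illustrative, not the exact datagen output):

# generate the same scale factor with both serializers, then copy the Basic
# edge files into the MergeForeign output before loading
cp basic/social_network/dynamic/post_hasCreator_person_*.csv mergeforeign/social_network/dynamic/
cp basic/social_network/dynamic/comment_hasCreator_person_*.csv mergeforeign/social_network/dynamic/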