neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
596 stars 157 forks source link

GDS 2.1.12 Don't filter Nodes When Runs an Algorithm #222

Closed PonceOrtiz closed 1 year ago

PonceOrtiz commented 1 year ago

Describe the bug

To Reproduce

GDS version: 2.1.12 Neo4j Desktop version: 1.4.15 Operating system: Windows 10

Steps to reproduce the behavior:

Expected behavior When you specify node filters ['User'] , you expected only labels "User" ,but instead all nodes are retrieved

Additional context file dump database can't be uploaded by size restrictions , you can test this beheavor in your own graph database with GDS version: 2.1.12

scripts.txt

FlorentinD commented 1 year ago

Hello @PonceOrtiz , The error is in your projection. Using * for the node projection, will result in GDS not knowing of any node labels. So node-filter has no effect here and we return all labels as the User is not kn.

Arguably, this is not the best user experience and we fixed this in the next upcoming version (2.2) to throw an exception in the future.

For now, replacing the * node projection with an explicit list of all node labels in your graph should solve your issue.

PonceOrtiz commented 1 year ago

Hi Florentin

Yes I'm using explicit node declaration now , but the reason to use ['*'] it was just to have one graph projection using diferent graph algorithms

Thanks

El jue., 29 de septiembre de 2022 9:20, Florentin Dörre < @.***> escribió:

Hello @PonceOrtiz https://github.com/PonceOrtiz , The error is in your projection. Using * for the node projection, will result in GDS not knowing of any node labels. So node-filter has no effect here and we return all labels as the User is not kn.

Arguably, this is not the best user experience and we fixed this in the next upcoming version (2.2) to throw an exception in the future.

For now, replacing the * node projection with an explicit list of all node labels in your graph should solve your issue.

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/222#issuecomment-1262356047, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3K3525YGITQV5YJPBUAB2DWAWQUTANCNFSM6AAAAAAQYY6JRI . You are receiving this because you were mentioned.Message ID: @.***>

FlorentinD commented 1 year ago

I dont see why you cant have one graph projection for different graph algorithms when using explicit node declaration.

You project more than one label in the gds.graph.project call such as ['A', 'B', 'C'] and in the graph algorithms you can specify a nodeLabels which is a subset of the projected node-labels such as ['B', 'C'].

FlorentinD commented 1 year ago

I will close this issue for now. Feel free to reopen if there is something missing.

PonceOrtiz commented 1 year ago

Hi Florentin,

Hi Florentin, Why are you closing it ? It's a bug , the code to run an algorithm is not working as documentation tells. I've tell you a reason to use it in that way (memory allocation) , even more if someone is using the same way in a lower version in GDS , then upgrading it will not work ... a code must be reliable , consistent

Who´s the product owner of GDS ?

[image: bug.png]

El lun, 10 oct 2022 a la(s) 07:57, Florentin Dörre @.***) escribió:

I will close this issue for now. Feel free to reopen if there is something missing.

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/222#issuecomment-1273275780, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3K3527VPTZ33AEG5WGSOVLWCQHCBANCNFSM6AAAAAAQYY6JRI . You are receiving this because you were mentioned.Message ID: @.***>

FlorentinD commented 1 year ago

Hello PonceOrtiz, there seems to be a misunderstanding then. I closed it because you never responded to my last response, in which I tried to understand your use-case better.

Also, to my understanding, using the explicit declaration works for you. So I dont see how this is actually a bug, but rather a misuse of the * projection.

Could you link me the documentation where we explain your expected behavior?

Unfortunately, the attached image wont show when you respond via email.

PonceOrtiz commented 1 year ago

Ok I understand , here is the link In nodelabels section is what im talking about

https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/

El lun., 10 de octubre de 2022 8:45, Florentin Dörre < @.***> escribió:

Hello PonceOrtiz, there seems to be a misunderstanding then. I closed it because you never responded to my last response, in which I tried to understand your use-case better.

Also, to my understanding, using the explicit declaration works for you. So I dont see how this is actually a bug, but rather a misuse of the * projection.

Could you link me the documentation where we explain your expected behavior?

Unfortunately, the attached image wont show when you respond via email.

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/222#issuecomment-1273335393, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3K3527KYIGFELDYXZK3HYLWCQMWZANCNFSM6AAAAAAQYY6JRI . You are receiving this because you were mentioned.Message ID: @.***>

FlorentinD commented 1 year ago

The nodeLabels parameter talks about Filter the named graph using the given node labels.. Here node labels refers to node labels known inside of GDS. These are defined in the node projection, you specify during graph.graph.project.

Hence, when * is specified on gds.graph.project, you can verify with gds.graph.list, that only one node label is known to GDS. Therefore, it is unexpected that node labels specified for an algorithm would work with anything else then *.

We document the specifically the nodeLabels parameter at https://neo4j.com/docs/graph-data-science/current/common-usage/running-algos/#common-configuration-node-labels. There we also mention If the graph, on which the algorithm is run, was projected with multiple node label projections [...]

PonceOrtiz commented 1 year ago

Well is so clear that this is not happening : filtering nodeLabels - List of String

If the graph, on which the algorithm is run, was projected with multiple node label projections, this parameter can be used to select only a subset of the projected labels. The algorithm will only consider nodes with the selected labels.

El lun., 10 de octubre de 2022 9:28, Florentin Dörre < @.***> escribió:

The nodeLabels parameter talks about Filter the named graph using the given node labels.. Here node labels refers to node labels known inside of GDS. These are defined in the node projection, you specify during graph.graph.project.

Hence, when is specified on gds.graph.project, you can verify with gds.graph.list, that only one node label is known to GDS. Therefore, it is unexpected that node labels specified for an algorithm would work with anything else then .

We document the specifically the nodeLabels parameter at https://neo4j.com/docs/graph-data-science/current/common-usage/running-algos/#common-configuration-node-labels . There we also mention If the graph, on which the algorithm is run, was projected with multiple node label projections [...]

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/222#issuecomment-1273402416, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3K352ZADEIV25ZVP4ETO23WCQRXFANCNFSM6AAAAAAQYY6JRI . You are receiving this because you were mentioned.Message ID: @.***>

FlorentinD commented 1 year ago

Yes if the graph was projected with multiple node label projections. But your used node projection ['*'] in the gds.graph.project call is a single node projection.

Do you mean, its a bug, that we dont throw an error on the filtered query? If thats the case, I can recommend upgrading to 2.2 where we improved the filter validation.

With 2.2, you will get an error when trying to execute your filtered wcc.

Failed to invoke procedure `gds.wcc.stream`: Caused by: java.lang.IllegalArgumentException: Could not find the specified `nodeLabels` of ['User']. Available labels are ['__ALL__'].
Mats-SX commented 1 year ago

@PonceOrtiz In general, we recommend to not use * (star) projections for anything other than quick testing. The reason is that star projections (intentionally) erases label/type information and treats every node as being of the same kind. In order to tell nodes apart on basis of their labels, you must specify exactly which labels to use.

If you want the behaviour where all node labels are projected from the database to the GDS graph, you must specify all of them. Fortunately, there are helper procedures that can save you some typing, if you have many labels or changing labels.

To illustrate very concretely, what you can do is this:

CALL db.labels() YIELD label 
WITH collect(label) AS allLabelsInDatabase
CALL db.relationshipTypes() YIELD relationshipType 
WITH allLabelsInDatabase, collect(relationshipType) AS allRelationshipTypesInDatabase 
CALL gds.graph.project('myGraph', allLabelsInDatabase, allRelationshipTypesInDatabase)
YIELD nodeCount, nodeProjection, relationshipCount, relationshipProjection, projectMillis 
RETURN *

(The above can of course be greatly simplified if you know all your labels, by hard-coding the projections as literal lists.)

If you want to also have node or relationship properties, you have to add that part as well to the projections.

Then, with this graph being projected, you can use any subset of the node labels and/or relationship types in your algorithm configurations using the nodeLabels and relationshipTypes parameters.

Hope this helps! All the best Mats

PonceOrtiz commented 1 year ago

Well , the documentation says that you can use ['*'] , then filter node labels ...not me , and I decide to use in that way just for memory resources , avoid redundance , etc.

If this feature is wrong , then change the documentation...in my opinion

Thanks

El vie., 30 de septiembre de 2022 2:43, Florentin Dörre < @.***> escribió:

I dont see why you cant have one graph projection for different graph algorithms when using explicit node declaration.

You project more than one label in the gds.graph.project call such as ['A', 'B', 'C'] and in the graph algorithms you can specify a nodeLabels which is a subset of the projected node-labels such as ['B', 'C'].

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/222#issuecomment-1263223355, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3K3526R3QM7DVYUWFGZNYDWA2KYHANCNFSM6AAAAAAQYY6JRI . You are receiving this because you were mentioned.Message ID: @.***>

Mats-SX commented 1 year ago

Is it the default value column that is confusing?

Screenshot 2022-10-11 at 18 00 24

Or is there another location in the manual where we mention the star projection?

PonceOrtiz commented 1 year ago

@Mats-SX Hi thanks for your suggestion, but it is difficult to understand how can exists recommendations about how not use GDS , maybe there is a Best Practices GDS guide that I´dont read yet. Resources are finite , I understood to have one Graph Projection then filter the nodes that I need for every algorithm. Well I´ll appreciate any reference for best practices using GDS Thanks in advance

PonceOrtiz commented 1 year ago

Hola @FlorentinD , I just have to say that there are philosophical discussions about "What it is a bug" but let me share some definition about it : a bug is not to have the expected behavior or execution (error raised) I don't have the answer why one version handle an error and other no , in this case the reason could be "bad documentation" in GDS 2.12 I just applied the workaround using explicit Node load at graph projection.

What I am worry about it ? some calculations in Enterprise Solutions are made based in GDS documentation , even some exercises in Neo4j Web Page use ['*'] in Nodes and RelationShips projection.

I'll be attend if any "Best practices" ara available to GDS Thanks

FlorentinD commented 1 year ago

Hello @PonceOrtiz , For why 2.2 handles this as an error, is due to us improving the validation of the user input. As 2.1 is declared EOL now and upgrading to 2.2 is recommended.

I would close this issue now as the bug is fixed in 2.2 and we wont backport this validation to 2.1.

And using * does work, I will bring up to improve our docs about * during projection will loose label information in our docs. IMO, its a fair point, that several examples out there do you use *, so we should make the implications clear.

Thanks a lot for bringing topic up!