neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
596 stars 157 forks source link

243 add min community size to non lpa community detection algorithms #245

Closed airtyon closed 1 year ago

airtyon commented 1 year ago

in this PR:

Closes: https://github.com/neo4j/graph-data-science/issues/243

FlorentinD commented 1 year ago

Thanks for working on this and opening the draft @airtyon :pray:. I can imagine where you problem might be but its a bit guessing as I dont know what kind of test you wrote for Louvain Stream.

So what CommunityProcCompanion does it when applying the filter - return the community id if valid and otherwise null (org.neo4j.gds.CommunityProcCompanion.CommunitySizeFilter#value). In the JavaDoc it also states, that this prevents it being written back in the write mode. This whoever will still lead to returning a value for each node. In the stream mode you still create the result stream from LongStream.range(0, graph.nodeCount()) (org/neo4j/gds/louvain/LouvainStreamProc.java:91).

Assuming you might have simply checked for the size of the stream, this would explain your confusion. I would think of adding a post-filter to the result stream, so we dont include null values. Especially as for these we can also skip retrieving the intermediate node ids.

airtyon commented 1 year ago

hey @FlorentinD, thank you for the great feedback :)

I will try to explain my confusion so far.
the documentation for minCommunitySize states:

Only community ids of communities with a size greater than or equal to the given value are written to Neo4j.

my first thoughts were, how would minCommunitySize in Stream mode fit the description above if it's related to writing or we adapt the documentation to say read from.

regarding the tests I wanted to extend for the Stream modes, I followed the same strategy as with Write modes. my expectation was that any time I have a graph written without the minCommunitySize attribute, when I Stream it with the minCommunitySize to retrieve the communities with a bigger size the attribute value.

I'm pretty sure now that this is not the right way to extend the test and I would love your insight about this.

thank you so much again.

FlorentinD commented 1 year ago

For the documentation - the description is tailored to the write mode, as it was only exposed there. Now that we add it for stream as well, I would say lets adjust the message in the stream mode docs. How about Only nodes inside communities larger or equal the given value are returned?

Your testing approach sounds pretty valid to me. You can directly use the result stream from the algorithm instead of querying the graph though, as the stream mode does not write it back to the db afterall.

Hope I could clear your confusion :)

FlorentinD commented 1 year ago

Add minCommunitySize to non-LPA community detection algorithms, for stream & write modes

airtyon commented 1 year ago

I enjoyed working on this issue. thank you for your support :)

FlorentinD commented 1 year ago

Likewise. Especially, that we could also find a bug along the way.

I will close this PR as its cherry picked and merged (https://github.com/neo4j/graph-data-science/commit/79e9ade413e0dd58ec85edc734886383533fe4a5). You can now build the jar locally if you want to use the feature immediately or wait for the next minor release.