Closed airtyon closed 1 year ago
Thanks for working on this and opening the draft @airtyon :pray:. I can imagine where you problem might be but its a bit guessing as I dont know what kind of test you wrote for Louvain Stream.
So what CommunityProcCompanion
does it when applying the filter - return the community id if valid and otherwise null
(org.neo4j.gds.CommunityProcCompanion.CommunitySizeFilter#value).
In the JavaDoc it also states, that this prevents it being written back in the write mode.
This whoever will still lead to returning a value for each node.
In the stream mode you still create the result stream from LongStream.range(0, graph.nodeCount())
(org/neo4j/gds/louvain/LouvainStreamProc.java:91).
Assuming you might have simply checked for the size of the stream, this would explain your confusion.
I would think of adding a post-filter to the result stream, so we dont include null values. Especially as for these we can also skip retrieving the intermediate
node ids.
hey @FlorentinD, thank you for the great feedback :)
I will try to explain my confusion so far.
the documentation for minCommunitySize
states:
Only community ids of communities with a size greater than or equal to the given value are
written
to Neo4j.
my first thoughts were, how would minCommunitySize
in Stream mode fit the description above if it's related to writing or we adapt the documentation to say read from
.
regarding the tests I wanted to extend for the Stream modes, I followed the same strategy as with Write modes. my expectation was that any time I have a graph written without the minCommunitySize
attribute, when I Stream it with the minCommunitySize
to retrieve the communities with a bigger size the attribute value.
I'm pretty sure now that this is not the right way to extend the test and I would love your insight about this.
thank you so much again.
For the documentation - the description is tailored to the write mode, as it was only exposed there. Now that we add it for stream as well, I would say lets adjust the message in the stream mode docs. How about Only nodes inside communities larger or equal the given value are returned
?
Your testing approach sounds pretty valid to me. You can directly use the result stream from the algorithm instead of querying the graph though, as the stream mode does not write it back to the db afterall.
Hope I could clear your confusion :)
I enjoyed working on this issue. thank you for your support :)
Likewise. Especially, that we could also find a bug along the way.
I will close this PR as its cherry picked and merged (https://github.com/neo4j/graph-data-science/commit/79e9ade413e0dd58ec85edc734886383533fe4a5). You can now build the jar locally if you want to use the feature immediately or wait for the next minor release.
in this PR:
minCommunitySize
parameter to:Closes: https://github.com/neo4j/graph-data-science/issues/243