neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

The same GraphSage model gives different results between the runs. #221

Closed ghost closed 1 year ago

ghost commented 1 year ago

Problem

To Reproduce

GDS version: 2.1.12
Neo4j version: 4.4.11
Operating system: Debian (running in Docker)

Steps to reproduce the behavior:

Across runs, the embedding of a node n is sometimes the same in two consecutive runs and sometimes different; the behaviour is not deterministic. Since I use these embeddings in an ML pipeline, I need the embedding of a node n to always be the same as long as the graph and the model do not change.

Am I doing something wrong or is there a bug?

breakanalysis commented 1 year ago

Hi @molfab! Thank you for pointing this out. You are right that the predict modes of GraphSage are not deterministic. The only reason is the neighbor sampling stage. GraphSage uses sampling because, provided the sampleSizes are large enough, the sampled neighborhoods typically carry similar information to the full neighborhoods.
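
To illustrate the point about neighbor sampling, here is a minimal Python sketch. It is not the GDS implementation: the toy graph, the feature values, and the helper `sampled_mean_embedding` are all hypothetical, and a plain mean over sampled neighbors stands in for a GraphSage layer. It only shows that sampling with an unseeded generator can change the result between runs, while a fixed seed in this toy setting makes it repeatable.

```python
import random


def sampled_mean_embedding(features, neighbors, node, sample_size, rng):
    """Average the feature vectors of a random sample of `node`'s neighbors.

    Hypothetical stand-in for one sampling + mean-aggregation step;
    not the GDS GraphSage code.
    """
    nbrs = neighbors[node]
    k = min(sample_size, len(nbrs))
    sample = rng.sample(nbrs, k)  # sampling without replacement
    dim = len(features[node])
    return [sum(features[n][d] for n in sample) / k for d in range(dim)]


# Toy data (assumed for illustration): node 0 has six neighbors.
neighbors = {0: [1, 2, 3, 4, 5, 6]}
features = {i: [float(i), float(i * i)] for i in range(7)}

# Unseeded generators: the two results may or may not coincide.
run_a = sampled_mean_embedding(features, neighbors, 0, 3, random.Random())
run_b = sampled_mean_embedding(features, neighbors, 0, 3, random.Random())

# Same seed in both runs: the same neighbors are sampled, so results match.
seeded_a = sampled_mean_embedding(features, neighbors, 0, 3, random.Random(42))
seeded_b = sampled_mean_embedding(features, neighbors, 0, 3, random.Random(42))

# Sample size covers the whole neighborhood: nothing is left to chance here.
full = sampled_mean_embedding(features, neighbors, 0, 6, random.Random())
```

In this sketch `run_a` and `run_b` can differ because each call draws a different subset of three neighbors, which mirrors the behaviour reported in the issue.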

Have you observed poor performance of the downstream ML model? If so, I'd suggest trying to increase sampleSizes. Please let us know whether that helps; feedback on whether the problem persists would be valuable.
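
As a rough illustration of why larger sample sizes are suggested, the following Python sketch measures how much a sampled mean varies between runs for different sample sizes. It is a toy model under stated assumptions, not a GDS benchmark: the 100 neighbor values, the helper names, and the choice of a plain mean as the aggregate are all hypothetical.

```python
import random
import statistics


def sampled_mean(values, sample_size, rng):
    """Mean of a random sample (without replacement) of neighbor values."""
    k = min(sample_size, len(values))
    return statistics.fmean(rng.sample(values, k))


def run_to_run_spread(values, sample_size, runs, seed):
    """Standard deviation of the sampled mean across repeated runs."""
    rng = random.Random(seed)  # seeded only so the experiment is repeatable
    estimates = [sampled_mean(values, sample_size, rng) for _ in range(runs)]
    return statistics.pstdev(estimates)


# Assumed toy neighborhood: 100 neighbors with scalar features 0..99.
neighbor_values = [float(v) for v in range(100)]

spreads = {
    k: run_to_run_spread(neighbor_values, k, runs=200, seed=0)
    for k in (5, 25, 100)
}

for k, spread in spreads.items():
    print(f"sample size {k:>3}: run-to-run spread {spread:.3f}")
```

In this toy setting the spread shrinks as the sample size grows and reaches zero once the sample covers the whole neighborhood. Whether the same trend is strong enough to help a real downstream model depends on the dataset.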

We have some internal benchmarks, for example on the CORA dataset, that show good results are obtained despite the non-deterministic embeddings. It could of course be different for your dataset.

brs96 commented 1 year ago

Closing this issue as there has been no further follow-up at this stage.