Closed ghost closed 1 year ago
Hi @molfab! Thank you for pointing this out. You are right that the predict modes of GraphSage are not deterministic.
The predict modes are non-deterministic only because of the neighbor sampling stage. GraphSage samples neighbors because, given large enough sampleSizes, the sampled neighborhoods typically carry similar information to the full neighborhood.
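To illustrate the point about sampling, here is a minimal toy sketch in plain Python. It is not the GDS implementation: the mean aggregator, the helper names (`sampled_mean`, `spread_across_runs`) and the feature values are all assumptions made up for illustration. It only shows that the run-to-run spread of a sampled neighborhood aggregate shrinks as the sample size grows, and vanishes once the whole neighborhood is used.

```python
import random
import statistics


def sampled_mean(neighbor_features, sample_size, rng):
    """Mean of a random sample of neighbor features, drawn without replacement.

    If the sample size covers the whole neighborhood, the full mean is returned.
    """
    if sample_size >= len(neighbor_features):
        return statistics.fmean(neighbor_features)
    return statistics.fmean(rng.sample(neighbor_features, sample_size))


def spread_across_runs(neighbor_features, sample_size, runs=200, seed=0):
    """Standard deviation of the sampled mean over repeated 'prediction runs'."""
    rng = random.Random(seed)  # seeded only so this demo is repeatable
    estimates = [
        sampled_mean(neighbor_features, sample_size, rng) for _ in range(runs)
    ]
    return statistics.pstdev(estimates)


# Hypothetical one-dimensional features for the 100 neighbors of a single node.
neighbor_features = [float(i) for i in range(100)]

small = spread_across_runs(neighbor_features, sample_size=5)
large = spread_across_runs(neighbor_features, sample_size=50)
everything = spread_across_runs(neighbor_features, sample_size=100)

print(f"spread with  5 samples: {small:.3f}")
print(f"spread with 50 samples: {large:.3f}")
print(f"spread with all neighbors: {everything:.3f}")
```

The spread for 50 samples comes out well below the spread for 5 samples, which is the intuition behind trying larger sample sizes.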
Have you observed poor performance of the downstream ML model? If so, I'd suggest trying to increase sampleSizes. Please let us know whether that helps or the problem persists; it would be valuable feedback.
We have some internal benchmarks, for example on the CORA dataset, that show good results despite the non-deterministic embeddings. It could of course be different for your dataset.
Closing this issue as there's no more followup at this stage.
Problem
gds.beta.graphSage.stream (or gds.beta.graphSage.mutate) yields inconsistent embeddings for the same node between runs. In the specific case, given a node n, I have a nonzero probability of obtaining noticeably different embeddings for the same node in two consecutive runs.
To Reproduce
GDS version: 2.1.12
Neo4j version: 4.4.11
Operating system: Debian (running in Docker)
Steps to reproduce the behavior:
n
t
:n
Across runs: sometimes the embeddings of node n are the same in two consecutive runs, sometimes they are different. There is no deterministic behaviour. As I am using these embeddings in an ML pipeline, I need the embedding of a node n to always be the same if the graph and the model do not change.
Am I doing something wrong or is there a bug?
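For anyone wanting to quantify how much the embeddings move between two runs, here is a rough sketch of one way to compare them. It assumes the streamed results of each run have already been collected into a dict mapping node id to embedding; the helper names, the threshold, and the sample vectors are hypothetical, not output from GDS.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def unstable_nodes(run_a, run_b, threshold=0.999):
    """Node ids present in both runs whose similarity falls below the threshold."""
    shared = run_a.keys() & run_b.keys()
    return sorted(
        node
        for node in shared
        if cosine_similarity(run_a[node], run_b[node]) < threshold
    )


# Made-up embeddings standing in for two consecutive prediction runs.
run_a = {1: [0.5, 0.5], 2: [1.0, 0.0]}
run_b = {1: [0.5, 0.5], 2: [0.0, 1.0]}

changed = unstable_nodes(run_a, run_b)
print(changed)
```

With the made-up data above, only node 2 is reported, since its two embeddings are orthogonal while node 1 is unchanged.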