neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Fastrp embeddings #234

Closed meg261995 closed 1 year ago

meg261995 commented 1 year ago

Hi,

I am trying to solve a multiclass classification problem with FastRP embeddings as input features. The problem I am facing is that the embeddings of different classes are very close to each other in cosine similarity, and as a result my model performs very poorly on unseen data. I'm trying to include as many nodeProperties as possible that will help distinguish between the classes, but it doesn't seem to work. I do not think my graph being heterogeneous is the issue, as FastRP supports this. What am I missing?

brs96 commented 1 year ago

Hi @meg261995 thanks for getting in touch!

A few suggestions:

  1. Set propertyRatio = 1.0, and set the random seed wherever it is an optional parameter (https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/#node-embeddings-generalization). This ensures that the new nodes you predict on have their embeddings initialized in the same way as the train and test nodes, which is a requirement for using embeddings in an inductive setting.
  2. While passing in a heterogeneous graph works for FastRP, the algorithm does not have the heterogeneous trait (https://neo4j.com/docs/graph-data-science/current/introduction/#introduction-algorithms-heterogeneous). That means it will not treat nodes differently based on their labels.
  3. On including as many nodeProperties as possible: this works well only if the properties are representative of the classes and discriminate between them, which requires careful feature engineering. If studying the data shows that different classes exhibit distinctive graph-theoretic properties, then computing those properties with our graph algorithms as a pre-processing step could help too (https://neo4j.com/docs/graph-data-science/current/machine-learning/pre-processing/).

I hope this helps. Let us know if you have any other questions.

Brian

meg261995 commented 1 year ago

Okay, let me know if I am right. Suppose, for example, we have a graph:

(:Person)-[:BELONGS_TO]->(:Country)
(:Person)-[:WORKS_IN]->(:Company)
(:Person)-[:LIKES]->(:Food)
(:Country)-[:STAPLE]->(:Food)
(:Company)-[:HEAD_QUATERS]->(:Country)

So there are 4 node labels in total, and each label has many nodes with different values. Now, if I want to generate FastRP embeddings for :Person using only the subgraph (:Person)-[:BELONGS_TO]->(:Country) and (:Person)-[:WORKS_IN]->(:Company): 1) Would the algorithm not differentiate between :Person and :Country nodes based on their separate values? 2) Even within the :Country label, are all values treated the same?

Is that the case even if we pass embeddings of these values in featureProperties, something like this?

CALL gds.fastRP.stream('name', {
  embeddingDimension: n,
  nodeLabels: ['Person', 'Company', 'Country'],
  relationshipTypes: ['BELONGS_TO', 'WORKS_IN'],
  featureProperties: ['person_name_embedding', 'company_name_embedding', 'country_name_embedding'],
  propertyRatio: 0.2
})

meg261995 commented 1 year ago

To add: I am not using the node classification pipeline in Neo4j. After the FastRP embeddings are generated, I use them as input features for classical ML outside Neo4j.

brs96 commented 1 year ago

On both questions: That's right, the algorithm will treat all the nodes with different labels the same way.

However, the featureProperties that you pass in and the propertyRatio that you set influence the initial embedding of each node. If propertyRatio = 0, FastRP is based purely on the topology of the graph. If propertyRatio = 1, it uses only the given properties for the initial embedding, which is then iteratively updated based on topology. For values strictly between 0 and 1, the initial embedding is a mix of the two.
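As a rough illustration of how propertyRatio divides the initial embedding, the Python sketch below applies the formula quoted from the documentation later in this thread, propertyDimension = embeddingDimension * propertyRatio. The helper name split_dimensions is hypothetical, and truncating the product to an integer is an assumption made for this sketch, not something confirmed here about the library.

```python
def split_dimensions(embedding_dimension: int, property_ratio: float) -> tuple[int, int]:
    """Return (random_dimension, property_dimension) for the initial embedding.

    Hypothetical helper: truncating to an integer is an assumption of this sketch.
    """
    if not 0.0 <= property_ratio <= 1.0:
        raise ValueError("property_ratio must be in [0, 1]")
    property_dimension = int(embedding_dimension * property_ratio)
    random_dimension = embedding_dimension - property_dimension
    return random_dimension, property_dimension


# Compare the purely topological case, a mix, and the purely property-based case.
for ratio in (0.0, 0.2, 1.0):
    random_dim, property_dim = split_dimensions(128, ratio)
    print(f"propertyRatio={ratio}: random={random_dim}, property={property_dim}")
```

With propertyRatio = 0 no slot is filled from properties, and with propertyRatio = 1 no slot is random.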

Hope this helps.

Out of curiosity, is there any specific reason that you prefer to export the embeddings to another ML library instead of leveraging the pipelines?

meg261995 commented 1 year ago

Yes, this helps a lot. Thanks! And I actually found this method simpler. I could not find a proper example that illustrates training and prediction on a heterogeneous graph. It was a bit confusing to figure out the train and test splits, since mine is complex e-commerce data.

meg261995 commented 1 year ago

So if the features in featureProperties properly distinguish different classes and with propertyRatio = 1, I would be able to get meaningful embeddings that'd tell the classes apart.

meg261995 commented 1 year ago

To add: could you please clarify what you meant by 'If propertyRatio = 1, then it will use only the given properties as initial embedding, and iteratively update based on topology'? The documentation says propertyDimension = embeddingDimension * propertyRatio, which means that if propertyRatio = 1, all the embedding dimensions will be based on featureProperties, not just the initial ones.

Mats-SX commented 1 year ago

Hello @meg261995.

You are correct to conclude that FastRP in and of itself does not support heterogeneous graphs. It does not have any mechanism to tell different nodes apart based on some type information. The same holds for relationships of different types. However, as Brian alludes to, FastRP can implicitly tell nodes apart based on their property values, provided that there are such distinguishing property values to use. This implicit mechanism is not bullet-proof, but it is effective. It does, however, require that a property value exist on all nodes in order to make use of this quality.

For example, you might have node types X and Y with property a. If a is only defined for X (e.g. a customer email for customer nodes, which does not exist for product nodes), you need to fill it in for Y yourself, for example with a defaultValue. If a is defined for both X and Y and contains values that correlate with the X and Y identities (for example a in [0, 100] for X and a in [-100, 0] for Y), then this will be captured in the embedding.

(Note that categorical properties should be one-hot encoded so that their numerical values are not interpreted geometrically.)
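The two preparation steps described above, filling in a missing property with a default and one-hot encoding a categorical property, could be sketched in plain Python as follows. This is only an illustration outside of GDS: the node records, the property names email_length and segment, and the helper functions are all hypothetical.

```python
import math

# Hypothetical node records; "email_length" exists only on customers.
nodes = [
    {"label": "Customer", "email_length": 18.0, "segment": "gold"},
    {"label": "Customer", "email_length": 23.0, "segment": "silver"},
    {"label": "Product", "segment": "bronze"},
]


def fill_default(records, key, default_value):
    """Give every record a value for `key`, using a default where it is missing."""
    return [{**record, key: record.get(key, default_value)} for record in records]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with one slot per category."""
    return [1.0 if value == category else 0.0 for category in categories]


def distance(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.dist(u, v)


filled = fill_default(nodes, "email_length", 0.0)

categories = ["bronze", "silver", "gold"]
# Integer codes impose an order: bronze ends up closer to silver than to gold.
codes = {category: [float(i)] for i, category in enumerate(categories)}
# One-hot vectors keep every pair of distinct categories equally far apart.
encoded = {category: one_hot(category, categories) for category in categories}

print(distance(codes["bronze"], codes["silver"]), distance(codes["bronze"], codes["gold"]))
print(distance(encoded["bronze"], encoded["silver"]), distance(encoded["bronze"], encoded["gold"]))
```

The first printed pair differs while the second pair is equal, which is the geometric distortion that one-hot encoding avoids.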

With that in mind, I think your statement is correct:

So if the features in featureProperties properly distinguish different classes and with propertyRatio = 1, I would be able to get meaningful embeddings that'd tell the classes apart.

And for your last question:

FastRP begins by creating an initial embedding for each node. This is a vector of floating-point numbers of size embeddingDimension. In the default case, we follow the original paper and generate these numbers randomly. But if we set propertyRatio > 0, then a portion of them are not random, but based on values read from the featureProperties. In the extreme case propertyRatio = 1 the initial embedding is completely deterministic based on the data values.
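A simplified Python sketch of such an initial embedding is shown below. It assumes that the property-based part is a sum of one fixed random vector per feature property, weighted by the node's property value, and that the random part is drawn per node. These details, the function names, and the property names age and income are assumptions for illustration and not the GDS implementation.

```python
import random


def make_property_vectors(feature_properties, property_dimension, seed):
    """Create one fixed random vector per feature property, shared by all nodes."""
    rng = random.Random(seed)
    return {
        name: [rng.choice((-1.0, 0.0, 1.0)) for _ in range(property_dimension)]
        for name in feature_properties
    }


def initial_embedding(node_properties, property_vectors, random_dimension,
                      property_dimension, node_rng):
    """Concatenate a per-node random part with a property-based part."""
    random_part = [node_rng.choice((-1.0, 0.0, 1.0)) for _ in range(random_dimension)]
    property_part = [0.0] * property_dimension
    for name, vector in property_vectors.items():
        weight = node_properties[name]  # every node must have a value here
        for i, component in enumerate(vector):
            property_part[i] += weight * component
    return random_part + property_part


feature_properties = ["age", "income"]  # hypothetical property names
vectors = make_property_vectors(feature_properties, property_dimension=8, seed=42)
node_rng = random.Random(7)

# propertyRatio = 1: no random slots, so equal property values give equal vectors.
a = initial_embedding({"age": 0.3, "income": 1.5}, vectors, 0, 8, node_rng)
b = initial_embedding({"age": 0.3, "income": 1.5}, vectors, 0, 8, node_rng)
print(a == b)
```

In this sketch, two nodes with the same property values receive identical initial embeddings when there is no random part, which is the deterministic extreme case described above.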

Then, FastRP iterates the configured number of times. In each iteration, a node will influence the embeddings of its neighbours, using the configured iterationWeights and its current embedding, starting with the initial embedding. This influence follows the topology of the graph -- nodes will only influence their neighbours. The configured iterationWeights determine a node's reach across the graph. By default FastRP uses three iterations.

So, you are correct in that all the embeddings will be based on values taken from the configured featureProperties, but they will be modified during the iterations into a resulting embedding. Consequently, the resulting embedding is a function of neighbouring property values as well as a node's own property values. So there are two effects at play:

  1. nodes with similar property values will have similar initial embeddings, and influence their respective neighbourhoods similarly.
  2. nodes with similar neighbourhoods will be influenced similarly to each other.

The result is that nodes that have similar property values and similar neighbourhoods will obtain the most similar output embeddings. The meaning of "similar neighbourhood" changes a bit with the value of propertyRatio. When propertyRatio = 0, neighbourhoods are similar if they have a similar topology (e.g. a star, a clique, a tree, etc). When propertyRatio = 1, neighbourhoods are similar if they have a similar topology AND the nodes in the respective positions have similar property values. For example, if the center node in a star neighbourhood has very different property values to the center node in another star neighbourhood, then these may result in different embeddings.
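The star example can be played through with a small Python sketch. It is a deliberately simplified propagation (average the neighbours' current embeddings, normalize, then add the result scaled by that iteration's weight) and leaves out details of the real algorithm. The graph, the two-dimensional initial embeddings, and the three iteration weights are chosen for illustration only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def propagate(neighbours, initial, iteration_weights):
    """Simplified FastRP-style propagation over an adjacency dictionary."""
    dim = len(next(iter(initial.values())))
    current = {node: list(vector) for node, vector in initial.items()}
    result = {node: [0.0] * dim for node in initial}
    for weight in iteration_weights:
        updated = {}
        for node, adjacent in neighbours.items():
            if adjacent:
                mean = [sum(current[m][i] for m in adjacent) / len(adjacent)
                        for i in range(dim)]
            else:
                mean = [0.0] * dim
            updated[node] = normalize(mean)
        current = updated
        for node in result:
            result[node] = [r + weight * c for r, c in zip(result[node], current[node])]
    return result


def star(center, leaves):
    """Undirected star: the center is linked to every leaf."""
    adjacency = {center: list(leaves)}
    for leaf in leaves:
        adjacency[leaf] = [center]
    return adjacency


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Two stars with identical topology but different property values at the center.
neighbours = {**star("cA", ["a1", "a2", "a3"]), **star("cB", ["b1", "b2", "b3"])}
initial = {node: [1.0, 1.0] for node in neighbours}
initial["cA"] = [1.0, 0.0]
initial["cB"] = [0.0, 1.0]

embeddings = propagate(neighbours, initial, [0.0, 1.0, 1.0])
print(cosine(embeddings["a1"], embeddings["a2"]))  # leaves of the same star
print(cosine(embeddings["a1"], embeddings["b1"]))  # leaves of different stars
```

Leaves of the same star end up with the same embedding, while leaves of the two stars are pulled apart by the differing center values even though both neighbourhoods have the same shape.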

I hope this helps!

Mats

Mats-SX commented 1 year ago

As this issue does not indicate any bug or problem with the GDS product, I will close the issue.

meg261995 commented 1 year ago

Thanks :)