HinSAGE link prediction for heterogeneous graphs with more than 2 node types

sophiakrix commented 2 years ago

Description

Enable the HinSAGE model to do link prediction on a heterogeneous graph with more than 2 node types and multiple edge types.

User Story

I am trying to predict links in a heterogeneous (custom) graph and want to use the GraphSAGE implementation for heterogeneous graphs (HinSAGE) for this. Both the HinSAGE link prediction tutorial and the documentation for unsupervised node feature learning with HinSAGE state that only two head node types are accepted.

How can I use the existing HinSAGELinkGenerator to predict links of different edge types between different node type pairs? In my example, I've got a knowledge graph consisting of drugs, diseases, proteins, functions and side effects.

To give better insight into what I am trying to achieve, I would like to predict:

which type of link exists between node type drug and node type disease (out of multiple edge types: increasing, decreasing)
which type of link exists between node type drug and node type protein (out of multiple edge types: inhibiting, activating, etc.)

The documentation says:

"one approach to obtain embeddings for all nodes in a heterogeneous graph would be to run this model separately for each node type"

Could you elaborate on this? Does it mean that I would have to train 5 different models (one for each of the node types disease, drug, protein, side effect and function), each of which can predict multiple edge types, or that I have to train one model per edge type?
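To make the question concrete, here is a minimal sketch of one possible setup: partition the edges by (source type, target type) pair and treat the edge types within each pair as the classes of a separate multi-class problem. This reading is an assumption, not something the documentation confirms. The toy data and the helper names `partition_by_type_pair` and `class_labels` are hypothetical, and the sketch uses plain Python rather than any stellargraph API.

```python
from collections import defaultdict

# Hypothetical toy data: node id -> node type, and typed edges (source, target, edge type).
node_types = {
    "drugA": "drug",
    "drugB": "drug",
    "dis1": "disease",
    "prot1": "protein",
}
edges = [
    ("drugA", "dis1", "increasing"),
    ("drugB", "dis1", "decreasing"),
    ("drugA", "prot1", "inhibiting"),
    ("drugB", "prot1", "activating"),
]


def partition_by_type_pair(edges, node_types):
    """Group edges by their (source type, target type) pair."""
    groups = defaultdict(list)
    for src, dst, edge_type in edges:
        groups[(node_types[src], node_types[dst])].append((src, dst, edge_type))
    return dict(groups)


def class_labels(group):
    """Map each edge in one group to an integer class over that group's edge types."""
    classes = sorted({edge_type for _, _, edge_type in group})
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[edge_type] for _, _, edge_type in group]


groups = partition_by_type_pair(edges, node_types)
for pair, group in groups.items():
    classes, labels = class_labels(group)
    print(pair, classes, labels)
```

Under this assumption, each node-type pair would get its own generator and model, with that pair as the two head node types and the edge-type classes as targets.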

Would I have to split the graph into subgraphs of two node types each? I ask because I created a StellarGraph object from a heterogeneous graph with 5 node types and then got the following error:

generator = HinSAGELinkGenerator(
    G, batch_size, num_samples, head_node_types=['disease', 'protein']
)
train_gen = generator.flow(edgelist_train, labels_train, shuffle=True)
test_gen = generator.flow(edgelist_test, labels_test)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-54bc7bc4aff5> in <module>
      2     G, batch_size, num_samples, head_node_types=['disease', 'protein']
      3 )
----> 4 train_gen = generator.flow(edgelist_train, labels_train, shuffle=True)
      5 test_gen = generator.flow(edgelist_test, labels_test)

/opt/anaconda3/envs/env_stellargraph/lib/python3.6/site-packages/stellargraph/mapper/sampled_link_generators.py in flow(self, link_ids, targets, shuffle, seed)
    154                 ):
    155                     raise ValueError(
--> 156                         f"Node pair ({src}, {dst}) not of expected type ({expected_src_type}, {expected_dst_type})"
    157                     )
    158 

ValueError: Node pair (GO:0044237, GO:0061018) not of expected type (disease, protein)
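The message suggests that the edge list passed to `flow` contains at least one pair whose node types differ from the requested head node types (the GO identifiers are presumably function nodes, which is an assumption). A minimal sketch of checking for this before calling `flow` is shown below. It uses a plain dict as a stand-in for the graph's node-type lookup, and the helper name `split_by_head_types` and the toy data are hypothetical.

```python
# Hypothetical toy data standing in for the real graph and edge list.
node_types = {
    "D1": "disease",
    "P1": "protein",
    "GO:0044237": "function",
    "GO:0061018": "function",
}
edgelist = [("D1", "P1"), ("GO:0044237", "GO:0061018")]
labels = [1, 0]


def split_by_head_types(edgelist, labels, node_types, head_node_types):
    """Separate (edge, label) pairs into those matching head_node_types and the rest.

    The comparison is ordered: (disease, protein) does not match (protein, disease).
    """
    expected = tuple(head_node_types)
    kept, rejected = [], []
    for (src, dst), label in zip(edgelist, labels):
        actual = (node_types[src], node_types[dst])
        (kept if actual == expected else rejected).append(((src, dst), label))
    return kept, rejected


kept, rejected = split_by_head_types(
    edgelist, labels, node_types, ["disease", "protein"]
)
print("matching pairs:", kept)
print("pairs of other types:", rejected)
```

Only the matching pairs and their labels would then be passed to the generator. With a real StellarGraph object, the dict lookup would be replaced by whatever node-type lookup the graph provides.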

Would be great to get help on this!

howDareYouSayThat commented 2 years ago

Hi, have you found a solution for this?

Adityanr commented 1 year ago

Is this resolved?

parisahjb commented 1 year ago

I got the same error today. It seems that stellargraph doesn't support heterogeneous graphs with more than two node types.

stellargraph / stellargraph

HinSAGE link prediction for heterogeneous graphs with more than 2 node types #2023
