Closed rusty1s closed 2 years ago
Cool :) I thought about extending the "Loading CSV" tutorial to showcase how one would apply a GNN on this one. I already started integrating the random link split behaviour, see here. The next task would be to create a heterogeneous GNN model, and train it in a supervised fashion against ratings in the training set. WDYT?
Ah yes, that is indeed a nice start already! :)
This example feels a bit different from "typical" link prediction problems, in that I don't think you can really use a contrastive loss with negative edges: a missing edge in this graph just means we want to predict what rating there should be for each edge of type ('user', 'rates', 'movie'). So we don't want to train the algorithm to separate "likely" from "unlikely" edges. I think that's fine though; I see this as an edge classification problem, and it seems a relevant example, reminiscent of predicting how users would rate products on online stores.
Can you check that my plan for this fits your idea about what you would like?
- A GNN with 2 layers
- A DistMult decoder to get scores for each edge label (from 0 to 5), so here I am - kind of - treating the edge labels as six different edge types. If DistMult doesn't do the trick, I could try a bilinear, RESCAL-type decoder.
- A softmax on the 6 scores I got from the decoder per training supervision edge, to get something that "looks like" class probabilities; pick the class with the highest probability and use a loss suitable for a multiclass problem statement (such as torch.nn.NLLLoss).

I'm happy to try this approach, I just wanted to check if you already had some kind of plan that is quite different from mine, so we don't waste too much time. Thanks! :)
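To make the decoder idea concrete, here is a minimal sketch of the DistMult-style scoring I have in mind: one learnable relation vector per rating class, a softmax over the 6 class scores per supervision edge, and an NLL loss. All shapes and names (`hidden`, `distmult_scores`, the fake encoder outputs) are illustrative assumptions, not the final PR code.

```python
import torch
import torch.nn.functional as F

num_classes = 6   # ratings 0..5, treated as 6 "edge types"
hidden = 16       # embedding size assumed to come from the GNN encoder

# One learnable relation vector per rating class (DistMult uses a
# diagonal relation matrix, i.e. a vector, per relation).
rel = torch.nn.Parameter(torch.randn(num_classes, hidden))

def distmult_scores(z_user, z_movie):
    # DistMult: score_r(u, m) = sum_d z_u[d] * R_r[d] * z_m[d]
    # Broadcasting over classes gives one score per rating class
    # for every supervision edge: shape [num_edges, num_classes].
    return (z_user.unsqueeze(1) * rel * z_movie.unsqueeze(1)).sum(dim=-1)

# Stand-in encoder output for 4 supervision edges (illustrative only).
z_user = torch.randn(4, hidden)
z_movie = torch.randn(4, hidden)
target = torch.tensor([0, 3, 5, 2])          # ground-truth ratings

scores = distmult_scores(z_user, z_movie)    # [4, 6]
log_probs = F.log_softmax(scores, dim=-1)    # softmax over the 6 classes
loss = F.nll_loss(log_probs, target)         # multiclass NLL loss
pred = log_probs.argmax(dim=-1)              # predicted rating per edge
```

A bilinear (RESCAL-type) decoder would just replace the per-class vector `rel` with a full `hidden x hidden` matrix per class.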
You are right, it's more of an edge classification problem in which no negative sampling is needed. Nonetheless, the model should be able to predict the ratings of unknown users/movies. Your approach sounds correct, and matches with the one I have in mind. Let me know how it goes :)
Alright, just a quick status update: I've put something together and it is learning, but the performance is not amazing, so I want to improve it a bit. The average test accuracy reaches about 40% after 400 epochs, which I guess is better than random for a 6-class problem, but there are a few things I want to try to make it better before sharing it.
I'm afraid I only have time to do this in my evenings so progress is perhaps a bit slow. Hope that's ok.
Sure, please feel free to submit a PR early, so I can help with it :)
Just a quick message to let you know I’m on holiday. I’ll get back to this again next week.
No worries, we are not in a rush. Enjoy your free days :)
Hello, I am now also studying how to make link predictions on heterogeneous graphs. I would like to ask how your project is progressing. Thank you very much for your reply.
@liyongkang123: Many apologies for the slow reply! I was on holiday and when I got back work was busy. You can see the progress I've last shared here, but I'm not finished yet. I've got a generalisation issue (i.e. the model works well on the training set but performs poorly on the validation and test sets), and have just started working again on fixing that.
Is there anything specific that you want to find out about?
Hello! I'm thinking of picking this one up, if that's helpful. Did you have a specific dataset to use in mind already?