pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Loading data from dataframe to bipartite graph #1999

Open LucasSDresl opened 3 years ago

LucasSDresl commented 3 years ago

❓ Questions & Help

I am new to pytorch_geometric and I am trying to use the BipartiteData class to load data from a dataframe that looks something like this:

(screenshot of a dataframe with customer_id, vendor_id and weight_of_edge columns)

weight_of_edge: how many times customer_id ordered from vendor_id

I wanted to know if I am passing the data from my dataframe to the BipartiteData class correctly.

These are my variables defined from the dataframe to feed BipartiteData:

weight = torch.Tensor(df['weight_of_edg'].values).long()
customer_id = torch.Tensor(df['customer_id'].values).long()
vendor_id = torch.Tensor(df['vendor_id'].values).long()
edge_index = torch.Tensor(np.vstack((customer_id, vendor_id))).long()

Finally, I passed it this way:

data = BipartiteData(edge_index, customer_id, vendor_id)
data.edge_attr = weight

Is this okay? Thank you very much! Keep up the excellent work :)

rusty1s commented 3 years ago

That looks mostly correct to me except for a few things:

  1. I suggest converting weight to a float tensor.
  2. Your customer and vendor ids should ideally start from 0 and go up to N - 1 (where N denotes the number of unique customers/vendors). You can convert your ids to such a format via torch.unique(return_inverse=True), as sketched below.
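
A minimal sketch of that re-indexing step (the raw id values here are made up for illustration):

import torch

customer_id = torch.tensor([17, 42, 17, 99])                     # raw ids from the dataframe
_, customer_idx = torch.unique(customer_id, return_inverse=True)
print(customer_idx)                                              # tensor([0, 1, 0, 2])
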
LucasSDresl commented 3 years ago

My corrections:

  1. weight = (torch.Tensor(df['weight_of_edg'].values)).float()
  2. c, cidx = torch.unique(input=customer_id, return_inverse=True) v, vidx = torch.unique(input=vendor_id, return_inverse=True) edge_index = torch.Tensor((np.vstack((cidx, vidx)))).long() x_s = cidx.unique() x_t = vidx.unique() data = BipartiteData(edge_index, x_s=x_s, x_t=x_t) data.edge_attr = weight

Does this look correct? If I want to pass node features, how should I pass them? Could you please point me to an example (if you have one) of how to feed bipartite data into some architecture?

rusty1s commented 3 years ago

Depending on the size of your data, a natural way to handle feature-less graphs is to encode the feature matrix as an identity matrix. Does that work for you?

LucasSDresl commented 3 years ago

The size of the data is not big (I am creating random data for learning). I was thinking of passing 4-5 features for vendor_id and customer_id and using some GCN or GNN architecture for a bipartite graph (I didn't see an example using bipartite graphs in the examples section). If I want to recommend some vendor to customer i, is it okay to treat the problem as a link prediction problem? How should I encode the graph as bipartite?

rusty1s commented 3 years ago

If that is the case, I would go for the identity matrix as input node feature matrix:

import torch
from torch_geometric.nn import SAGEConv

edge_index_T = torch.stack([edge_index[1], edge_index[0]], dim=0)  # Transposed/reversed graph.

data.customer = torch.eye(num_customers)  # identity matrix as node features
data.vendor = torch.eye(num_vendors)

conv1 = SAGEConv((num_customers, num_vendors), 64)
new_vendor_x = conv1((data.customer, data.vendor), edge_index).relu()

conv2 = SAGEConv((num_vendors, num_customers), 64)
new_customer_x = conv2((data.vendor, data.customer), edge_index_T).relu()

# Repeat with new_vendor_x and new_customer_x:
conv3 = SAGEConv((64, 64), 128)
new_vendor_x2 = conv3((new_customer_x, new_vendor_x), edge_index).relu()
# ...

For the final link prediction, it's a good idea to compute edge representations based on the hidden node representations after a number of convolutions:

edge_attr = torch.cat([customer_x[edge_index[0]], vendor_x[edge_index[1]]], dim=-1)
prediction = MLP(edge_attr)
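
A minimal sketch of such a decoder (the name MLP, the hidden size of 64, and the single output score are illustrative assumptions; customer_x and vendor_x are the final hidden node representations):

import torch

# Hypothetical decoder: maps the concatenated (customer, vendor) representation
# of an edge to a single score.
in_dim = edge_attr.size(-1)  # width of the concatenated edge representation
MLP = torch.nn.Sequential(
    torch.nn.Linear(in_dim, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

prediction = MLP(edge_attr).squeeze(-1)  # one score per candidate edge
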
LucasSDresl commented 3 years ago

Thanks @rusty1s! I have some questions regarding your last comment.

  1. When you say vendor_x inside edge_attr, do you mean new_vendor_x or new_vendor_x2? Here you are using the node representations to represent the edges, right?
  2. Inside data.edge_attr I was storing the magnitude of the edge (how many times customer_id ordered from vendor_id). What is the correct way to represent those weights in the edge representation? What I want from the weights is that if two customers order many times from the same vendor, those two nodes end up with more similar representations.
rusty1s commented 3 years ago
  1. Yes, you use the hidden node representations of source and target nodes to make edge predictions.
  2. You can either represent it as a continuous edge_weight in the range [0, 1] which controls the aggregation, or you can use it as an additional attribute to craft messages (see the sketch after this list):
    def message(self, x_i, x_j, edge_attr):
        return torch.cat([x_i, x_j, edge_attr], dim=-1)

    It's also a good idea to include it in the final edge representation to make predictions.
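
A minimal, self-contained sketch of such a layer (the class name, channel sizes, and the mean aggregation are illustrative assumptions, not code from this thread):

import torch
from torch_geometric.nn import MessagePassing

class EdgeAttrConv(MessagePassing):
    # Hypothetical bipartite layer that concatenates source features,
    # destination features and the edge attribute to build messages.
    def __init__(self, src_channels, dst_channels, edge_channels, out_channels):
        super().__init__(aggr='mean')
        self.lin_msg = torch.nn.Linear(src_channels + dst_channels + edge_channels, out_channels)
        self.lin_root = torch.nn.Linear(dst_channels, out_channels)

    def forward(self, x, edge_index, edge_attr):
        x_src, x_dst = x  # (customer features, vendor features)
        out = self.propagate(edge_index, x=(x_src, x_dst), edge_attr=edge_attr)
        return out + self.lin_root(x_dst)

    def message(self, x_i, x_j, edge_attr):
        # x_j: source (customer) features, x_i: destination (vendor) features.
        return self.lin_msg(torch.cat([x_i, x_j, edge_attr], dim=-1))

# Usage, assuming a one-dimensional edge weight per edge:
conv = EdgeAttrConv(src_channels=64, dst_channels=64, edge_channels=1, out_channels=64)
new_vendor_x = conv((customer_x, vendor_x), edge_index, weight.view(-1, 1)).relu()
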

vicbeldo98 commented 2 years ago

Hi @LucasSDresl, I am more or less trying to do the same as you. Did you find any example of how to train with a bipartite graph, or do you have any implementation available?

Thank you very much!

rusty1s commented 2 years ago

Have you looked into our "Heterogeneous Graph Learning" tutorial?

vicbeldo98 commented 2 years ago

I haven't seen it! That was super helpful.

However, I still don't know how to create a model with that dataset to recommend items. That link gives you three ways, but I can't figure out what num_classes is in my example. I don't want to classify, and I am a bit confused by that.

Thank you

rusty1s commented 2 years ago

You can also take a look at our heterogeneous link prediction example, which is probably more related to the task you are trying to solve. Let me know if there are any further questions.

vicbeldo98 commented 2 years ago

That is a perfect example! Thank you very much

vicbeldo98 commented 2 years ago

As I experiment with the example, more and more questions arise!

In the examples shown, how would you make a recommendation of a movie to a user? I am aware that the task is link prediction, so I guess I should make a prediction for every possible user-movie edge and take the one with the highest predicted label, but I am unsure of how to do it.

I have been looking at the way the test is made: (screenshot of the test function from the example)

You pass the model the initial embeddings, the connections of the graph, and an edge_label_index, which I am also unsure about (it seems to be the same as the connections of the graph?).

Thank you very much for the patience!

rusty1s commented 2 years ago

The edge_label_index denotes all the connections for which you want to obtain a prediction/rating for movie/user pairs. As such, for predicting new movies for a user, you would need to set edge_label_index to something similar to:

row = torch.tensor([user_id] * num_movies)
col = torch.arange(num_movies)
edge_label_index = torch.stack([row, col], dim=0)
vicbeldo98 commented 2 years ago

Following your advice, I have been able to produce predictions. Nevertheless, it always recommends the same movie (to all the users I have tried). Am I doing something wrong, or is it because the model is simple? (I also tried training for 10000 epochs, but the results are the same.)

movie_mapping = {i: idx for i, idx in enumerate(df_movies.index)}
num_movies = len(data['movie'].x)
row = torch.tensor([USERID] * num_movies)
col = torch.arange(num_movies)
edge_label_index = torch.stack([row, col], dim=0)
pred = model(data.x_dict, data.edge_index_dict, edge_label_index)
pred = pred.clamp(min=0, max=5)
idx_max = torch.argmax(pred)
movieId = movie_mapping[int(idx_max)]

Thank you very much for your help

rusty1s commented 2 years ago

What happens if you look at the topk of movie predictions?
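
For reference, a minimal sketch of inspecting the top-k predictions (k=10 is arbitrary), reusing the pred and movie_mapping variables from the snippet above:

top_values, top_indices = torch.topk(pred, k=10)
top_movie_ids = [movie_mapping[int(i)] for i in top_indices]
print(top_values, top_movie_ids)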

vicbeldo98 commented 2 years ago

Okay! I don't really know why but it is working now, thank you very much!

Last thing, what happens if a user is not connected to any movie or vice versa?

rusty1s commented 2 years ago

Its embedding is not trained, and the recommendation is likely going to be random :)

vicbeldo98 commented 2 years ago

Perfect! Makes sense! Thank you very much for all your help :)

vicbeldo98 commented 2 years ago

Two last questions:

  1. How could I make a recommendation for a user outside the system? I mean, imagine we have a user for whom we know some links to movies, even though the user is new and was not part of training. Would the only possible way to recommend be to retrain the model with the new graph?

  2. I understand that GNNs work by passing information between neighbours. And I know there are more or less three types of recommendation: content-based, collaborative, and hybrid. My guess would be that this system is hybrid, because somehow movies adapt to users and vice versa, but I am not really sure. Is there any paper I could read on this subject?

Thank you very much! I am learning a lot, but it is very difficult

rusty1s commented 2 years ago

Sorry for the late reply.

  1. If you rely on learned embeddings for certain node types, you have to train any new embedding as well. You can either do that by re-training your model with the new node (which can become expensive), or by training only the embedding for the new node while holding all other parameters fixed (see the sketch after this list).
  2. Yes, I think this is indeed hybrid as well, but I'm not aware of any paper explaining that in detail.
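
A minimal sketch of the second option (the names model and user_emb and the optimizer settings are illustrative assumptions, not code from the thread):

import torch

# Freeze every trained parameter of the (hypothetical) model.
for param in model.parameters():
    param.requires_grad = False

# Create a fresh, trainable embedding row for the single new user.
new_user_emb = torch.nn.Parameter(torch.randn(1, model.user_emb.weight.size(-1)))
optimizer = torch.optim.Adam([new_user_emb], lr=0.01)

# Train only new_user_emb on the known links of the new user, then append it
# to the existing embedding table for inference, e.g.:
# model.user_emb.weight.data = torch.cat([model.user_emb.weight.data, new_user_emb.data], dim=0)
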
vicbeldo98 commented 2 years ago

Thank you very much for your response! I will get into it