pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Feature embeddings for each user and item are identical #2104

Open sasan73 opened 3 years ago

sasan73 commented 3 years ago

Hello and thank you for this great library,

I am working on a recommendation problem and have implemented a graph neural network model. For optimization I have chosen the Bayesian Personalized Ranking (BPR) method, which essentially tries to maximize the difference between a user's predicted scores for items they have interacted with and items they have not. So the loss function looks like this:

loss = torch.sum(torch.mul(u, p - n)) 
loss = F.logsigmoid(loss)

Here u is the user embedding, and p and n are the embeddings of the positive (interacted) and negative (not interacted) items.

In order to turn this into a minimization problem, we negate the loss before applying an optimization method (e.g. gradient descent):

loss_sum += torch.neg(loss)
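For reference, this corresponds to the standard BPR objective (written here in my own notation), where the score of a user-item pair is the dot product of their embeddings:

L_BPR = -\sum_{(u,i,j)} \ln \sigma\!\left( e_u^\top e_i - e_u^\top e_j \right)
      = -\sum_{(u,i,j)} \ln \sigma\!\left( e_u^\top (e_i - e_j) \right)

Here e_u, e_i and e_j are the embeddings of the user, the positive item and the negative item; \sigma is applied once per (user, positive, negative) triple, and the sum inside it runs only over the embedding dimension.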

However, while training, the feature embeddings of the users and items become more and more similar after each epoch. In the end I have this:

tensor([[-1.6928e+07, -3.8487e+07,  2.8422e+05,  ..., -2.1248e+06,  8.2496e+04, -2.3038e+03],
        [ 1.3135e+08, -3.8487e+07,  6.8238e+05,  ..., -9.7038e+06,  8.2496e+04, -5.4265e+03],
        [ 2.1087e+07, -3.8487e+07,  3.8716e+05,  ..., -4.0844e+06,  8.2496e+04, -3.1112e+03],
        ...,
        [-2.8774e+07, -3.8487e+07,  2.5270e+05,  ..., -1.5249e+06,  8.2496e+04, -2.0567e+03],
        [-2.8796e+07, -3.8487e+07,  2.5270e+05,  ..., -1.5249e+06,  8.2496e+04, -2.0567e+03],
        [-2.8796e+07, -3.8487e+07,  2.5270e+05,  ..., -1.5249e+06,  8.2496e+04, -2.0567e+03]],
       grad_fn=<...>)

Here is what I think is happening: the embedding vectors become so similar because the model is actually minimizing the difference between the positive and negative items, hence we end up with an embedding matrix whose rows are nearly identical. What confuses me is that, since I take the negative of the loss, the model should be trying to maximize that difference.

Here is my code:

for epoch in range(EPOCH):
  model.train()
  print("epochs: {}/{} ".format(epoch+1, EPOCH))
  t1 = time.time()
  for data in train_ldr:

      data.to(device)

      optimizer.zero_grad()
      xu, xa = model(data)
      print(xu, xa)

      # Sample_BPR

      loss_sum = 0
      for u_index, pos_index, neg_index in Sample_BPR_generator(train.edge_index, group_train, new_frame_train):

        u = xu[u_index]
        p = xa[pos_index]
        n = xa[neg_index]
        # BPR Loss
        loss = torch.sum(torch.mul(u, p - n)) 
        loss = F.logsigmoid(loss)
        loss_sum += torch.neg(loss)

      loss_sum.backward()
      optimizer.step()

Here is how the model is defined. I used PyTorch Geometric to perform the message passing in a graph convolutional network.

class GCMCLayer(MessagePassing): 
  def __init__(self, in_channel, out_channel):
    super(GCMCLayer, self).__init__(aggr = 'add')
    self.lin = nn.Linear(in_channel, out_channel)

  def forward(self, x, edge_index, N, M):

    x_first ,x_second = x
    x_first = self.lin(x_first)

    row, col = edge_index

    # Symmetric degree normalization per edge: 1 / sqrt(deg(j) * deg(i)).
    deg_i = degree(col)
    deg_u = degree(row)
    deg_inv_i = deg_i.pow(-0.5)
    deg_inv_u = deg_u.pow(-0.5)
    norm = deg_inv_u[row] * deg_inv_i[col]

    return self.propagate(edge_index, x=(x_first, x_second), norm=norm, size=(N, M))

  def message(self, x_j, norm):
    return norm.view(-1, 1) * x_j


class DenseLayer(nn.Module):
  def __init__(self, in_channel1, out_channel):
    super(DenseLayer, self).__init__()
    self.lin1 = nn.Linear(in_channel1, out_channel, bias=True)
    self.lin2 = nn.Linear(out_channel, out_channel)
    self.lin3 = nn.Linear(out_channel, out_channel)
    self.batch = nn.BatchNorm1d(out_channel)

  def forward(self, x, hi):

    x = self.lin1(x)
    x = F.relu(x)
    hi = self.lin3(hi)
    x += hi
    return self.batch(F.relu(x))

class GC_encoder(nn.Module):
  def __init__(self, out_channel):
    super(GC_encoder, self).__init__()
    self.gconv_u = GCMCLayer(train.x_i.shape[1], out_channel)
    self.gconv_a = GCMCLayer(train.x_u.shape[1], out_channel)

    self.dense_u = DenseLayer(train.x_u.shape[1], out_channel)
    self.dense_a = DenseLayer(train.x_i.shape[1], out_channel)
  def forward(self, data):

    self.data = data

    xu , xa , edge_index = self.data.x_u, self.data.x_i, self.data.edge_index

    V = self.gconv_a((xu, xa), edge_index, N=xu.shape[0], M=xa.shape[0])
    xa_new = self.dense_a(xa, V)

    U = self.gconv_u((xa, xu), edge_index[torch.tensor([1, 0])], N=xa.shape[0], M=xu.shape[0])    
    xu_new = self.dense_u(xu, U)

    return (xu_new, xa_new)
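
For reference, with the degree normalization in GCMCLayer, the aggregation for a target node i amounts to the following (my notation: W is self.lin, and d_j, d_i are the degrees of the source and target node computed from row and col):

h_i = \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_j \, d_i}} \, W x_j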
rusty1s commented 3 years ago

Are you sure your loss computation is correct? It looks like you are taking the sum over all examples, while what I believe you actually want to do is sum over the last dimension only (and take the mean after the logsigmoid call). In case your loss formulation is indeed correct, you may want to try increasing the ratio of negative samples to force the model into producing distinguishable embeddings.
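
A minimal sketch of that change, assuming u, p and n are batches of user, positive-item and negative-item embeddings of shape [batch_size, dim] (the names follow the snippet above, not an official API):

import torch
import torch.nn.functional as F

def bpr_loss(u, p, n):
    # Dot product per triple: sum only over the embedding dimension,
    # so every (user, positive, negative) triple keeps its own score.
    scores = (u * (p - n)).sum(dim=-1)  # shape: [batch_size]
    # Negative log-sigmoid per triple, averaged over the batch.
    return -F.logsigmoid(scores).mean()

# Example with random embeddings:
u = torch.randn(32, 64, requires_grad=True)
p = torch.randn(32, 64, requires_grad=True)
n = torch.randn(32, 64, requires_grad=True)
loss = bpr_loss(u, p, n)
loss.backward()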

lingchen1991 commented 2 years ago

Have you solved this problem? I used an AM-Softmax loss + GATConv and encountered the same problem. The 3-dimensional embeddings of the 48 nodes of a graph are all identical:

tensor([[-59.7394,   1.0199,   0.6965],
        [-59.7394,   1.0199,   0.6965],
        [-59.7394,   1.0199,   0.6965],
        ...,
        [-59.7394,   1.0199,   0.6965],
        [-59.7394,   1.0199,   0.6965],
        [-59.7394,   1.0199,   0.6965]], device='cuda:1', grad_fn=<...>)

The alpha values look quite odd. It seems that the attention over the two concatenated node embeddings is not working.

[screenshot of the computed attention coefficients (alpha) omitted]
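
One way to inspect those attention coefficients directly is GATConv's return_attention_weights argument; a small sketch on a made-up toy graph:

import torch
from torch_geometric.nn import GATConv

# Toy graph: 4 nodes with 8-dimensional features and a few directed edges.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])

conv = GATConv(in_channels=8, out_channels=3, heads=1)

# Returns the node embeddings together with (edge_index, alpha),
# where alpha holds one attention coefficient per edge and head.
out, (edge_index_out, alpha) = conv(x, edge_index, return_attention_weights=True)
print(alpha)

Note that GATConv adds self-loops by default, so the returned edge_index_out (and alpha) also covers the self-loop edges.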