pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

GCN with BERT embeddings as Node Features #2765

Closed · mahadafzal closed this issue 3 years ago

mahadafzal commented 3 years ago

Following the Google Colab example here (https://colab.research.google.com/drive/14OvFnAXggxB8vM4e8vSURUp1TaKnovzX?usp=sharing#scrollTo=ntt9qVFXlk6A), I was able to implement node classification for my dataset using a bag-of-words representation for the node features.

However, I am now trying to use tensors of encoded BERT word embeddings as node features instead of bag-of-words vectors. The maximum sequence length of each sentence (i.e., each node's text) is 30 tokens. Would I need to add an embedding layer before applying the GCN layers to the input data? My understanding so far is that the node features are too small (30 values) to provide any contextual information, and they only contain BERT token ids rather than embeddings. If so, could you point me to an example where the input node features are not bag-of-words?

Any suggestions would be helpful.

rusty1s commented 3 years ago

I suggest you simply encode your sentences via a pretrained Transformer model using the transformers library. Here is an example of how to do so.
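
A minimal sketch of that suggestion, using the transformers library (the model name, sentences, and max_length below are illustrative placeholders, not taken from the linked example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; any BERT-style encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# One sentence per node (placeholder texts).
sentences = ["text of node one", "text of node two", "text of node three"]

encoded = tokenizer(sentences, padding=True, truncation=True,
                    max_length=30, return_tensors="pt")

with torch.no_grad():
    out = model(**encoded)

# Token-level embeddings: shape [num_nodes, seq_len, 768].
print(out.last_hidden_state.shape)
```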

mahadafzal commented 3 years ago

I have done that, but I am running into issues when working with the GCN layer. Each node feature would have to be a 30 x 768 tensor, but the GCN would expect a 1d tensor of 30 tokens, correct?

rusty1s commented 3 years ago

Not sure I understand. You should get back a feature vector for each encoded sentence/word, which you can simply use as node features for your GCN. In case you have 30 nodes and BERT encodes each one as a 768-dimensional vector, this results in a node feature matrix of shape [30, 768], which you can then input into your GNN.
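
A minimal sketch of the PyG side of this, assuming the per-node 768-dimensional vectors have already been computed (the random features, chain-graph connectivity, and output size are placeholders):

```python
import torch
from torch_geometric.nn import GCNConv

num_nodes = 30
x = torch.randn(num_nodes, 768)                          # one BERT vector per node (placeholder)
edge_index = torch.stack([torch.arange(num_nodes - 1),   # placeholder chain graph
                          torch.arange(1, num_nodes)])

conv = GCNConv(in_channels=768, out_channels=64)
out = conv(x, edge_index)                                # shape [30, 64]
```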

mahadafzal commented 3 years ago

My understanding is that when I tokenize a sentence to a maximum length of, for example, 30 tokens and pass it through BERT, I get back a 30 x 768 tensor, with 30 being the number of tokens in the sentence and 768 being the length of the feature vector for each token. If there are 3 nodes in my graph, the tensor fed into the GCN model would have shape 3 x 30 x 768, i.e., each node feature would be a 30 x 768 tensor.

Since the GCN layer only expects a single integer for its in_channels parameter, how would I go about defining a 30 x 768 input for the model? Or, if that is not possible, do you have a workaround in mind?

rusty1s commented 3 years ago

I see. I think what people commonly do is take the average of the features over all tokens, e.g., feature.mean(dim=0).
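
A minimal sketch of that averaging, assuming the BERT outputs for all nodes are stacked into a single tensor (the shapes are illustrative):

```python
import torch

# Placeholder BERT output: 3 nodes, 30 tokens each, hidden size 768.
token_embeddings = torch.randn(3, 30, 768)

# Average over the token dimension: per node this is feature.mean(dim=0),
# turning each [30, 768] block into a single 768-dimensional vector.
node_features = token_embeddings.mean(dim=1)   # shape [3, 768]
```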

mahadafzal commented 3 years ago

Alright! I think another workaround would be to use the pooled output from BERT, i.e., the embedding of the first [CLS] token. That way, the 768-dimensional vector would represent contextual embeddings for the whole sequence. Closing this issue, thanks for the help!
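
For completeness, a minimal sketch of that [CLS]/pooled-output alternative with the transformers library (model name and sentences are placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

encoded = tokenizer(["text of node one", "text of node two"], padding=True,
                    truncation=True, max_length=30, return_tensors="pt")

with torch.no_grad():
    out = model(**encoded)

# Hidden state of the first ([CLS]) token: one 768-dim vector per node.
cls_features = out.last_hidden_state[:, 0, :]   # shape [num_nodes, 768]

# Alternatively, BERT's pooled output (a tanh-transformed [CLS] representation).
pooled_features = out.pooler_output             # shape [num_nodes, 768]
```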