huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to embed relational information in a Transformer for NMT task? #18465

Closed smith-co closed 2 years ago

smith-co commented 2 years ago

Feature request

Embedding relational information for a transformer

Motivation

I am using a Transformer model from Hugging Face for machine translation. However, my input data has relational information as shown below:

[image: example input sentence with relational (AMR) annotations]

So I have semantic information, in the form of an Abstract Meaning Representation (AMR) graph, attached to the input sentence.

Is there even a way to embed relationships like the above in a transformer model? Is there any model from Hugging Face that I can use in this regard?

Your contribution

If a model is developed, I could beta test the model.

sinking-point commented 2 years ago

What is the input to the transformer going to be? Is it more like:

He ended his meeting on Tuesday night.

but with the graph data encoded into the embeddings somehow? Or more like:

end-01 He meet-03 data-entity Tuesday night

with the graph data itself as input?

smith-co commented 2 years ago

The graph could be thought of like the following:

 ________
 |       |
 |      \|/
He ended his meeting on Tuesday night.
/|\ |    |               /|\
 |  |    |                | 
 |__|    |________________|  

Essentially, each token in the sentence is a node, and there could be edges between tokens.

sinking-point commented 2 years ago

In a normal transformer, the tokens are processed into token embeddings; then an encoding of each position is processed into an embedding and added to the token embedding at the corresponding position. The result is position-aware embeddings. This is how each token 'knows' where it is in the sequence.

You could do something similar with the edge information. You need some trainable network that takes the edge type and the positional encoding of the target node, combines this information, and outputs an embedding. The embeddings of all the edges can be added to the positional embeddings for the corresponding nodes.
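
For illustration, a rough sketch of that addition step, with made-up shapes and names (nothing here comes from an existing model). Note that index_add_ accumulates, so a node with several edges just gets the sum of their embeddings:

import torch

hidden_size, seq_len, num_edges = 768, 10, 3

# token embeddings + positional embeddings, as in a normal transformer
input_embeds = torch.randn(seq_len, hidden_size)

# one embedding per edge, produced by some trainable network from
# (edge type, positional encoding of the target node)
edge_embeds = torch.randn(num_edges, hidden_size)

# position of each edge's origin node in the sequence
origin_idx = torch.tensor([1, 3, 6])

# add each edge embedding to the embedding of its origin node
input_embeds.index_add_(0, origin_idx, edge_embeds)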

My intuition is that the attention layers could use this encoded information to 'find' related nodes. I don't know how well it will work but that would be my approach. Good luck!

smith-co commented 2 years ago

@sinking-point thanks for your response. So essentially I need to extend the positional embedding generation so that it is based not only on position in the sentence but also on the edge type.

But there could be different types of edges as well. How could those be combined? I suppose there would be a need to use different weights for different edge types?

Is there any such model implementation in Hugging Face? I have already had a look but can't find anything.

sinking-point commented 2 years ago

You could combine them like this:

Edge type id -> nn.Embedding -> edge type embedding

Index of target node -> positional encoding -> whatever positional embedding method your chosen transformer uses -> target node embedding

Sum = edge type embedding + target node embedding

If we only have a maximum of one edge per node, we can just add this sum to the origin node embedding. However, we might have many edges, and if we do this they'll interfere with each other. We want different edge types to be able to partition themselves into different parts of the vector, so I'd try a multi-layer perceptron kind of thing:

Sum (embedding width) -> nn.Linear -> hidden (bigger width) -> activation fn -> nn.Linear -> finished edge embedding
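
As a rough sketch (the class name, argument names and sizes here are all placeholders, mirroring the recipe above rather than any existing implementation):

from torch import nn

class MyEdgeEmbedding(nn.Module):
    """Maps (edge type id, target node position) to an edge embedding."""

    def __init__(self, num_edge_types, max_positions, hidden_size):
        super().__init__()
        self.edge_type_embed = nn.Embedding(num_edge_types, hidden_size)
        self.target_pos_embed = nn.Embedding(max_positions, hidden_size)
        # small MLP so different edge types can spread into different parts of the vector
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, edge_type_id, target_idx):
        # edge_type_id and target_idx are LongTensors
        summed = self.edge_type_embed(edge_type_id) + self.target_pos_embed(target_idx)
        return self.mlp(summed)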

Alternatively, you could take each edge, turn it into an embedding, add embeddings for both the origin and target nodes' positional encodings. Then just append these to the transformer input. There's less complexity in that you don't need the MLP I described, but might be more expensive because attention scales quadratically with length in both time and space.
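
A rough sketch of that alternative, again with made-up shapes (the edge embeddings here would already combine the edge type with the origin and target positions):

import torch

batch, seq_len, num_edges, hidden_size = 2, 10, 3, 768

inputs_embeds = torch.randn(batch, seq_len, hidden_size)      # token (+ positional) embeddings
edge_embeds = torch.randn(batch, num_edges, hidden_size)      # one embedding per edge
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)

# append the edge "tokens" to the sequence and extend the attention mask to cover them
inputs_embeds = torch.cat([inputs_embeds, edge_embeds], dim=1)
attention_mask = torch.cat(
    [attention_mask, torch.ones(batch, num_edges, dtype=torch.long)], dim=1
)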

sinking-point commented 2 years ago

I don't know of any existing transformer that does what you want already.

smith-co commented 2 years ago

@sinking-point thanks for your response. Can I apply this change in a modular fashion?

I suppose I need to augment the following snippet?

positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)

Having said that, how could I pass the edge information? 🤔

For me, it does not need to be optimized. Do you have any code snippet demonstrating something similar 🙏?

sinking-point commented 2 years ago

What transformer do you want to use? Take Bart, for example: you can pass in inputs_embeds.
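
Roughly like this, as an untested sketch (the checkpoint name is just an example):

from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("He ended his meeting on Tuesday night.", return_tensors="pt")

# look up the token embeddings yourself, modify them (e.g. add edge embeddings),
# then feed them back in instead of input_ids
inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])

outputs = model(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=inputs["input_ids"],  # placeholder; in NMT this would be the target tokens
)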

smith-co commented 2 years ago

I would like to use Longformer.

sinking-point commented 2 years ago

I would probably go with my first suggestion then. Putting all the edges at the end might not play well with longformer's local attention.

Longformer also has inputs_embeds as an argument, so you could do something like:

from torch import nn
from transformers import LongformerModel

class MyLongformer(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.model = LongformerModel(...)
        # some module that maps (edge type, target node position) to an embedding
        self.edge_embed = MyEdgeEmbedding(...)

    def forward(self, input_ids, edges, ...):
        # look up the plain token embeddings so we can add edge information to them
        inputs_embeds = self.model.get_input_embeddings()(input_ids)

        # edges: (batch_idx, edge_type_id, origin_idx, target_idx) tuples
        for batch, edge_type_id, origin_idx, target_idx in edges:
            inputs_embeds[batch, origin_idx] += self.edge_embed(edge_type_id, target_idx)

        # might be best to normalise here

        return self.model(inputs_embeds=inputs_embeds, ...)

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.