graphdeeplearning / graphtransformer

Graph Transformer Architecture. Source code for "A Generalization of Transformer Networks to Graphs", DLG-AAAI'21.
https://arxiv.org/abs/2012.09699

Detail on softmax #4

Closed: DevinKreuzer closed this issue 3 years ago

DevinKreuzer commented 3 years ago

Great work!

I have a question about the softmax implementation in graph_transformer_edge_layer.py.

When you define the softmax, you use the following function:

def exp(field):
    def func(edges):
        # clamp for softmax numerical stability
        return {field: torch.exp((edges.data[field].sum(-1, keepdim=True)).clamp(-5, 5))}
    return func

Shouldn't the attention weights/scores be scalars? From what I see, each head has an 8-dimensional score vector, on which you then call .sum(). The layer in graph_transformer_layer.py does not have this .sum() call:

def scaled_exp(field, scale_constant):
    def func(edges):
        # clamp for softmax numerical stability
        return {field: torch.exp((edges.data[field] / scale_constant).clamp(-5, 5))}

    return func

Would appreciate any clarification on this :)

Best, Devin

vijaydwivedi75 commented 3 years ago

Hi @DevinKreuzer,

In graph_transformer_layer.py, the .sum() over the feature dimension is done earlier, inside src_dot_dst, so the score that reaches scaled_exp is already a scalar per head: https://github.com/graphdeeplearning/graphtransformer/blob/3c83b4ba5e45a2e25bbefde1b35d88a27ca3cfb2/layers/graph_transformer_layer.py#L18-L19
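For reference, the linked lines compute the per-head dot product roughly like this (a paraphrased sketch, not the verbatim source):

def src_dot_dst(src_field, dst_field, out_field):
    def func(edges):
        # dot product over the feature dimension -> one scalar score per head
        return {out_field: (edges.src[src_field] * edges.dst[dst_field]).sum(-1, keepdim=True)}
    return func

Because the .sum(-1, keepdim=True) already happens here, scaled_exp does not need its own sum.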

@DevinKreuzer: Shouldn't the attention weights/scores be scalars? From what I see, each head has an 8-dimensional score vector

In graph_transformer_edge_layer.py, the scores are kept as per-head vectors at that stage because they are modified elementwise by the projected edge features; the reduction to a scalar per head happens afterwards, inside exp(), via the .sum(-1, keepdim=True) you quoted. See:
https://github.com/graphdeeplearning/graphtransformer/blob/3c83b4ba5e45a2e25bbefde1b35d88a27ca3cfb2/layers/graph_transformer_edge_layer.py#L44-L48
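To make the difference concrete, here is a paraphrased sketch of the edge-layer score computation (names mirror the file but are approximate, not the verbatim source):

import torch

def src_dot_dst(src_field, dst_field, out_field):
    def func(edges):
        # elementwise K*Q product, kept as a vector of shape (num_edges, num_heads, head_dim)
        return {out_field: (edges.src[src_field] * edges.dst[dst_field])}
    return func

def imp_exp_attn(implicit_attn, explicit_edge):
    def func(edges):
        # modulate the vector-valued scores elementwise with the projected edge features
        return {implicit_attn: (edges.data[implicit_attn] * edges.data[explicit_edge])}
    return func

def exp(field):
    def func(edges):
        # only here is the (num_edges, num_heads, head_dim) score reduced to one scalar per head
        return {field: torch.exp((edges.data[field].sum(-1, keepdim=True)).clamp(-5, 5))}
    return func

So the reduction to a scalar per head still happens before the softmax normalization; it is simply deferred from src_dot_dst to exp() so that the intermediate vector scores can be combined with the edge features.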

Hope this helps in understanding the implementation. Vijay

vijaydwivedi75 commented 3 years ago

Closing the issue for now. Feel free to reopen for any further clarification.