seongjunyun / Graph_Transformer_Networks

Graph Transformer Networks (Authors' PyTorch implementation for the NeurIPS 19 paper)

Difference between code implementation and paper description #22

Closed mmichaelzhang closed 3 years ago

mmichaelzhang commented 4 years ago

Hi,

I found it interesting that the paper says "It is used for node classification on top and two dense layers followed by a softmax layer are used" at the bottom of page 5.

However, the code implementation uses two linear layers with a ReLU nonlinearity in between, and the output of the second linear layer is compared directly with the labels via cross-entropy. No explicit softmax layer follows.

X_ = self.linear1(X_)            # first linear (dense) layer
X_ = F.relu(X_)                  # nonlinearity
y = self.linear2(X_[target_x])   # logits for the labeled nodes only
loss = self.loss(y, target)      # cross-entropy on the raw logits
return loss, y, Ws
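For context, the snippet above is part of the model's forward pass. A minimal, self-contained sketch of the same two-layer head is below; the class name, dimensions, and toy inputs are my own assumptions for illustration, not values from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Hypothetical reconstruction of the two-layer head in the snippet."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.linear1 = nn.Linear(in_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, num_classes)
        self.loss = nn.CrossEntropyLoss()  # expects raw logits, not probabilities

    def forward(self, X_, target_x, target):
        X_ = self.linear1(X_)            # first linear (dense) layer
        X_ = F.relu(X_)                  # nonlinearity
        y = self.linear2(X_[target_x])   # logits for the labeled nodes only
        loss = self.loss(y, target)      # cross-entropy on the raw logits
        return loss, y

# Toy usage with made-up sizes: 10 node embeddings of dimension 64, 3 classes.
head = ClassificationHead(in_dim=64, hidden_dim=64, num_classes=3)
X = torch.randn(10, 64)                  # stand-in for GT-layer node features
idx = torch.tensor([0, 3, 7])            # indices of the labeled nodes
labels = torch.tensor([0, 2, 1])
loss, logits = head(X, idx, labels)
```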

Which should I rely on: the paper description or the provided code implementation?

seongjunyun commented 3 years ago

Hi,

There is no difference between the paper description and the code implementation. "Dense layer" and "linear layer" mean the same thing, and the loss module nn.CrossEntropyLoss() already contains a softmax layer.
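To make the last point concrete, here is a quick check (my own illustration, not from the repository) that PyTorch's nn.CrossEntropyLoss applies the softmax internally: it is numerically identical to an explicit log-softmax followed by NLLLoss, and the predicted class from raw logits matches the one from softmax probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 1])

# CrossEntropyLoss on raw logits ...
ce = nn.CrossEntropyLoss()(logits, target)
# ... equals NLLLoss applied after an explicit log-softmax.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
same_loss = torch.allclose(ce, nll)

# Softmax is monotonic, so argmax predictions are identical either way.
same_preds = bool(
    (logits.argmax(dim=1) == F.softmax(logits, dim=1).argmax(dim=1)).all()
)
```

So adding an explicit softmax before nn.CrossEntropyLoss would actually be a bug (the softmax would be applied twice).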