richardrl closed this issue 5 years ago
You may consider using Jumping Knowledge; maybe that resolves your issue. Otherwise, it is quite hard to judge without any examples.
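For reference, a minimal sketch of how Jumping Knowledge can be plugged in, using `GCNConv` as a stand-in for your own operator (layer sizes and the model name are placeholders, not a recommendation for your exact setup):

```python
import torch
from torch_geometric.nn import GCNConv, JumpingKnowledge

class JKNet(torch.nn.Module):
    """Collects the output of every GNN layer and aggregates them with
    Jumping Knowledge, so early-layer signal still reaches the loss."""
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=4):
        super().__init__()
        self.convs = torch.nn.ModuleList([GCNConv(in_channels, hidden_channels)])
        for _ in range(num_layers - 1):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        # 'cat' concatenates all intermediate representations;
        # 'max' and 'lstm' are the other aggregation modes.
        self.jk = JumpingKnowledge(mode='cat')
        self.lin = torch.nn.Linear(num_layers * hidden_channels, out_channels)

    def forward(self, x, edge_index):
        xs = []
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
            xs.append(x)
        return self.lin(self.jk(xs))
```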
I am trying to replicate the architecture described on page 3 here, but with additive attention: https://openreview.net/pdf?id=HkxaFoC9KQ
Even with 2 'relational' modules, the gradients seem to vanish. These attentional graph neural network architectures typically use lots of dense layers, so I'm curious how any 'deeper' graph networks could even be trained if my "2 attention deep" architecture already suffers from vanishing gradients.
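Roughly, the additive attention I mean looks like this (a simplified sketch; the exact layer sizes are placeholders and the entity extraction in front of it is omitted):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) pairwise attention over N entity vectors."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)  # scores each (query, key) pair

    def forward(self, x):                               # x: [B, N, dim]
        q = self.w_q(x).unsqueeze(2)                    # [B, N, 1, dim]
        k = self.w_k(x).unsqueeze(1)                    # [B, 1, N, dim]
        logits = self.v(torch.tanh(q + k)).squeeze(-1)  # [B, N, N]
        attn = torch.softmax(logits, dim=-1)
        return attn @ x                                 # attended entities, [B, N, dim]
```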
Are you sure there is no bug in your implementation? How is your operator implemented?
I am closing this for now. Feel free to reopen in case you want to discuss this further.
I have implemented a simple attention-based graph neural network. It simply stacks the relational modules from Santoro et al. 2017 depth-wise.
The problem is that my gradients vanish quickly in my sparse-label setting. I am already using residual connections and layer normalization (a simplified sketch of one block is below). Which graph architectures here might address this issue?
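For context, a simplified sketch of one relational block and the depth-wise stack (names and sizes are placeholders, not my exact code; the attention module passed in is the additive attention sketched in my comment above):

```python
import torch
import torch.nn as nn

class RelationalBlock(nn.Module):
    """One relational module: pairwise attention followed by a small MLP,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, dim, attention):
        super().__init__()
        self.attention = attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: [B, N, dim]
        x = self.norm1(x + self.attention(x))      # residual + layer norm
        return self.norm2(x + self.mlp(x))         # residual + layer norm

# Depth-wise stack of two relational modules, as in my "2 attention deep" setup.
dim = 64  # placeholder hidden size
blocks = nn.Sequential(
    RelationalBlock(dim, AdditiveAttention(dim)),
    RelationalBlock(dim, AdditiveAttention(dim)),
)
```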