richardrl closed this issue 5 years ago
You may consider using Jumping Knowledge; maybe that resolves your issue. Otherwise, it is quite hard to judge without any examples.
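For reference, a minimal sketch of how Jumping Knowledge can be plugged in, using `GCNConv` as a stand-in for your own operator (layer sizes and the model name are placeholders, not a recommendation for your exact setup):

```python
import torch
from torch_geometric.nn import GCNConv, JumpingKnowledge

class JKNet(torch.nn.Module):
    """Collects the output of every GNN layer and aggregates them with
    Jumping Knowledge, so early-layer signal still reaches the loss."""
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=4):
        super().__init__()
        self.convs = torch.nn.ModuleList([GCNConv(in_channels, hidden_channels)])
        for _ in range(num_layers - 1):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        # 'cat' concatenates all intermediate representations;
        # 'max' and 'lstm' are the other aggregation modes.
        self.jk = JumpingKnowledge(mode='cat')
        self.lin = torch.nn.Linear(num_layers * hidden_channels, out_channels)

    def forward(self, x, edge_index):
        xs = []
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
            xs.append(x)
        return self.lin(self.jk(xs))
```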
I am trying to replicate the architecture described on page 3 here, but with additive attention: https://openreview.net/pdf?id=HkxaFoC9KQ
Even with 2 'relational' modules, the gradients seem to vanish. These attentional graph neural network architectures typically use lots of dense layers, so I'm curious how any 'deeper' graph networks could even be trained if my "2 attention deep" architecture already suffers from vanishing gradients.
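Roughly, the additive attention I mean looks like this (a simplified sketch; the exact layer sizes are placeholders and the entity extraction in front of it is omitted):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) pairwise attention over N entity vectors."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)  # scores each (query, key) pair

    def forward(self, x):                               # x: [B, N, dim]
        q = self.w_q(x).unsqueeze(2)                    # [B, N, 1, dim]
        k = self.w_k(x).unsqueeze(1)                    # [B, 1, N, dim]
        logits = self.v(torch.tanh(q + k)).squeeze(-1)  # [B, N, N]
        attn = torch.softmax(logits, dim=-1)
        return attn @ x                                 # attended entities, [B, N, dim]
```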
Are you sure there is no bug in your implementation? How is your operator implemented?
I am closing this for now. Feel free to reopen in case you want to discuss this further.
I have implemented a simple attention-based graph neural network. It simply stacks the relational modules from Santoro et al. 2017 depth-wise.
The problem is that my gradients vanish quickly in my sparse-label setting. I am already using residual connections and layer normalization (a simplified sketch of one block is below). Which graph architectures here might address this issue?
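For context, a simplified sketch of one relational block and the depth-wise stack (names and sizes are placeholders, not my exact code; the attention module passed in is the additive attention sketched in my comment above):

```python
import torch
import torch.nn as nn

class RelationalBlock(nn.Module):
    """One relational module: pairwise attention followed by a small MLP,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, dim, attention):
        super().__init__()
        self.attention = attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: [B, N, dim]
        x = self.norm1(x + self.attention(x))      # residual + layer norm
        return self.norm2(x + self.mlp(x))         # residual + layer norm

# Depth-wise stack of two relational modules, as in my "2 attention deep" setup.
dim = 64  # placeholder hidden size
blocks = nn.Sequential(
    RelationalBlock(dim, AdditiveAttention(dim)),
    RelationalBlock(dim, AdditiveAttention(dim)),
)
```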