Closed denjots closed 3 years ago
This needs further research I think. Closing for now.
I've worked quite a bit with this neural network layer now, and I think it's probably the least stable thing I've ever tried to train! It's almost impossible to keep it from exploding unless you train it with a tiny learning rate. I suspect the algorithm itself is broken, or maybe not yet properly described in the paper. I've tried the latest fix, which helps a bit, but on training 4 layers with randomly sized graphs of up to 100 nodes, I got through 58 epochs of training (loss going down) before it started exploding to NaN.

Looking at the algorithm, I just think adding an unbounded weighted sum of relative coordinates to the existing coordinates is never going to be stable for anything other than a toy problem. The weights needed for 100 nodes are never going to be comparable in magnitude to the weights needed for 10 nodes, say. Maybe the original authors meant to normalize the weights in some way, or perhaps norm the relative coordinate vectors - or perhaps they just never tried it on larger, messier problems.
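A minimal numpy sketch of the scaling argument above (not the library's actual code; the uniform weight value is a hypothetical stand-in for the coordinate MLP's output): summing weighted relative coordinates over all neighbors makes the update magnitude grow with graph size.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_update_norm(n_nodes, weight=0.1):
    # hypothetical uniform coordinate weight; in EGNN these come from an MLP
    x = rng.normal(size=(n_nodes, 3))
    rel = x[:, None, :] - x[None, :, :]   # relative coordinates x_i - x_j
    update = weight * rel.sum(axis=1)     # unbounded weighted sum over neighbors
    return float(np.linalg.norm(update, axis=-1).mean())

small = mean_update_norm(10)
large = mean_update_norm(100)
assert large > small  # same weight scale, far bigger update on the larger graph
```

At a fixed weight scale, the 100-node update is roughly an order of magnitude larger than the 10-node one, which is the mismatch described above.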
@denjots thanks for letting me know! Have you tried any of those solutions you listed? Norming the relative vectors for weighting seems like a good idea
@denjots there's also a lot of other things to try, if you are willing. It just seems too promising a technique to let go entirely
@denjots do you want to try setting `norm_rel_coors = True` for your larger experiments? normalizing the `coor_weights` may be a good idea as well
I have to admit I have played with those things and others, e.g. using tanh to squash the coordinate weights. It's a balancing act between two bad states: stable but unwilling to learn anything, and unstable. Neither is good, obviously. Adding some normalization works for graphs up to 100 nodes, but larger graphs produce NaNs again. Applying more normalization stabilizes it again, but then it learns very, very slowly (if it is indeed learning anything at all). It just feels like a neural net model that really wants to be unstable! Also, I'm a little worried that some tweaks will cause equivariance failures, but I think normalization at least should not do so.
I'm currently trying to rescale the weights by various functions of the graph size - maybe that will work?
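One hedged sketch of that rescaling idea, continuing the toy setup: dividing the summed update by the number of neighbors (i.e. averaging instead of summing) makes the update magnitude roughly independent of graph size. Whether this preserves enough signal to learn is exactly the open question above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_scaled_update_norm(n_nodes, weight=0.1):
    # hypothetical rescaling by graph size: average the weighted relative
    # coordinates over neighbors instead of summing them
    x = rng.normal(size=(n_nodes, 3))
    rel = x[:, None, :] - x[None, :, :]
    update = weight * rel.mean(axis=1)
    return float(np.linalg.norm(update, axis=-1).mean())

small = mean_scaled_update_norm(10)
large = mean_scaled_update_norm(100)
assert large < 3 * small  # roughly the same scale regardless of graph size
```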
@denjots are you using `EGNN` or `EGNN_Sparse`? perhaps restricting the operation to local neighborhoods on the biomolecule would work better?
@denjots yea, I mean, instability isn't really something foreign to us deep learning folks. it's just a problem to be solved
@denjots yea, i think the three things to try are (1) restricting updates to only local atoms, (2) norming the relative vectors, (3) norming the coordinate weights coming out of the coordinate MLP. none of what i listed will break equivariance
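Fixes (2) and (3) can be sketched in a few lines of numpy (an illustration, not the library's implementation; tanh is just one possible squashing choice). Normalizing the relative vectors changes only magnitudes, not directions, so rotation and translation equivariance survive, as noted above.

```python
import numpy as np

rng = np.random.default_rng(1)

def stabilized_update(x, w, eps=1e-8):
    rel = x[:, None, :] - x[None, :, :]
    # (2) norm the relative vectors to unit length; only magnitudes change,
    #     so equivariance is preserved (eps guards the zero i == j vector)
    rel = rel / (np.linalg.norm(rel, axis=-1, keepdims=True) + eps)
    # (3) squash the coordinate weights into (-1, 1), e.g. with tanh
    w = np.tanh(w)
    return (w[..., None] * rel).sum(axis=1)

x = rng.normal(size=(20, 3))
w = rng.normal(size=(20, 20)) * 10.0   # deliberately large raw weights
upd = stabilized_update(x, w)
# each neighbor now contributes a vector of norm < 1, so the total is bounded
assert np.isfinite(upd).all() and np.linalg.norm(upd, axis=-1).max() < 20
```

Note that the sum can still grow linearly with the neighbor count, which is where fix (1), restricting to local atoms, comes in.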
I haven't tried the sparse version - too many dependencies, some of which don't install out of the box on my system. I did try some early experiments with a distance-dependent mask in the basic EGNN - maybe I will revisit them. It's a shame, because EGNNs looked so simple, but I guess things can be too simple sometimes...
@denjots i wouldn't let up yet, there's another similar paper from a recent ICLR that corroborates the results https://github.com/lucidrains/geometric-vector-perceptron I think it is worth trying to localize, or at least clamp the relative vector norms.
the alternative is to use SE3-Transformers, which, although it works, is an order of magnitude slower
@denjots https://github.com/lucidrains/En-transformer I have another version of EGNN that you could try, that is free from the PyG dependencies. you can restrict to local neighbors with the `num_nearest_neighbors` keyword
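The local-neighbor restriction can be sketched as a k-nearest-neighbors mask (a hypothetical illustration in the spirit of the `num_nearest_neighbors` keyword, not the repo's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_mask(x, k):
    # boolean mask where mask[i, j] is True iff j is one of i's k nearest
    # neighbors; applied to the coordinate weights, it limits each node's
    # update to k bounded contributions
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-interaction
    idx = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest per node
    mask = np.zeros(d.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

x = rng.normal(size=(50, 3))
mask = knn_mask(x, 8)
assert (mask.sum(axis=1) == 8).all()        # exactly k neighbors per node
```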
I have tried SE(3)-Transformers - both Fabian's and yours - and they are both stable. Obviously the speed and simplicity of EGNN was very attractive, but you can't have everything. Thanks for the pointer to the En-transformer - I had noticed it, but was keen to get the basic EGNN to work first. I will queue the transformer for testing and see how it goes, however.
@denjots thank you for relaying your results! this matters a lot to me, as if this approach can be made to work, it may make AlphaFold2 replication a lot simpler
@denjots I'll ready a version of EGNN without PyG for you to try by later this evening, also restricting to nearest neighbors
@denjots I've added the feature in 0.0.20, so you can use it like so https://github.com/lucidrains/egnn-pytorch/blob/0.0.20/tests/test_equivariance.py#L22-L37
Thanks - will have a look tomorrow.
I think I'm happy enough with where it's at now. I went back to my own masking in the end - mainly because the idea of a fixed number of neighbors wasn't very realistic - so I just mask the coordinate weights with a specified distance threshold now. This, combined with the new initialization, stabilizes training up to at least 8 layers, which is good enough for me. It's not learning much of value so far, but that's likely more a science problem I need to look at than a software problem. Thanks a lot for your help!
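The distance-threshold masking described above can be sketched like so (a hedged reconstruction of the idea, not the poster's actual code; the cutoff value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def masked_coord_update(x, w, cutoff):
    # zero out the coordinate weights for pairs beyond the distance cutoff,
    # so each node's update only sums over its spatial neighborhood
    rel = x[:, None, :] - x[None, :, :]
    d = np.linalg.norm(rel, axis=-1)
    w = np.where(d < cutoff, w, 0.0)
    np.fill_diagonal(w, 0.0)                 # no self-interaction
    return (w[..., None] * rel).sum(axis=1)

x = rng.normal(size=(30, 3)) * 5.0           # spread the nodes out
w = rng.normal(size=(30, 30))
upd = masked_coord_update(x, w, cutoff=4.0)
assert np.isfinite(upd).all()
```

Unlike a fixed-k neighbor list, the number of contributing neighbors here varies per node with local density, which matches the "fixed number of neighbors wasn't very realistic" point above.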
Glad to hear you got it working!
Maybe helpful? https://arxiv.org/abs/2009.03294
Are you not doing any normalization in the EGNN?
I'm going to look into this further... closing.