materialsvirtuallab / matgl

Graph deep learning library for materials
BSD 3-Clause "New" or "Revised" License

Stochasticity in M3GNet prediction #116

Closed BowenD-UCB closed 1 year ago

BowenD-UCB commented 1 year ago

Hi Shyueping and Kenko,

The dgl implementation of M3GNet gives stochastic predictions, which was not observed with the original M3GNet implementation.

I'm worried this might be associated with a serious bug that needs attention. Please have a look:

[Screenshot (2023-07-26): repeated M3GNet predictions for the same structure returning different values]
shyuep commented 1 year ago

Hmm, interesting. The MEGNet implementation doesn't suffer from this problem, so it must be in the three-body computations. @kenko911 pls fix this asap.

shyuep commented 1 year ago

I debugged a little. I can confirm that in the forward method, the node_feat and edge_feat are consistent. But

```python
node_vec = self.readout(g)
vec = torch.hstack([node_vec, state_feat]) if self.inc...
```

is stochastic. So there must be some issue with the readout layer. MEGNet does not use the same readout.

kenko911 commented 1 year ago

@BowenD-UCB Thanks a lot for reporting the issue and I will figure out the problem and fix it ASAP.

JiQi535 commented 1 year ago

It seems that the M3GNet universal potential (UP) does not have this issue.

[Screenshot (2023-07-26): repeated M3GNet UP predictions returning identical values]
shyuep commented 1 year ago

Because the UP does not predict an intensive property, it does not call the readout.

shyuep commented 1 year ago

I have confirmed that the stochasticity results from the Set2Set in dgl.nn. For instance, I replaced the set2set call in Set2SetReadout with the following code:

```python
from dgl.nn import Set2Set

for i in range(3):
    # A fresh Set2Set is constructed on each iteration, so its internal
    # LSTM weights are randomly re-initialized every time.
    set2set = Set2Set(in_feats, n_iters=self.num_steps, n_layers=self.num_layers)
    print(g.ndata["node_feat"])
    out_tensor = set2set(g, g.ndata["node_feat"])
    print(f"out={out_tensor}")
```

The output tensor is different on all three iterations, despite there being no change in the initialization arguments or the input.

shyuep commented 1 year ago

I just pushed a fix. The trick is to initialize the Set2Set only once. Every time it is initialized, it introduces stochasticity.
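
The shape of the fix, as a minimal sketch (the class and attribute names here are illustrative, not the exact matgl code): construct the Set2Set once in `__init__` so its LSTM weights are drawn a single time and registered with the module.

```python
import torch.nn as nn
from dgl.nn import Set2Set

class Set2SetReadoutSketch(nn.Module):
    """Illustrative only: build Set2Set once, reuse it on every forward pass."""

    def __init__(self, in_feats: int, num_steps: int, num_layers: int):
        super().__init__()
        # Constructed exactly once: the LSTM weights are drawn here and then
        # stay fixed (and are part of the module's state dict).
        self.set2set = Set2Set(in_feats, n_iters=num_steps, n_layers=num_layers)

    def forward(self, g):
        # No re-initialization per call, hence no per-call stochasticity.
        return self.set2set(g, g.ndata["node_feat"])
```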

shyuep commented 1 year ago

However, it should be noted that multiple loads of the model will still lead to stochasticity. I am not sure how to solve that. I am reopening this issue for now.

shyuep commented 1 year ago

@kenko911 I think you need to look into whether there is some random seed somewhere that needs to be fixed in dgl.nn for the Set2Set initialization. It is rather strange that it is not deterministic.

shyuep commented 1 year ago

Scratch what I wrote previously. dgl.seed does not do anything here. Right now, the stochasticity with a single model load is solved, but the stochasticity with multiple model loads is not.

```python
from pymatgen.core import Lattice, Structure
import matgl
import torch
import random
import dgl

torch.manual_seed(0)
random.seed(0)
dgl.seed(0)

# This is the structure obtained from the Materials Project.
struct = Structure.from_spacegroup("Pm-3m", Lattice.cubic(4.1437), ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

for i in range(10):
    # The model is reloaded on every iteration.
    model = matgl.load_model("M3GNet-MP-2018.6.1-Eform")
    eform = model.predict_structure(struct)
    print(f"The predicted formation energy for CsCl is {float(eform.numpy()):.3f} eV/atom.")
```

still gives stochastic results. Note that the only difference between @BowenD-UCB's code and this one is that the model load is now inside the for loop, so a fresh load is carried out on each iteration. If the load is outside the for loop, the results are deterministic within that session.
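
For contrast, a minimal variant of the same script with the load hoisted out of the loop, which per the above is deterministic within a session:

```python
# Load once; repeated predictions within the session are then deterministic.
model = matgl.load_model("M3GNet-MP-2018.6.1-Eform")
for i in range(10):
    eform = model.predict_structure(struct)
    print(f"The predicted formation energy for CsCl is {float(eform.numpy()):.3f} eV/atom.")
```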

shyuep commented 1 year ago

It appears this is a known issue with LSTM: https://github.com/pytorch/pytorch/issues/35661
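
For illustration, a standalone example (mine, not code from the linked issue) of the underlying behavior: every construction of an nn.LSTM draws fresh random weights, so two instances differ unless the seed is fixed before each construction.

```python
import torch
import torch.nn as nn

# Two LSTMs constructed back to back get different random weights.
a = nn.LSTM(input_size=4, hidden_size=4)
b = nn.LSTM(input_size=4, hidden_size=4)
print(torch.equal(a.weight_ih_l0, b.weight_ih_l0))  # False

# Fixing the seed before each construction makes them identical.
torch.manual_seed(0)
c = nn.LSTM(input_size=4, hidden_size=4)
torch.manual_seed(0)
d = nn.LSTM(input_size=4, hidden_size=4)
print(torch.equal(c.weight_ih_l0, d.weight_ih_l0))  # True
```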

shyuep commented 1 year ago

Note that it seems that if you fix torch.manual_seed, all the stochasticity is gone.

This is the fix that I pushed. I am still not sure why this non-deterministic behavior occurs in M3GNet but not MEGNet. MEGNet seems to work fine without any manual seed set. It may be due to the number of layers in the Set2Set? M3GNet uses 3 layers while MEGNet uses just 1.
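
From the user side, the workaround amounts to something like this (a hypothetical helper for illustration, not the actual pushed fix):

```python
import torch
import matgl

def load_model_deterministic(name: str, seed: int = 0):
    # Hypothetical convenience wrapper: fixing the torch seed right before
    # loading makes any randomly initialized weights reproducible across loads.
    torch.manual_seed(seed)
    return matgl.load_model(name)

model = load_model_deterministic("M3GNet-MP-2018.6.1-Eform")
```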

shyuep commented 1 year ago

@BowenD-UCB Just a note that Kenko pushed a final fix this morning, and a new version has been released. The main issue was that the final layer was not in the init, so its weights were not saved. The new code fixes this problem. The only model that was affected was the M3GNet formation energy model. I also added an integration test to make sure it does not recur in the future.
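
To illustrate the class of bug described (a schematic example, not the actual matgl code): a layer constructed outside `__init__` is never registered as a submodule, so its weights are excluded from the saved state dict and re-randomized on every load.

```python
import torch.nn as nn

class BrokenReadout(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 8)

    def forward(self, x):
        # Bug pattern: this layer is created per call, is absent from
        # state_dict(), and gets fresh random weights every time.
        final = nn.Linear(8, 1)
        return final(self.body(x))

class FixedReadout(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 8)
        self.final = nn.Linear(8, 1)  # registered, saved, and reloaded

    def forward(self, x):
        return self.final(self.body(x))
```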