microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License

QM9 units mismatch #13

Closed gasteigerjo closed 3 years ago

gasteigerjo commented 3 years ago

Looking at your QM9 dataset, I've discovered that the raw data seems to be standardized, and you seem to provide the targets' standard deviations. Comparing (the inverse of) these with the PyTorch Geometric dataset's standard deviations (and converting all energies from kcal/mol to eV), I mostly get a good match:

 Target: GNN-FiLM ?= PyG
      μ:    15.03 ?= 1.50
      α:    81.73 ?= 8.18
 ε_homo:     0.60 ?= 0.60
 ε_lumo:     1.29 ?= 1.27
     Δε:     1.29 ?= 1.29
   <R²>:   233.73 ?= 280.61
   ZPVE:    32.58 ?= 0.90
U0_atom:    10.41 ?= 10.33
 U_atom:    10.50 ?= 10.42
 H_atom:    10.58 ?= 10.50
 G_atom:     9.58 ?= 9.51
    c_v:    81.35 ?= 4.06

However, there is a 10x difference in μ and α, and <R²>, ZPVE, and c_v don't match at all. What am I missing here?
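For reference, here is a minimal sketch of how I computed the GNN-FiLM column above: the inverse of the repo's `CHEMICAL_ACC_NORMALISING_FACTORS`, with the atomisation energies additionally converted from kcal/mol to eV (the conversion constant is the standard one, not taken from the repo):

```python
KCAL2EV = 0.0433641  # eV per 1 kcal/mol (standard conversion constant)

# entries of CHEMICAL_ACC_NORMALISING_FACTORS for the atomisation energies
factors = {
    'U0_atom': 0.004165489,
    'U_atom':  0.004128926,
    'H_atom':  0.00409976,
    'G_atom':  0.004527465,
}
# inverse of each factor, converted from kcal/mol to eV
for name, f in factors.items():
    print(f"{name:>7}: {KCAL2EV / f:8.2f}")
# U0_atom comes out at ~10.41, close to PyG's stddev of 10.33
```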

PS.: There seems to be a misunderstanding of the term "chemical accuracy". Chemical accuracy is a constant associated with the precision of experimental measurements, not the standard deviation. In issue #11 you said that CHEMICAL_ACC_NORMALISING_FACTORS gives the chemical accuracies. Neither those values, nor their inverses, nor some other transformation of them match at all:

 Target: GNN-FiLM ?= MPNN chem. accuracy
      μ:   0.0665 ?= 0.100
      α:   0.0122 ?= 0.100
 ε_homo:   0.0719 ?= 0.043
 ε_lumo:   0.0337 ?= 0.043
     Δε:   0.0335 ?= 0.043
   <R²>:   0.0043 ?= 1.200
   ZPVE:   0.0013 ?= 0.001
     U0:   0.0042 ?= 0.043
      U:   0.0041 ?= 0.043
      H:   0.0041 ?= 0.043
      G:   0.0045 ?= 0.043
    c_v:   0.0123 ?= 0.050
      ω:   0.0375 ?= 10.000

mmjb commented 3 years ago

Hey,

Thanks for investigating this. The provenance of these values makes it a bit hard to follow through on them: the data made its way from Justin Gilmer (after the MPNN paper) to Alex Gaunt to me, and I'm trying to recover the details. Let me summarise the relevant details as I understood them, and then try to verify / reconstruct things.

Part 1: Target values in the tf-gnn dataset are obtained by normalising the raw data with dataset statistics (subtract the mean, divide by the stddev). Results starting from the raw data:

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.read_csv("gdb9.sdf.csv")
>>> means, stds = {}, {}
>>> for prop in data.keys()[1:]:
...   means[prop] = np.mean(data[prop])
...   stds[prop] = np.std(data[prop])
>>> data_normalised = data.copy()
>>> for col in data.keys()[1:]:
...   data_normalised[col] = (data[col] - means[col]) / stds[col]
>>> print(data_normalised.iloc[41])
mol_id           gdb_42
A            0.00275349
B               2.70838
C               3.16128
mu              -1.7633
alpha          -5.34594
homo          -0.877637
lumo            1.00725
gap             1.40575
r2             -3.18737
zpve           -1.90397
u0              4.52722
u298            4.52717
h298            4.52717
g298             4.5273
cv             -3.63417
u0_atom         3.75323
u298_atom       3.74152
h298_atom       3.73459
g298_atom       3.79761
Name: 41, dtype: object

Same datapoint in the tf-gnn-sample preprocessed data:

$ zgrep "qm9:000042" ~/AZProjects/tf-gnn-samples/data/qm9/*.jsonl.gz
/home/mabrocks/AZProjects/tf-gnn-samples/data/qm9/train.jsonl.gz:{"targets": [[-1.7729191], [-5.3665919], [-0.87358874], [0.99390996], [1.3954599], [-3.1788111], [-1.9290129], [3.7917917], [3.7800913], [3.7730706], [3.8370471], [-3.6345415], [1.1978674]], "graph": [[0, 1, 1], [0, 1, 4], [1, 1, 2], [1, 1, 5], [1, 1, 6], [2, 1, 3], [2, 1, 7], [2, 1, 8], [3, 1, 9]], "id": "qm9:000042", "node_features": [[0, 0, 0, 1, 0, 8, -0.42366999, 1, 1, 0, 0, 0, 1, 0, 1], [0, 1, 0, 0, 0, 6, -0.064048998, 0, 0, 0, 0, 0, 1, 0, 2], [0, 1, 0, 0, 0, 6, -0.064037003, 0, 0, 0, 0, 0, 1, 0, 2], [0, 0, 0, 1, 0, 8, -0.42366901, 1, 1, 0, 0, 0, 1, 0, 1], [1, 0, 0, 0, 0, 1, 0.28335601, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.096500002, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.107862, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.107862, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.096496999, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.28334999, 0, 0, 0, 0, 0, 0, 1, 0]]}

The targets are in order ["mu", "alpha", "homo", "lumo", "gap", "r2", "zpve", "u0_atom", "u298_atom", "h298_atom", "g298_atom", "cv", "omega1"] and match with small differences, which I believe stem from different means/stds (because the above computes them on the entire data, and I believe for the normalisation in this dataset, they were computed only on the training data).
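That explanation checks out numerically; a quick sketch comparing the two value lists (numbers copied from the two printouts above):

```python
# normalised raw values for gdb_42 (from the pandas printout) and the
# corresponding tf-gnn-samples targets for qm9:000042, in matching order
raw_norm = [-1.7633, -5.34594, -0.877637, 1.00725, 1.40575, -3.18737,
            -1.90397, 3.75323, 3.74152, 3.73459, 3.79761, -3.63417]
tf_gnn = [-1.7729191, -5.3665919, -0.87358874, 0.99390996, 1.3954599,
          -3.1788111, -1.9290129, 3.7917917, 3.7800913, 3.7730706,
          3.8370471, -3.6345415]
max_rel_diff = max(abs(a - b) / abs(b) for a, b in zip(raw_norm, tf_gnn))
print(max_rel_diff)  # all targets agree to within ~1.5%
```

The ~1% residual differences are consistent with the means/stds having been computed on the training split only.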

Part 2: What's going on with the chemical accuracies? My initial thought was that these are the "raw" chemical accuracies; but to apply them directly to the normalised data, they also need to reflect the same normalisation (i.e., division by the stddev), and so the actual chemical accuracy should be recoverable by multiplying these values with the stddev:

>>> CHEMICAL_ACC_NORMALISING_FACTORS = [0.066513725,0.012235489, 0.071939046, 0.033730778, 0.033486113, 0.004278493, 0.001330901, 0.004165489, 0.004128926, 0.00409976, 0.004527465, 0.012292586, 0.037467458]
>>> CHEMICAL_ACC_PROP_NAMES = ["mu", "alpha", "homo", "lumo", "gap", "r2", "zpve", "u0_atom", "u298_atom", "h298_atom", "g298_atom", "cv"]
>>> CHEMICAL_ACC_FACTORS = dict(zip(CHEMICAL_ACC_PROP_NAMES, CHEMICAL_ACC_NORMALISING_FACTORS))
>>>
>>> {prop: v*stds[prop] for prop, v in CHEMICAL_ACC_FACTORS.items()}
{'mu': 0.10179182526130452, 'alpha': 0.10018127462698645, 'homo': 0.001592108304364573, 'lumo': 0.001583184240495608, 'gap': 0.001591216906359044, 'r2': 1.196934632100915, 'zpve': 4.428409661356832e-05, 'u0_atom': 0.9968535017694686, 'u298_atom': 0.9968680818259666, 'h298_atom': 0.9968572876830386, 'g298_atom': 0.9969761602650904, 'cv': 0.04993809286577978}

So now mu, alpha, R2 and cv look like near-perfect matches, but everything else is off.

But then it turns out that CHEMICAL_ACC_FACTORS["homo"] * stds["homo"] * HAR2EV comes out to 0.04332 (where we were hoping for 0.043), so check; and the same holds for lumo, gap and zpve. [And BTW, 0.043 eV is a complicated way of writing 1 kcal/mol.] The remaining values ('u0_atom', ..., 'g298_atom') seem to have a target accuracy of 1?
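To spell out that check (HAR2EV ≈ 27.2114 eV/Ha is the standard conversion constant; the products below are the homo/lumo/gap/zpve entries from the dict above):

```python
HAR2EV = 27.2114  # eV per Hartree (standard conversion constant)

# CHEMICAL_ACC_FACTORS[prop] * stds[prop], in Hartree, from the dict above
acc_in_ha = {
    'homo': 0.001592108,
    'lumo': 0.001583184,
    'gap':  0.001591217,
    'zpve': 4.428410e-05,
}
for prop, v in acc_in_ha.items():
    print(f"{prop}: {v * HAR2EV:.5f} eV")
# homo/lumo/gap come out at ~0.0433 eV (~1 kcal/mol), zpve at ~0.0012 eV
```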

So now we have an explanation for all values; the question is just whether they are correct. Table 3 in https://www.nature.com/articles/sdata201422 indicates what the units in the original data dump were. I've checked the example 000042 against data.iloc[41], and indeed, gdb9.sdf.csv contains the raw values with no transformations applied at this stage. Hence, all labels are stored in the units of Table 3 of that paper.

Now, https://arxiv.org/pdf/1702.05532.pdf Tab. 3 gives all target accuracies in eV, not in Ha, and so the target accuracy needs to be converted. For example, for homo we get 0.043 / HAR2EV == 0.00158, which conveniently is the value you get from CHEMICAL_ACC_NORMALISING_FACTORS when reverting the stddev normalisation.

So I believe this is correct: as the labels are stored and processed in normalised Ha, we need to convert the chemical accuracy values from eV to "normalised Ha" as well, which is what CHEMICAL_ACC_NORMALISING_FACTORS does. The remaining question mark is over the normalisation for u0_atom ... g298_atom - I'm not sure where the target accuracy for these stems from...
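Putting the pieces together, the stored factors can be (approximately) reconstructed with this sketch, where the stddev for homo is inferred from the numbers above rather than recomputed from the data:

```python
HAR2EV = 27.2114  # eV per Hartree (standard conversion constant)

def chem_acc_normalising_factor(acc_ev, std_ha):
    """Chemical accuracy given in eV -> 'normalised Hartree' units:
    first convert eV to Hartree, then apply the stddev normalisation."""
    return (acc_ev / HAR2EV) / std_ha

# homo: stddev ~0.02213 Ha (inferred from the values above); the true
# accuracy is ~0.04332 eV (1 kcal/mol), giving the stored factor ~0.0719
print(chem_acc_normalising_factor(0.04332, 0.02213))
```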

Does that clear up most of your questions? I'm not entirely sure what you were after, and so not sure if this helps you in a meaningful way...

gasteigerjo commented 3 years ago

That is indeed very helpful, thank you for the very extensive answer! To translate your reconstruction to my approach, we should have

chem_accuracy / CHEMICAL_ACC_NORMALISING_FACTORS = std_PyG.

And indeed we do!

 Target: GNN-FiLM ?= PyG std. dev.
      μ:    1.503 ?= 1.501
      α:    8.173 ?= 8.176
 ε_homo:    0.598 ?= 0.601
 ε_lumo:    1.275 ?= 1.273
     Δε:    1.284 ?= 1.286
   <R²>:  280.473 ?= 280.606
   ZPVE:    0.902 ?= 0.902
U0_atom:   10.323 ?= 10.332
 U_atom:   10.414 ?= 10.424
 H_atom:   10.488 ?= 10.498
 G_atom:    9.498 ?= 9.506
    c_v:    4.067 ?= 4.057
Full code (the `<R²>` entries restore strings that the page extraction stripped as HTML tags; the inferred names match the tables above):

```python
from collections import OrderedDict

import torch_geometric

path = ".../PyG_data/QM9"
qm9_pyg = torch_geometric.datasets.QM9(path)
pyg_std = qm9_pyg.data.y.std(0).numpy()

targets_pyg = ['μ', 'α', 'ε_homo', 'ε_lumo', 'Δε', '<R²>', 'ZPVE',
               'U0', 'U', 'H', 'G', 'c_v',
               'U0_atom', 'U_atom', 'H_atom', 'G_atom', 'A', 'B', 'C']

chem_accuracy_ev = OrderedDict({
    'μ': 0.1, 'α': 0.1, 'ε_homo': 0.043, 'ε_lumo': 0.043, 'Δε': 0.043,
    '<R²>': 1.2, 'ZPVE': 0.0012, 'U0': 0.043, 'U': 0.043, 'H': 0.043,
    'G': 0.043, 'c_v': 0.05, 'ω': 10.0,
})
chem_accs = list(chem_accuracy_ev.values())

CHEMICAL_ACC_NORMALISING_FACTORS = [
    0.066513725, 0.012235489, 0.071939046, 0.033730778, 0.033486113,
    0.004278493, 0.001330901, 0.004165489, 0.004128926, 0.00409976,
    0.004527465, 0.012292586, 0.037467458]

print(" Target: GNN-FiLM ?= PyG std. dev.")
pyg_idx = [0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15, 11]
for itarget, ipyg in enumerate(pyg_idx):
    conversion = chem_accs[itarget] / CHEMICAL_ACC_NORMALISING_FACTORS[itarget]
    print(f"{targets_pyg[ipyg]:>7}: {conversion: 8.3f} ?= {pyg_std[ipyg]:.3f}")
```

Thank you again for your help in reconstructing this conversion!

mmjb commented 3 years ago

Sure, happy to help!