microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License

QM9 preprocessing Chemical Unit #14

Closed Frank-LIU-520 closed 3 years ago

Frank-LIU-520 commented 3 years ago

I find that the target values in /data/qm9/train.jsonl.gz do not match the real values in the original dataset.

For example, the molecule with QM9 id 000001 should be methane, so the first target, the dipole moment (mu), should be 0 rather than [-1.7779076]. Likewise, the other 11 targets no longer carry their physical meaning.

{"targets": [[-1.7779076], [-7.5946741], [-6.7142577], [2.2468657], [5.355917], [-4.114645], [-3.1489365], [5.7098937], [5.6933656], [5.6850829], [5.7576447], [-6.1835322], [-1.3203824]], "graph": [[0, 1, 1], [0, 1, 2], [0, 1, 3], [0, 1, 4]], "id": "qm9:000001", "node_features": [[0, 1, 0, 0, 0, 6, -0.535689, 0, 0, 0, 0, 0, 1, 0, 4], [1, 0, 0, 0, 0, 1, 0.133921, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.133922, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.13392299, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1, 0.13392299,0, 0, 0, 0, 0, 0, 1, 0]]}

The original values for the QM9:000001 molecule (methane) should be as follows:

| smiles | mu | alpha | homo | lumo | gap | r2 | zpve | cv | u0 | u298 | h298 | g298 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C | 0 | 13.21 | -0.3877 | 0.1171 | 0.5048 | 35.3641 | 0.044749 | 6.469 | -40.4789 | -40.4761 | -40.4751 | -40.4986 |

What method did you use to preprocess the QM9 targets? How can we scale the MAE back to chemical units, as in the paper? I would be glad of any help. For reference, the units may be shown in [https://arxiv.org/pdf/1712.06113v3.pdf].

mmjb commented 3 years ago

Heya,

Thanks for looking at the code. As you observed, the labels in the stored dataset are indeed transformed from the initial data. Concretely, the applied transformation is a common normalisation for regression targets: each target is shifted by its mean and divided by its standard deviation. See #13 for a detailed discussion of this transformation, and of the notion of "chemical accuracy" that is taken from the Gilmer et al. paper.

Marc
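
For readers who want to undo the normalisation described above, here is a minimal sketch in Python, not the repository's actual preprocessing code. It assumes the per-target mean and standard deviation are recomputed from the raw QM9 training targets, since the thread does not state them, and that the population standard deviation was used (also an assumption). The helper names are hypothetical, and apart from methane's mu and alpha in the first row, the toy values are invented for illustration.

```python
import numpy as np


def fit_normaliser(raw_targets):
    """Per-target mean and standard deviation over the rows of raw_targets."""
    raw_targets = np.asarray(raw_targets, dtype=np.float64)
    # Assumption: population std (ddof=0); the thread does not say which was used.
    return raw_targets.mean(axis=0), raw_targets.std(axis=0)


def normalise(raw, mean, std):
    """Shift by the mean and divide by the standard deviation."""
    return (np.asarray(raw, dtype=np.float64) - mean) / std


def denormalise(z, mean, std):
    """Invert normalise(): map normalised values back to original units."""
    return np.asarray(z, dtype=np.float64) * std + mean


def mae_in_original_units(mae_normalised, std):
    """Rescale a per-target MAE measured on normalised values.

    The mean shift cancels in the difference (prediction - label),
    so only the standard deviation is needed.
    """
    return np.asarray(mae_normalised, dtype=np.float64) * std


# Toy example: two targets (mu, alpha) for three molecules.
# Row 0 holds methane's mu and alpha; rows 1 and 2 are made up.
raw = np.array([[0.0, 13.21], [1.5, 9.0], [3.0, 20.0]])
mean, std = fit_normaliser(raw)
z = normalise(raw, mean, std)
restored = denormalise(z, mean, std)
```

Because the shift cancels, an MAE reported on the stored (normalised) targets can be converted to original units by multiplying by the standard deviation of that target. Comparing the result against "chemical accuracy" additionally needs the per-target thresholds discussed in #13.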