usnistgov / jarvis

JARVIS-Tools: an open-source software package for data-driven atomistic materials design. Publications: https://scholar.google.com/citations?user=3w6ej94AAAAJ https://www.youtube.com/watch?v=2-XHeC8gbeY
https://pages.nist.gov/jarvis/

What are the units and normalization factors in the QM9 dataset? #202

Open Nokimann opened 3 years ago

Nokimann commented 3 years ago

I used the following code:

from jarvis.db.figshare import data
d = data('qm9_std_jctc')

The first entry in the QM9 dataset obtained from JARVIS:

{'mu': -1.77790756800166,
 'alpha': -7.59467417670514,
 'HOMO': -6.71425764235072,
 'LUMO': 2.24686567442436,
 'gap': 5.35591684810335,
 'R2': -4.11464477806684,
 'ZPVE': -3.14893653207103,
 'U0': 5.70989371834825,
 'U': 5.69336539320842,
 'H': 5.68508295617329,
 'G': 5.75764468354196,
 'Cv': -6.18353212813309,
 'omega1': -1.3203823354756,
 'SMILES': 'C',
 'SMILES_relaxed': 'C',
 'id': '000001',
 'atoms': {'lattice_mat': [[60, 0, 0], [0, 60, 0], [0, 0, 60]],
  'coords': [[0.4999998496686667, 0.5000001250963333, 0.4999999923633333],
   [0.5002473255336667, 0.481802867173, 0.4998995777733333],
   [0.5170736659886667, 0.5062992418296667, 0.4998712520133333],
   [0.49119790078366665, 0.5060288326963334, 0.48525591384666666],
   [0.4914812580253333, 0.5058689332046666, 0.5149732640033333]],
  'elements': ['C', 'H', 'H', 'H', 'H'],
  'abc': [60.0, 60.0, 60.0],
  'angles': [90.0, 90.0, 90.0],
  'cartesian': False,
  'props': ['', '', '', '', '']}}
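As a quick sanity check on the geometry in this entry, the fractional coordinates can be converted to Cartesian ones with the lattice matrix. This is a minimal sketch under two assumptions that the thread does not confirm: that the rows of `lattice_mat` are the lattice vectors, and that lengths are in Angstrom. The helper `frac_to_cart` is a hypothetical name, not a JARVIS function.

```python
import math

# Values copied from the JARVIS entry above (only the C atom and the first H).
lattice_mat = [[60, 0, 0], [0, 60, 0], [0, 0, 60]]
frac_coords = [
    [0.4999998496686667, 0.5000001250963333, 0.4999999923633333],  # C
    [0.5002473255336667, 0.481802867173, 0.4998995777733333],      # H
]


def frac_to_cart(frac, lattice):
    """Convert a fractional coordinate to Cartesian.

    Assumes each row of `lattice` is one lattice vector, so r = sum_i f_i * a_i.
    """
    return [sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3)]


cart = [frac_to_cart(f, lattice_mat) for f in frac_coords]

# Distance between C and the first H; for methane this should be a
# plausible C-H bond length if the length unit really is Angstrom.
c_h_distance = math.dist(cart[0], cart[1])
print(f"C-H distance: {c_h_distance:.4f}")
```

The result is close to 1.09, a typical C-H bond length in Angstrom, which suggests that the geometry is only re-centred in a 60 x 60 x 60 box rather than rescaled.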

And the original first entry in the QM9 dataset, followed by the format description:

5
gdb 1   157.7118    157.70997   157.70699   0.  13.21   -0.3877 0.1171  0.5048  35.3641 0.044749    -40.47893   -40.476062  -40.475117  -40.498597  6.469   
C   -0.0126981359    1.0858041578    0.0080009958   -0.535689
H    0.002150416    -0.0060313176    0.0019761204    0.133921
H    1.0117308433    1.4637511618    0.0002765748    0.133922
H   -0.540815069     1.4475266138   -0.8766437152    0.133923
H   -0.5238136345    1.4379326443    0.9063972942    0.133923
1341.307    1341.3284   1341.365    1562.6731   1562.7453   3038.3205   3151.6034   3151.6788   3151.7078
C   C   
InChI=1S/CH4/h1H4   InChI=1S/CH4/h1H4

Line       Content
----       -------
1          Number of atoms na
2          Properties 1-17 (see below)
3,...,na+2 Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3       Frequencies (3na-5 or 3na-6)
na+4       SMILES from GDB9 and for relaxed geometry
na+5       InChI for GDB9 and for relaxed geometry

The properties stored in the second line of each file:

I.  Property  Unit         Description
--  --------  -----------  --------------
 1  tag       -            "gdb9"; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of Highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of Lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K

I. = Property index (properties are given in this order)
For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.
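The line layout and property order described above can be turned into a small parser. The following is a minimal sketch, not part of JARVIS or of the QM9 distribution: `parse_qm9_record` and `PROPERTY_NAMES` are hypothetical names, and the parser assumes whitespace-separated fields with plain decimal numbers, as in the methane record quoted above.

```python
# Names of properties 3-17 in the order given by the table above.
PROPERTY_NAMES = ["A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
                  "r2", "zpve", "U0", "U", "H", "G", "Cv"]


def parse_qm9_record(text):
    """Parse one raw QM9 record into a dictionary (values keep their raw units)."""
    lines = text.strip().splitlines()
    na = int(lines[0])                      # line 1: number of atoms
    fields = lines[1].split()               # line 2: tag, index, properties 3-17
    props = dict(zip(PROPERTY_NAMES, (float(v) for v in fields[2:17])))
    atoms = []
    for line in lines[2:2 + na]:            # lines 3..na+2: element, x, y, z, charge
        element, x, y, z, charge = line.split()
        atoms.append((element, float(x), float(y), float(z), float(charge)))
    return {
        "na": na,
        "tag": fields[0],
        "index": int(fields[1]),
        "props": props,
        "atoms": atoms,
        "frequencies": [float(v) for v in lines[2 + na].split()],  # line na+3
        "smiles": lines[3 + na].split(),                           # line na+4
        "inchi": lines[4 + na].split(),                            # line na+5
    }


# The methane record quoted above, used as a usage example.
SAMPLE = """5
gdb 1 157.7118 157.70997 157.70699 0. 13.21 -0.3877 0.1171 0.5048 35.3641 0.044749 -40.47893 -40.476062 -40.475117 -40.498597 6.469
C -0.0126981359 1.0858041578 0.0080009958 -0.535689
H 0.002150416 -0.0060313176 0.0019761204 0.133921
H 1.0117308433 1.4637511618 0.0002765748 0.133922
H -0.540815069 1.4475266138 -0.8766437152 0.133923
H -0.5238136345 1.4379326443 0.9063972942 0.133923
1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034 3151.6788 3151.7078
C C
InChI=1S/CH4/h1H4 InChI=1S/CH4/h1H4
"""

record = parse_qm9_record(SAMPLE)
print(record["props"]["homo"], record["props"]["lumo"], record["props"]["gap"])
```

Keeping the raw values in their original units (Hartree, Bohr^3, and so on) makes it easy to compare them against any converted or standardized copy of the dataset.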

I found that the units are converted and the values are normalized. For example, HOMO, LUMO, ... appear to be converted from Hartree to eV and then standardized using the mean and standard deviation of the entire dataset.

How can I get the unit and the mean/std factors for each property?
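For reference, the transformation suspected here is a unit conversion followed by z-score standardization, which is undone by multiplying by the standard deviation and adding the mean. The sketch below only illustrates that arithmetic. The mean and std values in it are HYPOTHETICAL placeholders, not the real dataset statistics, and whether JARVIS stores HOMO in eV is the poster's observation rather than something confirmed in this thread.

```python
HARTREE_TO_EV = 27.211386245988  # CODATA 2018 conversion factor


def standardize(x, mean, std):
    """z-score: subtract the dataset mean, divide by the dataset std."""
    return (x - mean) / std


def destandardize(z, mean, std):
    """Inverse of standardize: recover the value in real units."""
    return z * std + mean


# HOMO of the first molecule in the raw QM9 record, in Hartree.
homo_hartree = -0.3877
homo_ev = homo_hartree * HARTREE_TO_EV

# HYPOTHETICAL placeholder statistics, for illustration only.
# The real values would have to come from the dataset's own statistics file.
placeholder_mean_ev = -6.0
placeholder_std_ev = 0.5

z = standardize(homo_ev, placeholder_mean_ev, placeholder_std_ev)
recovered_ev = destandardize(z, placeholder_mean_ev, placeholder_std_ev)
print(f"HOMO = {homo_ev:.4f} eV, z = {z:.4f}, recovered = {recovered_ev:.4f} eV")
```

With the true per-property mean and std in place of the placeholders, `destandardize` would map each standardized value back to real units.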

knc6 commented 3 years ago

Hi,

The QM9 dataset is adapted from the GDrive link provided by Faber et al. They provide the mean/std in the qm9-prop-stats-v1 file and the normalized dataset in the qm9-mol-info-standardized-v1 file. The units can be found in Faber et al. (Tables 3 and 4) or Choudhary et al. (Table 5).

Nokimann commented 3 years ago

Thank you, @knc6. So we can't currently load the mean/std directly from JARVIS?

gasteigerjo commented 2 years ago

I don't think it's a good idea to provide only standardized data, as it invites the same evaluation error as in ALIGNN. I've observed this confusion between scaled and original data (and between internal energy and atomization energy) on QM9 in multiple previous papers as well.

It would be great if you would instead provide the data in real units, as done e.g. by PyG: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9
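To make the scaled-versus-original confusion concrete: a mean absolute error (MAE) computed on standardized targets is smaller or larger than the MAE in real units by exactly the factor std, so the two numbers cannot be compared directly. The sketch below uses made-up numbers purely for illustration. It is not data from QM9 and is not a reconstruction of any specific paper's evaluation.

```python
from statistics import mean, pstdev

# Made-up target and predicted values in "real" units (illustration only).
targets = [-10.5, -7.2, -6.8, -6.1, -5.9]
predictions = [-10.1, -7.0, -7.1, -6.0, -6.3]

# Statistics of the targets, as would be used for standardization.
mu = mean(targets)
sigma = pstdev(targets)


def mae(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def scale(values):
    """Standardize values with the target mean and std."""
    return [(v - mu) / sigma for v in values]


mae_real = mae(targets, predictions)
mae_scaled = mae(scale(targets), scale(predictions))

# The mean cancels in the difference, so only the std remains as a factor.
print(f"MAE (real units): {mae_real:.4f}")
print(f"MAE (scaled):     {mae_scaled:.4f}")
print(f"std * MAE(scaled): {sigma * mae_scaled:.4f}")
```

Reporting the scaled MAE next to literature values given in real units would misstate the error by the factor std, which is why shipping the data in real units (or shipping the statistics alongside) avoids the problem.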