openmm / spice-dataset

A collection of QM data for training potential functions
MIT License
147 stars 9 forks source link

The instances of ('H', 0) shown in SPICE paper seem unusually low #63

Closed AndChenCM closed 1 year ago

AndChenCM commented 1 year ago

Hi,

I notice that there are about 1 million molecules in SPICE dataset, but only 1594 instances of ('H', 0) as shown in Table 2 in your published paper. The number '1594' seems unusually low to me. I wonder during DFT calculations, did you use explicit hydrogens for some molecules and implicit hydrogens for the rest? Or '1594' just means the explicit 'H's that are shown in smiles?

peastman commented 1 year ago

You're absolutely right. It's only counting hydrogens that are explicitly mentioned in the SMILES string. Here's the code I used to generate that table.

from rdkit import Chem
import h5py
from collections import defaultdict

types = defaultdict(int)
infile = h5py.File('SPICE.hdf5')
for group in infile:
    try:
        smiles = infile[group]['smiles'][0]
        mol = Chem.MolFromSmiles(smiles)
        num = infile[group]['conformations'].shape[0]
        for atom in mol.GetAtoms():
            key = (atom.GetAtomicNum(), atom.GetSymbol(), atom.GetFormalCharge())
            types[key] += num
    except:
        print(group)
for key in sorted(types):
    print(f'{key[1]}\t{key[2]}\t{types[key]}')

I just added the line

        mol = Chem.rdmolops.AddHs(mol)

to make it build implicit hydrogens, and it now gives a much more reasonable number: 15,207,949.

Sorry about that, and thanks for catching it!

AndChenCM commented 1 year ago

Thanks for confirming that! This makes much more sense now.