Closed AndChenCM closed 1 year ago
You're absolutely right. It's only counting hydrogens that are explicitly mentioned in the SMILES string. Here's the code I used to generate that table.
from rdkit import Chem
import h5py
from collections import defaultdict
types = defaultdict(int)
infile = h5py.File('SPICE.hdf5')
for group in infile:
try:
smiles = infile[group]['smiles'][0]
mol = Chem.MolFromSmiles(smiles)
num = infile[group]['conformations'].shape[0]
for atom in mol.GetAtoms():
key = (atom.GetAtomicNum(), atom.GetSymbol(), atom.GetFormalCharge())
types[key] += num
except:
print(group)
for key in sorted(types):
print(f'{key[1]}\t{key[2]}\t{types[key]}')
I just added the line
mol = Chem.rdmolops.AddHs(mol)
to make it build implicit hydrogens, and it now gives a much more reasonable number: 15,207,949.
Sorry about that, and thanks for catching it!
Thanks for confirming that! This makes much more sense now.
Hi,
I notice that there are about 1 million molecules in SPICE dataset, but only 1594 instances of ('H', 0) as shown in Table 2 in your published paper. The number '1594' seems unusually low to me. I wonder during DFT calculations, did you use explicit hydrogens for some molecules and implicit hydrogens for the rest? Or '1594' just means the explicit 'H's that are shown in smiles?