openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Charges are missing #42

Open raimis opened 2 years ago

raimis commented 2 years ago

The dataset file (https://github.com/openmm/spice-dataset/releases/download/1.0/SPICE.hdf5) doesn't contain the total molecular charge. It could be extracted by parsing the SMILES, but that is inconvenient and puts an additional burden on users.

The dataset should provide the complete QM description of a molecule (i.e. elements, positions, charge, and spin state) in a convenient form. The downloader should be modified to add a field with the total charge (and maybe formal charges) for each molecule.
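A minimal sketch of what such a "complete QM description" might look like as a record. The class and field names (`QMRecord`, `atomic_numbers`, `positions`, `total_charge`, `multiplicity`) are hypothetical and are not part of the current SPICE file layout. The only physics encoded is a parity rule: the electron count (sum of atomic numbers minus total charge) and the number of unpaired electrons (multiplicity minus one) must have the same parity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QMRecord:
    """Hypothetical container; field names are illustrative, not SPICE's."""

    atomic_numbers: List[int]
    positions: List[Tuple[float, float, float]]
    total_charge: int
    multiplicity: int  # spin multiplicity, 2S + 1

    def num_electrons(self) -> int:
        # Neutral-atom electron count minus the net molecular charge.
        return sum(self.atomic_numbers) - self.total_charge

    def is_consistent(self) -> bool:
        # One position per atom, and a physically meaningful multiplicity.
        if len(self.atomic_numbers) != len(self.positions):
            return False
        if self.multiplicity < 1:
            return False
        # The electrons left after removing the unpaired ones must pair up.
        n = self.num_electrons()
        unpaired = self.multiplicity - 1
        return n >= unpaired and (n - unpaired) % 2 == 0
```

A check like this can catch a charge or spin state that cannot belong to the listed elements, but it cannot tell which of several allowed spin states was actually computed.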

Also, I would suggest including Mulliken charges (Psi4 computes them by default). They could be used to filter "broken" molecules. From my recent experience, the large forces aren't enough to catch them all.

peastman commented 2 years ago

Also, I would suggest including Mulliken charges (Psi4 computes them by default).

I don't think it's possible to add them without recomputing the whole dataset from scratch. Perhaps @pavankum or @dotsdl can confirm that?

pavankum commented 2 years ago

Yeah, I agree with @peastman that we would have to recompute from scratch. That said, MBIS charges are available on almost all SPICE sets (except the DES370K supplement). Even the psi4 stdout on the QCA records doesn't have Mulliken charges printed out, since they're calculated post-SCF (afaik) and we didn't request them in our inputs.
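Because per-atom partial charges sum to the net molecular charge, the MBIS charges mentioned above could serve as a cross-check on a total charge obtained some other way. The sketch below is only an illustration under stated assumptions: the function name and the 0.05 e tolerance are hypothetical choices, and the input is a plain sequence of floats rather than anything read from the SPICE file.

```python
from typing import Sequence


def total_charge_from_partials(partial_charges: Sequence[float],
                               tol: float = 0.05) -> int:
    """Round the sum of partial charges to the nearest integer charge.

    Raises ValueError when the sum is further than `tol` (an arbitrary,
    illustrative threshold) from an integer, which may hint at a problem
    with the record.
    """
    total = sum(partial_charges)
    nearest = round(total)
    if abs(total - nearest) > tol:
        raise ValueError(
            f"partial charges sum to {total:.4f}, not close to an integer"
        )
    return int(nearest)
```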

raimis commented 2 years ago

Let's go back to the main issue: how to get the molecular charge and, optionally, the partial charges. I want the SPICE loader in TorchMD-NET (https://github.com/torchmd/torchmd-net/blob/main/torchmdnet/datasets/spice.py) to be able to load them.

peastman commented 2 years ago

Having the downloader create a formal_charges field would be reasonable. You can retrieve them like this:

from rdkit import Chem

mol = Chem.MolFromSmiles(smiles)
charges = [atom.GetFormalCharge() for atom in mol.GetAtoms()]

If you want MBIS partial charges, you can already store those by including the 'MBIS CHARGES' option in the config file.
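The two-line snippet above can be wrapped into a small helper that also returns the total charge. This is a sketch, not the downloader's code: the function name is hypothetical, and it adds a guard because `Chem.MolFromSmiles` returns `None` when parsing fails. Note that the per-atom list follows RDKit's atom order after its default parsing, which may merge explicit hydrogens into their neighbours, so it is not guaranteed to line up with the atom order in the dataset. The total charge is unaffected by that.

```python
from typing import List, Tuple

from rdkit import Chem


def formal_charges_from_smiles(smiles: str) -> Tuple[List[int], int]:
    """Return (per-atom formal charges in RDKit order, total charge)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit signals a parse failure by returning None.
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    charges = [atom.GetFormalCharge() for atom in mol.GetAtoms()]
    return charges, sum(charges)
```

For acetate, `formal_charges_from_smiles("CC(=O)[O-]")` gives a per-atom list with a single -1 entry and a total of -1.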

jchodera commented 2 years ago

The dataset should provide the complete QM description of a molecule (i.e. elements, positions, charge, and spin state) in a convenient form.

@bennybp: Any chance you have thought about how to represent this information in a common way in your HDF5 files built for machine learning?

jchodera commented 2 years ago

Also, I would suggest including Mulliken charges (Psi4 computes them by default). They could be used to filter "broken" molecules. From my recent experience, the large forces aren't enough to catch them all.

I just wanted to point you to this PR that shows how to identify molecules that have changed connectivity as an alternative to using charges to filter "broken" molecules.

davkovacs commented 1 year ago

I would like to extend the SPICE dataset, and am trying to reproduce some of the QM calculations to ensure my DFT settings are correct.

I would strongly support adding at least the total charge and spin multiplicity of each molecule to the dataset, to improve the reproducibility of the DFT calculations.

jokpreiksa commented 1 year ago

@davkovacs, you can easily extract this information from the SMILES using RDKit:

FOR MULTIPLICITY:

def GetSpinMultiplicity(Mol, CheckMolProp = True):
    Name = 'SpinMultiplicity'
    if (CheckMolProp and Mol.HasProp(Name)):
        return int(float(Mol.GetProp(Name)))

    # Calculate spin multiplicity using Hund's rule of maximum multiplicity...
    NumRadicalElectrons = 0
    for Atom in Mol.GetAtoms():
        NumRadicalElectrons += Atom.GetNumRadicalElectrons()

    TotalElectronicSpin = NumRadicalElectrons/2
    SpinMultiplicity = 2 * TotalElectronicSpin + 1

    return int(SpinMultiplicity)

FOR CHARGE:

charge = Chem.GetFormalCharge(molecule)
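Once the charge and multiplicity are known, reproducing a calculation mostly comes down to passing them to the QM code. In a Psi4 molecule specification they go on the first line as `charge multiplicity`, followed by one line per atom. The helper below only builds that string; its name and the `units` parameter are illustrative assumptions, and the caller is responsible for passing the unit the positions are actually stored in.

```python
from typing import Sequence, Tuple


def psi4_molecule_string(symbols: Sequence[str],
                         positions: Sequence[Tuple[float, float, float]],
                         charge: int,
                         multiplicity: int,
                         units: str = "angstrom") -> str:
    """Build a Psi4-style molecule block (hypothetical helper)."""
    if len(symbols) != len(positions):
        raise ValueError("need exactly one position per atom")
    # First line: net charge and spin multiplicity.
    lines = [f"{charge} {multiplicity}"]
    # Then one "symbol x y z" line per atom.
    for sym, (x, y, z) in zip(symbols, positions):
        lines.append(f"{sym} {x:.8f} {y:.8f} {z:.8f}")
    # Declare the length unit explicitly instead of relying on a default.
    lines.append(f"units {units}")
    return "\n".join(lines)
```

The resulting string could then be handed to `psi4.geometry(...)`. Matching the method, basis set, and other DFT settings used for SPICE is a separate matter that this sketch does not cover.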