openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Reproducing SPICE DFT values #54

Closed davkovacs closed 1 year ago

davkovacs commented 1 year ago

I am trying to reproduce the SPICE DFT values and am getting quite significant energy differences on some configurations.

Could someone please help me figure out what the problem with my setup is? I installed Psi4 1.4.1 from conda and am using Python 3.8. Below, moll is a qcelemental.models.Molecule object for a molecule from the SPICE PubChem subset.

import qcengine as qcng
from qcelemental.models.common_models import Model
from qcelemental.models import AtomicInput

# moll is a qcelemental.models.Molecule from the SPICE PubChem subset
model = Model(method="wB97M-D3BJ", basis="def2-TZVPPD")
qc_task = AtomicInput(molecule=moll, driver="energy", model=model)
result = qcng.compute(input_data=qc_task, program="psi4", task_config={"memory": 4, "ncores": 8})
energy = result.return_result

An example molecule where I see a very significant discrepancy between my number and the SPICE value is attached. Here the energy difference is ~0.25 kcal/mol. Units in the file below are Angstrom, eV, and eV/A, converted from SPICE. bad_mol.xyz.txt

pavankum commented 1 year ago

Hi @davkovacs, can you please check again after adding keywords={'wcombine': False} to your input? I'm assuming you are using the coordinates from the latest release tarball.
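(For reference: in the QCSchema layout that qcengine accepts, program keywords sit on the AtomicInput/task level rather than on the Model. A minimal sketch of the task as a plain dict, which qcng.compute also accepts, assuming that layout:)

```python
# Sketch only: the same task as in the first post, expressed as a QCSchema-style
# dict, with the suggested Psi4 option passed via the top-level "keywords" field.
atomic_input = {
    "driver": "energy",
    "model": {"method": "wB97M-D3BJ", "basis": "def2-TZVPPD"},
    "keywords": {"wcombine": False},  # Psi4 option suggested above
    # "molecule": moll,  # the SPICE molecule from the original post
}
```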

davkovacs commented 1 year ago

Adding that did not seem to change the results.

davkovacs commented 1 year ago

Could someone who did the original calculations / has access to the calculation scripts perhaps try recomputing just that one molecule and share the exact script needed to obtain the published energies and gradients?

davkovacs commented 1 year ago

Here is a notebook with a more detailed analysis that you can run; it only requires the SPICE.hdf5 file, which can be downloaded.

The energies agree to an acceptable degree in the example, but the forces are so far off that something must be really wrong.

spice_dft.tar.gz

pavankum commented 1 year ago

@davkovacs I think the hdf5 file has energies and forces stored in np.float32 format, so the values are truncated:

https://github.com/openmm/spice-dataset/blob/1e94352be364e993c8bef58303a4f079bc6b8b32/downloader/downloader.py#L115

This seems to be the source of the difference in your notebook; you can use the downloader script for better precision.

>>> import numpy
>>> energy = -8886.235809450398
>>> numpy.float32(energy)
-8886.235
>>>

I am attaching the psi4 output from which these values were obtained (it includes the input as well) for the molecule you posted in your first post. Re-running the input on a different node architecture gives me an energy within 1e-07 hartree. It also has the gradient information. Please let me know if this answers your question. qca_id_106004569_psi_stdout.txt

davkovacs commented 1 year ago

Shouldn't the default be float64? With float32, the target energy error range for ML potentials is essentially below the storage precision.
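(To put numbers on this: near the magnitudes of SPICE total energies, thousands of hartree, the gap between adjacent float32 values is about 1e-3 hartree, i.e. roughly 0.6 kcal/mol, which is on the order of chemical accuracy. A quick check with NumPy, using the energy value quoted above:)

```python
import numpy as np

energy = -8886.235809450398  # total energy in hartree, from the comment above

# Gap between this float32 value and the next representable one
spacing = float(np.spacing(np.float32(abs(energy))))

print(spacing)            # 2**-10 hartree = 0.0009765625
print(spacing * 627.509)  # ~0.61 kcal/mol, on the order of chemical accuracy
```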

davkovacs commented 1 year ago

I have managed to reproduce all the numbers now including energies and gradients. But I highly recommend changing the default download format to float64.

jchodera commented 1 year ago

@davkovacs : Thanks so much for identifying this issue! Our assumption was that subtracting the reference energies would have been sufficient to deal with this underflow precision issue, but I think you're right---we should switch to float64 (and likely repeat our training experiments on SPICE).
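(The reasoning behind that assumption can be checked directly: subtracting a float64 reference energy before casting leaves a small residual, where float32's relative precision is ample. A sketch with a hypothetical reference value in the right ballpark:)

```python
import numpy as np

total = -8886.235809450398  # total energy in hartree, from above
reference = -8886.0         # hypothetical sum of atomic reference energies

# Casting the raw total to float32 loses ~5e-4 hartree...
err_total = abs(float(np.float32(total)) - total)

# ...but casting the *difference* keeps ~1e-8 hartree precision,
# because the value is now O(0.1) instead of O(10000).
diff = total - reference
err_diff = abs(float(np.float32(diff)) - diff)

print(err_total)  # ~5e-4 hartree
print(err_diff)   # ~1e-8 hartree
```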

giadefa commented 1 year ago

Other datasets do use float64


peastman commented 1 year ago

we should switch to float64 (and likely repeat our training experiments on SPICE).

Agreed we should store in fp64 by default. But this will have no effect at all on any of our experiments. We should be so lucky as to have a model that could match the training data to 1e-7! The training error is nowhere close to the point where errors at that level make a difference.

davkovacs commented 1 year ago

That's right, this is a problem only if you look at total energies. If the interaction energy is stored in float32, that should be fine.

On a related but slightly different note, we do observe that models that use float32 are not as reliable as float64 models (in particular, smoothness problems and geometry optimisation working less well), and this observation seems to hold across multiple different model architectures.