openmm / spice-dataset

A collection of QM data for training potential functions
MIT License
155 stars 9 forks source link

Reference Energy for Isolated Iodine #107

Closed jhmlam closed 3 months ago

jhmlam commented 3 months ago

Hi Spice authors,

Thank you for maintaining this great dataset, I enjoy every part of it! I have a question on the reference energies of the Iodine containing compounds.

As I understand, it is trying to imitate the nuclei-electron total binding energy of each isolated atom, so the magnitude of it should increase with the atomic number and a simple thomas-fermi model can suffice as an estimation. See my snippet below. You can see that for all other elements in the SPICE dataset except iodine, the TF estimation is pretty good i.e. mostly 1-2% smaller than the reference energy (in Hartree...). However, iodine in the SPICE 2.0.1 dataset seems to have a very small reference energy, only ~4% of the TF energy.

This makes me hesitate a bit on using the iodine compounds. Can anyone enlighten me on this? @peastman Am I wrong at its physical meaning? How was the isolated atom energy calculated?

P.s. I also checked if the basis set is defined for iodine, and it is affirmative. https://psicode.org/psi4manual/master/basissets_byelement.html but I have not looked into their code.

Thank you once again.

Best, Jordy

import tqdm
import numpy as np
import h5py

User_AtomicNumberDeadEnd = 99

def OOC1_Za_ScottTotalIonization(Za):
    # NOTE  J.M.C. Scott (1952) LXXXII. The binding energy of the Thomas-Fermi Atom,
    #       The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 43:343,
    #       859-867, DOI: 10.1080/14786440808520234
    return -1 * (
        (0.7687 * (Za**(7/3)
                   ) - (
                       Za**(6/3)
                       )/2 + 0.221 * (Za**(5/3) 
                           )  + 4* (10**(-6)) * Za**(9/2))) 

Volume_MolPropFloat = {}
Volume_AtomPropInt_ZaInt = {}
Volume_SummedReferenceEnergy = {}
with h5py.File("SPICE-2.0.1.hdf5") as h5:
            idx = 0

            keys = list(h5.keys())

            for h5keys, properties in tqdm.tqdm(sorted(h5.items())): 

                species = np.array(properties['atomic_numbers'], dtype = int)

                energies = np.array(properties['dft_total_energy'], dtype = np.float32) 
                formation_energies = np.array(properties['formation_energy'], dtype = np.float32)
                reference_energies = energies - formation_energies

                List_identifier = list(range(energies.shape[0])) 

                for i_name in List_identifier:
                     row_idx = np.array([i_name,])
                     Volume_SummedReferenceEnergy[idx]= reference_energies[row_idx]
                     Volume_AtomPropInt_ZaInt[idx] = np.array(species,dtype= int)
                     idx +=1

X_SAE = []
y_SAE = []
ScottTotalIonization_Reference = OOC1_Za_ScottTotalIonization(
                                     np.arange(0,User_AtomicNumberDeadEnd))

for idx in tqdm.tqdm(Volume_AtomPropInt_ZaInt.keys()):

          CountOfElements = np.bincount(
              Volume_AtomPropInt_ZaInt[idx], 
              minlength=User_AtomicNumberDeadEnd)
          X_SAE.append(ScottTotalIonization_Reference * CountOfElements)
          y_SAE.append(Volume_SummedReferenceEnergy[idx])
X_SAE = np.array(X_SAE)
y_SAE = np.array(y_SAE)

coefficients, residuals, rank, s = np.linalg.lstsq(X_SAE, y_SAE, rcond=None)

print(coefficients)

The total ionization energy from TF model:

ScottTotalIonization_Reference 
array([-0.00000000e+00, -4.89704000e-01, -2.57572702e+00, -6.85757196e+00,
       -1.37533514e+01, -2.35980925e+01, -3.66765953e+01, -5.32393955e+01,
       -7.35119410e+01, -9.77004304e+01, -1.25995797e+02, -1.58576566e+02,
       -1.95610988e+02, -2.37258677e+02, -2.83671907e+02, -3.34996654e+02,
       -3.91373448e+02, -4.52938090e+02, -5.19822248e+02, -5.92153972e+02,
       -6.70058134e+02, -7.53656813e+02, -8.43069632e+02, -9.38414057e+02,
       -1.03980566e+03, -1.14735836e+03, -1.26118463e+03, -1.38139569e+03,
       -1.50810171e+03, -1.64141193e+03, -1.78143482e+03, -1.92827824e+03,
       -2.08204953e+03, -2.24285565e+03, -2.41080328e+03, -2.58599892e+03,
       -2.76854901e+03, -2.95855995e+03, -3.15613826e+03, -3.36139062e+03,
       -3.57442395e+03, -3.79534548e+03, -4.02426282e+03, -4.26128401e+03,
       -4.50651762e+03, -4.76007275e+03, -5.02205911e+03, -5.29258711e+03,
       -5.57176784e+03, -5.85971318e+03, -6.15653580e+03, -6.46234925e+03,
       -6.77726796e+03, -7.10140731e+03, -7.43488365e+03, -7.77781436e+03,
       -8.13031788e+03, -8.49251374e+03, -8.86452258e+03, -9.24646623e+03])

and it prints the fitting coefficients:

100%|██████████| 113986/113986 [00:27<00:00, 4178.36it/s]
100%|██████████| 2008126/2008126 [00:03<00:00, 668032.73it/s]
[[-6.11086836e-17]
 [ 1.01188943e+00]
 [ 1.52100554e-13]
 [ 1.06382023e+00]
 [-4.43312054e-13]
 [ 1.04584677e+00]
 [ 1.03268779e+00]
 [ 1.02587738e+00]
 [ 1.02184348e+00]
 [ 1.02134619e+00]
 [ 5.49560397e-15]
 [ 1.02236786e+00]
 [ 1.01871124e+00]
 [ 5.79536419e-14]
 [ 1.02030146e+00]
 [ 1.01883405e+00]
 [ 1.01735516e+00]
 [ 1.01603045e+00]
 [ 0.00000000e+00]
 [ 1.01293310e+00]
 [ 1.01029396e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 9.95405450e-01]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 4.19309449e-02]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 0.00000000e+00]]
peastman commented 3 months ago

The reference energies for isolated atoms were computed with this script. The downloader script finds the lowest possible energy for the isolated atoms that make up a molecule, given all possible ways of dividing up the charge among them.

Making a wild guess about why the energy for iodine would be so much lower, I believe this basis set uses effective core potentials for the heavier elements. So the energy of an isolated iodine atom only reflects the binding energy of the outer electrons. Those are also the only ones included in molecular energies, so that's the right reference to subtract.

jhmlam commented 3 months ago

Thank you Peter! I see, I think it makes sense with the ECPs.