Compile NIST database - Githubissues

gabrielasd commented 3 months ago

This issue contributes to the completion of issue #8

To compile this database, place the files c6cp04533b1.csv and database_beta_1.3.0.h5, which are currently in atomdb/data, in a folder named raw under atomdb/datasets/nist. (Important: the raw folder and its data must not be added to source control; only developers need access to it.)

More generally, to generate the data files with the tools in the api module, the program expects the folder structure MYDATAPATH/DATASET/raw, where DATASET is the folder for the specific database (here, nist), raw is the folder containing the initial files that are processed to create the standardized database information (what SpeciesData defines), and MYDATAPATH is a path leading to DATASET (the path set by the keyword argument datapath shown below).
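The layout described above can be checked before compiling. The following is a minimal sketch, not part of AtomDB: the helper names `raw_dir` and `missing_raw_files` are hypothetical, and the file names are the two mentioned in this issue.

```python
from pathlib import Path

# Raw input files named in this issue.
RAW_FILES = ("c6cp04533b1.csv", "database_beta_1.3.0.h5")


def raw_dir(datapath, dataset):
    """Return MYDATAPATH/DATASET/raw as a Path (hypothetical helper)."""
    return Path(datapath) / dataset / "raw"


def missing_raw_files(datapath, dataset="nist", required=RAW_FILES):
    """List the required raw files that are not present in the raw folder."""
    folder = raw_dir(datapath, dataset)
    return [name for name in required if not (folder / name).is_file()]
```

Calling `missing_raw_files(mydatapath)` before compiling should return an empty list once both files are in place.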

The serialized data can then be generated with the compile function from the API:

atomdb.compile(atnum, charge, mult, 0, database, datapath=mydatapath)

The database argument names the specific source of raw data (in this case, the nist dataset), and the optional datapath argument sets the path to the dataset folder. If you placed the raw data inside the atomdb package, as suggested above, there is no need to pass datapath; it takes the value defined by the environment variable DEFAULT_DATAPATH defined in API. Passing it lets you specify a custom path in which to look for the raw files.

For example, to create the MessagePack file for the neutral beryllium atom from the nist raw data (placed in the default path), run:

import atomdb

atnum = 4
charge = 0
mult = 1
database = "nist"
atomdb.compile(atnum, charge, mult, 0, database)

gabrielasd commented 3 months ago

@maximilianvz we currently do not have functionality to compile multiple atomic species at once, so you will need to use a for loop or something similar.

To get an idea of the atom-charge combinations that need to be added, you can look at the c6cp04533b1.csv file. In general, it goes from H to Lr, with charges ranging from -2 to Z-1.

Figuring out the multiplicity will get tricky (at least for me) once you get to elements in row 4 of the periodic table. For this you can import the dictionary multiplicities from atomdb.utils. You can see a usage example here: https://github.com/theochem/AtomDB/blob/e2098b662d0c99967c1c16de6e0f18519b60f90e/atomdb/utils.py#L199-L201
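The loop suggested above could look roughly like the sketch below. It is only an illustration: `species_pairs` and `compile_all` are hypothetical helpers, and because the way the multiplicities dictionary is keyed is not stated in this thread, the multiplicity lookup is passed in as a function rather than guessed.

```python
def species_pairs(max_atnum=103, min_charge=-2):
    """Yield (atnum, charge) for H to Lr with charge from min_charge to Z-1."""
    for atnum in range(1, max_atnum + 1):
        # range() excludes its upper bound, so the last charge is atnum - 1.
        for charge in range(min_charge, atnum):
            yield atnum, charge


def compile_all(compile_one, multiplicity_of, database="nist"):
    """Call compile_one for every species, collecting failures instead of stopping."""
    failures = []
    for atnum, charge in species_pairs():
        try:
            mult = multiplicity_of(atnum, charge)
            # Same positional arguments as the atomdb.compile call shown earlier.
            compile_one(atnum, charge, mult, 0, database)
        except Exception as exc:
            failures.append((atnum, charge, repr(exc)))
    return failures
```

Here `compile_one` would be `atomdb.compile`, and `multiplicity_of` a small wrapper around the multiplicities dictionary. Collecting failures rather than raising makes it easier to list the species for which raw data is missing.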

Please let me know if you have any questions.

maximilianvz commented 3 months ago

@gabrielasd and @msricher, in nist/__init__.py there are some places where data values are missing, and the way they are handled causes problems. For example, hydrogen with charge -2 doesn't have an ionization potential in c6cp04533b1.csv, so this line assigns None to ip. In the next line, ip gets multiplied by some constants, but if ip is None, this raises an error:

Would a suitable fix be to add an if ip is not None: check above that problematic line? Thanks.

maximilianvz commented 3 months ago

@gabrielasd and @msricher, in addition to the above, I have a question about the h5 file:

As explained in the documentation, nist/__init__.py gets the energy for the most stable electronic configuration (charge and multiplicity) from database_beta_1.3.0.h5. There are some cases where data exists for a given configuration in c6cp04533b1.csv, but not in database_beta_1.3.0.h5. For example, Yttrium (atomic number 39) with charge +32 has entries in the CSV file but not the H5 file.

In the case of anions, we just set energy = None, because the H5 file doesn't contain data on these species. In cases like Y with charge +32, should I set energy = None as with anions, or raise a ValueError, as is done when the ground-state multiplicity isn't passed?

gabrielasd commented 3 months ago

> @gabrielasd and @msricher , in nist/__init__.py, there are some points where data values are missing, and the way they get dealt with throws things off. For example, Hydrogen with charge -2 doesn't have an Ionization Potential in c6cp04533b1.csv. Therefore, this line assigns None to ip. In the next line, ip gets multiplied by some constants, but if ip = None, this raises an error:
>
> Is a suitable fix for this to add an if ip is not None: clause above that problematic line? Thanks.

@maximilianvz nice catch, I missed this. I think that if we assign the unit conversion here https://github.com/theochem/AtomDB/blob/a09667edb02a53ce990e44777115d31f75e63f99/atomdb/datasets/nist/__init__.py#L133 to a variable, we could apply it in the previous line, as a factor multiplying the float value https://github.com/theochem/AtomDB/blob/a09667edb02a53ce990e44777115d31f75e63f99/atomdb/datasets/nist/__init__.py#L132. That line already handles the case where there is no data for the property (I think).
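A minimal sketch of that idea, applying the conversion factor in the same expression that already guards against missing data. This is not the code in nist/__init__.py: `parse_property` is a hypothetical helper, and the eV-to-Hartree factor assumes the raw column is in eV, which is not stated in this thread.

```python
# Assumption: the raw value is in eV; 1 Hartree = 27.211386245988 eV (CODATA 2018).
EV_TO_HARTREE = 1.0 / 27.211386245988


def parse_property(raw, factor=1.0):
    """Return float(raw) * factor, or None when the CSV cell is empty or missing."""
    if raw is None or str(raw).strip() == "":
        return None
    return float(raw) * factor
```

With this shape, `ip = parse_property(cell, EV_TO_HARTREE)` yields None for species such as hydrogen with charge -2, and no separate multiplication line is left to fail.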

Also Max, my bad, don't finish compiling yet. I thought we had everything we needed merged in already, but we still need PR #51, which changed a few of the properties that are being compiled. I'll do this soon and let you know.

gabrielasd commented 3 months ago

> As explained in the documentation, nist/__init__.py gets the energy for the most stable electronic configuration (charge and multiplicity) from database_beta_1.3.0.h5. There are some cases where data exists for a given configuration in c6cp04533b1.csv, but not in database_beta_1.3.0.h5. For example, Yttrium (atomic number 39) with charge +32 has entries in the CSV file but not the H5 file.
>
> In the case of anions, we just set energy = None, because the H5 file doesn't contain data on these species. In cases like Y with charge +32, should I just set energy = None like with anions, or should I raise a ValueError like when the ground state multiplicity isn't passed?

Yes, to me this also would be the right thing to do to be consistent. What do you think @msricher?

PaulWAyers commented 3 months ago

I think it's good to put anion data in based on the work I did with Carlos and Farnaz. Those aren't from NIST but they are similar-quality experimental data.

If you give me a list of cases where the h5 file doesn't have the data, I'll see what I can figure out. For example, for $\text{Y}^{32+}$ I can read from the NIST web site that it has a quartet ground state (${}^4S$) that is 32,730,000 $\text{cm}^{-1}$ lower in energy than $\text{Y}^{33+}$.

https://physics.nist.gov/cgi-bin/ASD/energy1.pl?de=0&spectrum=yt+32&submit=Retrieve+Data&units=0&format=0&output=0&page_size=15&multiplet_ordered=0&conf_out=on&term_out=on&level_out=on&unc_out=1&j_out=on&lande_out=on&perc_out=on&biblio=on&temp=
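To use a value like this alongside energies in atomic units, the wavenumber has to be converted. The sketch below uses CODATA 2018 conversion factors; the function names are hypothetical, and whether AtomDB stores such values in Hartree is an assumption here.

```python
# CODATA 2018 conversion factors.
HARTREE_TO_INVCM = 219474.6313632  # 1 Hartree expressed in cm^-1
HARTREE_TO_EV = 27.211386245988    # 1 Hartree expressed in eV


def invcm_to_hartree(wavenumber):
    """Convert an energy difference from cm^-1 to Hartree."""
    return wavenumber / HARTREE_TO_INVCM


def invcm_to_ev(wavenumber):
    """Convert an energy difference from cm^-1 to eV."""
    return invcm_to_hartree(wavenumber) * HARTREE_TO_EV


# Energy difference quoted above: 32,730,000 cm^-1.
gap_hartree = invcm_to_hartree(32_730_000)
gap_ev = invcm_to_ev(32_730_000)
```

The quoted difference comes out at roughly 149.13 Hartree (about 4058 eV). Since the 32+ ion lies lower in energy, this amount would be subtracted from the energy of the 33+ ion.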

gabrielasd commented 3 months ago

Hi @maximilianvz, it took a bit longer than expected, but we are now ready to resume the compilation of the datasets!

gabrielasd commented 2 months ago

Michelle has already compiled this DB and uploaded it to the AtomDBdata repo where we are hosting the databases. This issue can be closed.

I opened issue #75 to keep track of the cases I was aware could have failed during compilation.

theochem / AtomDB

Compile NIST database #55