Extract some information from results of `opt` and `scrf` Gaussian jobs

FanwangM commented 3 years ago

I am working on a project that requires the following information from the geometry optimization result (log file) of a Gaussian 16 job:

dipole (line 20341-20342 in test_opt.log)
electronic spatial extent (au) (line 20339 in test_opt.log)
HOMO and LUMO (lines 20183-20184 in test_opt.log)
rotational constants (GHZ) (line 20156 in test_opt.log)

and the following from the SCRF job:

SMD-CDS (non-electrostatic) energy (line 824 in test_scrf.log)
GePol cavity information (lines 605 to 615 in test_scrf.log)

I know you are working heavily on database construction, so you may already have an unmerged implementation of this. To avoid duplicating work, could you share it if you do? @leila-pujal @FarnazH Thanks! Otherwise, I can implement this feature and try to merge it into IOData.

test_opt.log test_scrf.log

FarnazH commented 3 years ago

@fwmeng88 Actually, I just needed to get some geometry information out of a Gaussian log file, so it would be very useful to parse more information from the log file. As far as I can tell, we don't have any unmerged implementation, so please feel free to share your work.

FanwangM commented 3 years ago

Thanks for letting me know, @FarnazH. I have a working script for now, but I will refactor it and make a PR. The plan is to put all the information into the extra section of the dictionary.

Here is the script that I wrote:

import re

import pandas as pd
from iodata.utils import LineIterator

__author__ = "Fanwang Meng @ Ayers Lab"
__date__ = "2021.April.24"
__version__ = "0.0.2"

def extract_qm_log(log_fpath,
                   tag=None,
                   output_fname=None):
    """Extract quantum chemical descriptors from Gaussian optimization log file."""
    lit = LineIterator(log_fpath)

    data_dict = {}
    electro_spat_ext = []
    nuclear_repulsion_energies = []
    # R6Disp:  Grimme-D2 Dispersion energy
    dispersion_energies = []
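    # nuclear repulsion after empirical dispersion term
    nuclear_repulsion_dispersion = []
    # orbital energies; defined up front so they exist even if the log
    # contains no population analysis section
    alpha_occ_eigenvalues = []
    alpha_virt_eigenvalues = []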

    while True:
        try:
            line = next(lit).strip()
        except StopIteration:
            break
        # dipole moment
        if line.startswith("Dipole moment"):
            line = next(lit).strip()
            dipole_list = line.split()

            data_dict["dipole_x"] = float(dipole_list[1])
            data_dict["dipole_y"] = float(dipole_list[3])
            data_dict["dipole_z"] = float(dipole_list[5])
            data_dict["dipole_total"] = float(dipole_list[7])
        # quadrupole moment
        elif line.startswith("Quadrupole moment"):
            line = next(lit).strip()
            quadrupole_list = line.split()
            data_dict["quadrupole_xx"] = float(quadrupole_list[1])
            data_dict["quadrupole_yy"] = float(quadrupole_list[3])
            data_dict["quadrupole_zz"] = float(quadrupole_list[5])
            line = next(lit).strip()
            quadrupole_list = line.split()
            data_dict["quadrupole_xy"] = float(quadrupole_list[1])
            data_dict["quadrupole_xz"] = float(quadrupole_list[3])
            data_dict["quadrupole_yz"] = float(quadrupole_list[5])
        # electronic spatial extent (au)
        elif line.startswith("Electronic spatial extent"):
            electro_spat_ext.append(float(line.split()[-1]))
        # reset the lists at each population analysis so that only the last record is kept
        elif line.startswith("Population analysis using the SCF Density"):
            alpha_occ_eigenvalues = []
            alpha_virt_eigenvalues = []
        # The last value of the Alpha occ. eigenvalues gives the HOMO energy and the first
        # value of the Alpha virt. eigenvalues gives the LUMO energy.
        # HOMO
        elif line.startswith("Alpha  occ. eigenvalues --"):
            alpha_occ_eigenvalues.extend(line.split()[4:])
        # LUMO
        elif line.startswith("Alpha virt. eigenvalues --"):
            alpha_virt_eigenvalues.extend(line.split()[4:])
        # rotational constants
        elif line.startswith("Rotational constants (GHZ)"):
            rotational_constants = line.split()[3:]
            data_dict["rot_const_x"] = float(rotational_constants[0])
            data_dict["rot_const_y"] = float(rotational_constants[1])
            data_dict["rot_const_z"] = float(rotational_constants[2])
        # symmetry point group
        elif line.startswith("Full point group"):
            data_dict["point_group"] = line.split()[3]
        # nuclear repulsion energy in Hartrees
        elif line.startswith("nuclear repulsion energy"):
            nuclear_repulsion_energies.append(line.split()[-2])
        # R6Disp:  Grimme-D2 Dispersion energy in Hartrees
        elif line.startswith("R6Disp:  Grimme-D2 Dispersion energy"):
            dispersion_energies.append(line.split()[-2])
        # nuclear repulsion after empirical dispersion term
        elif line.startswith("Nuclear repulsion after empirical dispersion term"):
            nuclear_repulsion_dispersion.append(line.split()[-2])
        # PCM non-electrostatic energy
        elif line.startswith("PCM non-electrostatic energy"):
            data_dict["PCM_non_electrostatic_energy"] = float(line.split()[-2])
        # nuclear repulsion after PCM non-electrostatic terms
        elif line.startswith("Nuclear repulsion after PCM non-electrostatic terms"):
            data_dict["nuclear_repulsion_after_pcm"] = float(line.split()[-2])
        # KE, PE and EE
        elif line.startswith("KE="):
            data_dict["KE"] = float(line.split()[1].replace("D", "e"))
            data_dict["PE"] = float(line.split()[2].split("=")[-1].replace("D", "e"))
            data_dict["EE"] = float(line.split()[-1].replace("D", "e"))
        # SMD-CDS (non-electrostatic) energy, kcal/mol
        elif line.startswith("SMD-CDS (non-electrostatic) energy"):
            data_dict["SMD-CDS"] = float(line.split()[-1])
        # GePol: Number of generator spheres
        elif line.startswith("GePol: Number of generator spheres"):
            data_dict["GePol_num_gen_spheres"] = int(line.split()[-1])
        # GePol: Total number of spheres
        elif line.startswith("GePol: Total number of spheres "):
            data_dict["GePol_total_num_spheres"] = int(line.split()[-1])
        # GePol: Number of exposed spheres
        elif line.startswith("GePol: Number of exposed spheres"):
            # raw string so that "\(" is passed to the regex engine unchanged
            data_dict["GePol_num_exposed_spheres"] = int(re.split(r"=|\(", line)[1])
        # GePol: Number of points
        elif line.startswith("GePol: Number of points                             ="):
            data_dict["GePol_num_points"] = int(line.split()[-1])
        # GePol: Average weight of points
        elif line.startswith("GePol: Average weight of points"):
            data_dict["GePol_average_weight"] = float(line.split()[-1])
        # GePol: Minimum weight of points
        elif line.startswith("GePol: Minimum weight of points"):
            data_dict["GePol_minimum_weight"] = float(line.split()[-1].replace("D", "e"))
        # GePol: Maximum weight of points
        elif line.startswith("GePol: Maximum weight of points"):
            data_dict["GePol_maximum_weight"] = float(line.split()[-1].replace("D", "e"))
        # GePol: Number of points with low weight
        elif line.startswith("GePol: Number of points with low weight"):
            data_dict["GePol_num_points_low_weight"] = int(line.split()[-1])
        # GePol: Fraction of low-weight points (<1% of avg)
        elif line.startswith("GePol: Fraction of low-weight"):
            data_dict["GePol_frac_low_weight"] = float(line.split()[-1].strip('%')) / 100
        # GePol: Cavity surface area, ang**2
        elif line.startswith("GePol: Cavity surface area"):
            data_dict["GePol_cavity_surface"] = float(line.split()[-2])
        # GePol: Cavity volume, ang ** 3
        elif line.startswith("GePol: Cavity volume"):
            data_dict["GePol_cavity_volume"] = float(line.split()[-2])

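    # keep the last record of each repeated quantity; skip quantities that never
    # appeared in this log (e.g. dispersion terms in a job without empirical dispersion)
    if electro_spat_ext:
        data_dict["electro_spat_ext"] = electro_spat_ext[-1]
    if alpha_occ_eigenvalues:
        data_dict["HOMO"] = float(alpha_occ_eigenvalues[-1])
    if alpha_virt_eigenvalues:
        data_dict["LUMO"] = float(alpha_virt_eigenvalues[0])
    if dispersion_energies:
        data_dict["grimme_D2_dispersion_energy"] = float(dispersion_energies[-1])
    if nuclear_repulsion_energies:
        data_dict["nuclear_repulsion_energy"] = float(nuclear_repulsion_energies[-1])
    if nuclear_repulsion_dispersion:
        data_dict["nuclear_repulsion_dispersion"] = float(nuclear_repulsion_dispersion[-1])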

    if tag is not None:
        data_dict = {k + "_" + tag: v for (k, v) in data_dict.items()}

    df = pd.DataFrame(data_dict, index=[0])
    if output_fname:
        if output_fname.endswith(".csv"):
            df.to_csv(output_fname, sep=",", index=False)
        elif output_fname.endswith(".xlsx") or output_fname.endswith(".xls"):
            df.to_excel(output_fname, index=False)

    return data_dict, df
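
The script above reads values by token position (e.g. `dipole_list[1]`) and converts Fortran-style exponents with `replace("D", "e")`. As a possible refinement for the refactoring, here is a minimal sketch of a helper that reads all `label= value` pairs on a line, whether or not the value is separated from the `=` by a space. The name `parse_labeled_floats` and the sample lines are made up for illustration; they are not part of IOData and are not taken from the attached log files.

```python
import re

# Matches "LABEL=" followed by an optionally signed number, with an optional
# exponent written with D or E (Fortran or C style).
_LABELED_FLOAT = re.compile(
    r"([A-Za-z]+)=\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[DdEe][-+]?\d+)?)"
)


def parse_labeled_floats(line):
    """Return a {label: float} dict for a line such as 'X= 1.0 Y= 2.0'."""
    result = {}
    for label, number in _LABELED_FLOAT.findall(line):
        # Python's float() does not accept the Fortran "D" exponent marker.
        result[label] = float(number.replace("D", "e").replace("d", "e"))
    return result


# Hypothetical lines in the same layout the script expects.
dipole = parse_labeled_floats("X= 0.0000 Y= 0.0000 Z= -2.0622 Tot= 2.0622")
energies = parse_labeled_floats("KE= 1.5D+02 PE=-3.6D+02 EE= 6.1D+01")
```

With such a helper, the dipole, quadrupole and `KE=` branches could share one code path instead of relying on fixed token indices, at the cost of one regular-expression search per matched line.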

tovrstra commented 3 years ago

@fwmeng88 This would be a welcome addition!

I have a small question, but it is likely not an issue: long chains of elif statements can become slow because each line is tested against every condition until one matches. I'm actually not sure how to do it faster. A dictionary lookup could select the right parsing code more quickly, but one would need to know which part of the line to use as the key, e.g.

def func1(line):
    # do something with line
    ...
def func2(line):
    # do something different with line
    ...
funcs = {"begin1": func1, "begin2": func2}
funcs[line[:6]](line)

This would not have a cost that scales linearly with the number of such functions. The problem with my suggestion is that I don't see how to combine it with line.startswith.
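
One possible way to combine the two ideas, offered only as a sketch: use the first whitespace-separated word of the line as the dictionary key, and run `startswith` only on the few prefixes that share that first word. The helper names (`build_dispatch`, `dispatch`) and the two handlers are hypothetical, and the sample lines are made up. The sketch assumes that every registered prefix begins with a complete word of the line.

```python
from collections import defaultdict


def parse_point_group(line, data):
    data["point_group"] = line.split()[3]


def parse_rotational_constants(line, data):
    data["rot_const"] = [float(x) for x in line.split()[3:]]


def build_dispatch(handlers):
    """Group (prefix, handler) pairs by the first word of each prefix."""
    table = defaultdict(list)
    for prefix, func in handlers.items():
        table[prefix.split()[0]].append((prefix, func))
    return table


def dispatch(table, line, data):
    """Run the handler whose prefix matches line; return True if one matched."""
    words = line.split(maxsplit=1)
    if not words:
        return False
    # Dictionary lookup narrows the candidates; startswith picks among them.
    for prefix, func in table.get(words[0], ()):
        if line.startswith(prefix):
            func(line, data)
            return True
    return False


table = build_dispatch({
    "Full point group": parse_point_group,
    "Rotational constants (GHZ)": parse_rotational_constants,
})
data = {}
dispatch(table, "Full point group                 C2V     NOp   4", data)
dispatch(table, "Rotational constants (GHZ):  1.0 2.0 3.0", data)
```

The gain depends on how the prefixes are distributed: lines starting with `GePol:` would still share one bucket and be checked one after another, so this reduces the number of comparisons per line without making it constant.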

FanwangM commented 3 years ago

Thanks for the suggestions, @tovrstra.

In my experience, this parsing is fast (~5-10 seconds), so I will keep this style for now and revisit it if it becomes a bottleneck.

theochem / iodata

Extract some information from results of `opt` and `scrf` Gaussian jobs #278