store_wavefunction not propagated

When creating an OptimizationDataset, I would expect store_wavefunction to get propagated into the generated QCSpecification. However, I don't think this is currently happening -- should it be?

Code

from openff.qcsubmit.datasets import OptimizationDataset, BasicDataset
from qcelemental.models.results import WavefunctionProtocolEnum

dataset = OptimizationDataset(
    dataset_name="OpenFF multi-Br ESP Fragment Conformers v1.0",
    dataset_tagline="HF/6-31G* conformers of diverse fragment molecules with multiple Br.",
    description=(
        "A dataset containing molecules from the "
        "`OpenFF ESP Fragment Conformers v1.0` with modifications. "
        "Molecules with multiple Cl atoms were selected from the "
        "dataset and the Cl atoms were replaced with Br.\n\n"
        "For each molecule, a set of up to 5 conformers were generated by:\n"
        "  * generating a set of up to 1000 conformers with a RMS cutoff of 0.5 Å "
        "using the OpenEye backend of the OpenFF toolkit\n"
        "  * applying ELF conformer selection (max 5 conformers) using OpenEye\n\n"
        "Each conformer will be converged according to the 'GAU_LOOSE' criteria."
    ),
)
dataset.clear_qcspecs()
dataset.add_qc_spec(
    program="psi4",
    method="hf",
    basis="6-31G*",
    spec_name="HF/6-31G*",
    spec_description="The standard HF/6-31G* basis used to derive RESP style charges.",
    store_wavefunction=WavefunctionProtocolEnum.orbitals_and_eigenvalues
)
dataset._get_specifications()

{'HF/6-31G*': OptimizationSpecification(program='geometric', qc_specification=QCSpecification(program='psi4', driver=<SinglepointDriver.deferred: 'deferred'>, method='hf', basis='6-31g*', keywords={'maxiter': 200, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)), keywords={'coordsys': 'dlc', 'enforce': 0.0, 'epsilon': 1e-05, 'reset': True, 'qccnv': False, 'molcnv': False, 'check': 0, 'trust': 0.1, 'tmax': 0.3, 'maxiter': 300, 'convergence_set': 'GAU'}, protocols=OptimizationProtocols(trajectory=<TrajectoryProtocolEnum.all: 'all'>))}

Looking at the qc_specifications they do seem to be there:

dataset.qc_specifications["HF/6-31G*"].store_wavefunction

<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>

It also works for a BasicDataset:

dataset = BasicDataset(
    dataset_name="OpenFF multi-Br ESP Fragment Conformers v1.0",
    dataset_tagline="HF/6-31G* conformers of diverse fragment molecules with multiple Br.",
    description=(
        "A dataset containing molecules from the "
        "`OpenFF ESP Fragment Conformers v1.0` with modifications. "
        "Molecules with multiple Cl atoms were selected from the "
        "dataset and the Cl atoms were replaced with Br.\n\n"
        "For each molecule, a set of up to 5 conformers were generated by:\n"
        "  * generating a set of up to 1000 conformers with a RMS cutoff of 0.5 Å "
        "using the OpenEye backend of the OpenFF toolkit\n"
        "  * applying ELF conformer selection (max 5 conformers) using OpenEye\n\n"
        "Each conformer will be converged according to the 'GAU_LOOSE' criteria."
    ),
)
dataset.clear_qcspecs()
dataset.add_qc_spec(
    program="psi4",
    method="hf",
    basis="6-31G*",
    spec_name="HF/6-31G*",
    spec_description="The standard HF/6-31G* basis used to derive RESP style charges.",
    store_wavefunction=WavefunctionProtocolEnum.orbitals_and_eigenvalues
)
dataset._get_specifications()

{'HF/6-31G*': QCSpecification(program='psi4', driver=<SinglepointDriver.energy: 'energy'>, method='hf', basis='6-31g*', keywords={'maxiter': 200, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>))}

(cc https://github.com/openforcefield/qca-dataset-submission/pull/335#issuecomment-1835378295)

Version: 0.50.1

Hi @lilyminium good spot, I see that we try to propagate it to qcfractal here when making the OptimizationSpecification but it seems to get lost along the way. With some digging in qcfractal I found that the OptimizationSpecification forces the use of the default single-point protocol which is removing our wavefunction settings. My guess is to stop the wavefunction from being saved for an entire trajectory but we could have this restriction removed if we pass a protocol which only saves the final result from an optimization. cc @bennybp @dotsdl

QCElemental right now doesn't support protocols for the gradient calculations. The structure in QCElemental is a bit different than QCFractal: QCFractal reuses a full QCSpecification object, but QCElemental has its own, which unfortunately doesn't have protocols: https://github.com/MolSSI/QCElemental/blob/e942b810f1681b7d38209d0fed55b49954e6e4b5/qcelemental/models/procedures.py#L56

So I fix the protocols to be the default to at least maintain consistency in the DB (I should probably add a comment there). This was a huge pain during the migration from the legacy version to the current version.

This is part of a QCElemental refactor which has been a long time coming, unfortunately.

Since I'm taking over maintenance of QCSubmit, I took some time to wrap my head around this. Here are notes, maybe just for myself, though I'd really appreciate correction if I'm misunderstanding something:

The high-level problem is that we can't currently submit an QCS OptimizationDataset that requests wavefunction evaluations.
The conceptual reason we can't do that is because the QC spec used when running an optimization doesn't distinguish between "use this QC spec on every gradient evaluation" and "use this QC spec only on the final evaluation". Saving wavefunctions on each gradient step would be very wasteful, and most users will only want them for the final step.
The technical problem is what @bennybp posted just above - QCEngine doesn't support protocols for gradients. So basically, when running QCS's dataset.submit, the resulting QCF dataset has its qc_specification's protocols attribute overwritten.
We can get wavefunctions from BasicDataset (not optimization) submissions. This is tested here and I'm able to rerun locally and see that the resulting QCP SinglePointRecords do indeed have a .wavefunction attribute populated by a ton of numbers.
So it IS possible to get wavefunctions for these evaluations currently, but we'd need to submit a new BasicDataset using the final molecules from the existing optimization dataset's results. (and now that I look, @lilyminium has done exactly that here)
In the past, it ALSO wasn't possible to do optimizations and just get wavefunctions for final structures. It looks like it may have been technically possible to submit an optimization dataset that requests wavefunctions, but it would have taken a ton of storage space if that actually went through.
In new QCF, the driver value for optimizations has changed from gradient to deferred. This change in wording makes me think that there are plans for datasets that say "here is a QC spec, but DEFER evaluating it until the gradient evaluations are done". I didn't dig far into this, but I'm optimistic that this is the meaning.
@bennybp's post just above this one reiterates that it wasn't and isn't possible to request wavefunctions in optimizations. I'm not sure if "This is part of a QCElemental refactor which has been a long time coming, unfortunately." means "unfortunately it will continue NOT being possible after a refactor that has/will happen" or "it will be possible after the refactor, which unfortunately will take a very long time". Could you clarify?

So I think the right solution is that the force_qcspec validator should either warn or raise an error when a protocol is overwritten. Does that sound good to you @bennybp? If not, we can have QCSubmit raise an error in that situation instead.

Largely correct, with a few more clarifications:

* So it IS possible to get wavefunctions for these evaluations currently, but we'd need to submit a new BasicDataset using the final molecules from the existing optimization dataset's results. (and now that I look, @lilyminium has done exactly that [here](https://github.com/openforcefield/qca-dataset-submission/pull/341))

I vaguely plan (with no timetable) to have this built in, since it will be very common: https://github.com/MolSSI/QCFractal/issues/748

Another option is to overhaul the qcelemental models to take in two specifications, but that would be a bit more disruptive.

* In new QCF, the `driver` value for  optimizations has changed from `gradient` to `deferred`. This change in wording makes me think that there are plans for datasets that say "here is a QC spec, but DEFER evaluating it until the gradient evaluations are done". I didn't dig far into this, but I'm optimistic that this is the meaning.

I wouldn't be so optimistic :)

That value pertains only to the driver (which is typically energy, gradient, hessian). There's an awkwardness when we allow that to be set by the user for an optimization, because it is largely ignored - whether an energy, gradient, or hessian is run is controlled by the optimization software, not the user. So what would it mean to submit an optimization with driver = energy when geometric is just going to do gradients anyway. So the (admittedly hacky) idea was to make this deferred value for the enum - the overall type of calculation is being deferred to the optimization software.

* @bennybp's post just above this one reiterates that it wasn't and isn't possible to request wavefunctions in optimizations. I'm not sure if "This is part of a QCElemental refactor which has been a long time coming, unfortunately." means "unfortunately it will continue NOT being possible after a refactor that has/will happen" or "it will be possible after the refactor, which unfortunately will take a very long time". Could you clarify?

It's been on the table for a good while now, along with some qcelemental/qcschema splitting ideas. They just end up taking a back seat to other pressing issues.

So I think the right solution is that the force_qcspec validator should either warn or raise an error when a protocol is overwritten. Does that sound good to you @bennybp? If not, we can have QCSubmit raise an error in that situation instead.

I agree. The current behavior is technically correct but hidden. So raising an exception is better than silently changing their input.

Awesome. Thanks for the thorough response. This makes a lot of sense, and I think it'd be great to put an exception in QCFractal when protocols are overwritten :-)

openforcefield / openff-qcsubmit

store_wavefunction not propagated #244