openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

How to submit calculations #82

Closed peastman closed 8 months ago

peastman commented 9 months ago

We need to figure out how we're going to submit calculations for SPICE 2. For version 1 we used the https://github.com/openforcefield/qca-dataset-submission repository. It uses github automation, so that you submit calculations by creating pull requests. Status updates are automatically posted to the PR, and it handles error cycling. The automation scripts use QCSubmit, which is a layer on top of the QCPortal API to provide additional useful features.

Due to a major redesign of QCPortal, QCSubmit is currently not functional. It's being updated, but there isn't a firm date for when the new version will be ready. Also, the new version of QCPortal itself provides many of the features QCSubmit was created to add. According to @j-wags, QCSubmit still provides useful features for optimization datasets. For single point datasets, which is what we're using, the benefits are much smaller.

So we have a few questions to decide.

  1. Should we wait for QCSubmit, or just write our own scripts that use the QCPortal API directly?
  2. Should we try to automate the process through github actions, or just create scripts that we run by hand?

It looks to me like creating submissions should be about equally simple either way. Having status updates automatically posted to github is convenient but probably not that important; it's easy to write a script that you run to check the status. Error cycling might be more complicated, although I'm hoping it's no longer as important: newer versions of psi4 seem to be much better about avoiding the sort of intermittent errors that error cycling exists to handle. And while generating v1, we wasted a lot of computation time on error cycling, repeatedly resubmitting the same calculations that kept failing.
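
For concreteness, a minimal status-check script might look like this (a sketch: the dataset name is a placeholder, and it assumes the current QCPortal API where a dataset's status() summarizes record states per specification):

from qcportal import PortalClient

# Connection details are read from the standard config file.
client = PortalClient.from_file()
# 'SPICE PubChem Set 1' is a hypothetical dataset name.
ds = client.get_dataset('singlepoint', 'SPICE PubChem Set 1')
# Prints counts of waiting/running/complete/error records per specification.
print(ds.status())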

Another benefit of the github approach is that it provides a record of the submission, but that too may be unnecessary for our purposes. All our submissions will be the same type of dataset, and they'll all use exactly the same level of theory and other settings. I believe the only inputs needed for a submission will be the dataset name and the HDF5 file containing the conformations (which is already in the repository).

cc @dotsdl

peastman commented 9 months ago

Here's a first pass at a script for submitting datasets (not yet tested).

from qcportal import PortalClient
from qcportal.singlepoint import QCSpecification
from qcportal.molecules import Molecule
from openmm.unit import nanometer, bohr
import openff.toolkit
import openff.units
import numpy as np
import h5py
import sys

dataset_name = sys.argv[1]
filename = sys.argv[2]
input_file = h5py.File(filename, 'r')
client = PortalClient.from_file()
# QCArchive stores geometries in bohr; the HDF5 file stores nanometers.
scale = (1*nanometer).value_in_unit(bohr)
keywords = {'maxiter': 200,
            'scf_properties': ['dipole', 'quadrupole', 'wiberg lowdin indices', 'mayer indices', 'mbis charges', 'mbis dipoles', 'mbis quadrupoles', 'mbis octupoles'],
            'wcombine': False}
spec = QCSpecification(program='psi4', driver='gradient', method='wb97m-d3bj', basis='def2-tzvppd', keywords=keywords)
dataset = client.add_dataset('singlepoint', dataset_name)
dataset.add_specification('wb97m-d3bj/def2-tzvppd', spec)
for group in input_file:
    smiles = input_file[group]['smiles'].asstr()[0]
    conformations = np.array(input_file[group]['conformations'])*scale
    # Rebuild the molecule from the mapped SMILES to get element symbols and the total charge.
    ffmol = openff.toolkit.topology.Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
    symbols = [atom.symbol for atom in ffmol.atoms]
    total_charge = sum(atom.formal_charge/openff.units.unit.elementary_charge for atom in ffmol.atoms)
    for i, conformation in enumerate(conformations):
        # Attach the mapped SMILES so it is stored with each conformation.
        molecule = Molecule(symbols=symbols, geometry=conformation.flatten(), molecular_charge=total_charge, canonical_isomeric_explicit_hydrogen_mapped_smiles=smiles)
        # Entry names must be unique, so append the conformation index.
        dataset.add_entry(f'{group}-{i}', molecule)
dataset.submit(tag='spice-psi4-181')
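
Usage would be something like python submit.py 'SPICE PubChem Set 1' pubchem.hdf5 (hypothetical dataset name and file).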

Does that look generally correct? A first test will be to resubmit the DES monomers dataset with it. If everything is working correctly, QCArchive should recognize that all the samples are identical to ones that already exist and skip calculating anything. Assuming that works, the next test would be to force it to recompute them by adding a small offset to the positions and see if the results match what we got before.
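
For the second test, a small helper along these lines could perturb the geometries before building each Molecule (a sketch; the offset size is arbitrary and the function name is hypothetical):

import numpy as np

def perturb(conformation, offset=1e-4):
    # Shift every coordinate (in bohr) by a small constant so the server
    # no longer recognizes the molecule as a duplicate and recomputes it.
    return (np.asarray(conformation) + offset).flatten()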

peastman commented 9 months ago

I edited the above script with a few changes.

j-wags commented 9 months ago

Checking this out by analogy to OFFMol.to_qcschema, it looks like the two things this doesn't have are:

  1. an explicit multiplicity
  2. the CMILES stored where QCArchive expects it

For how we set multiplicity, I'm not totally sure since I'm not a QM person. Some hints are:

So, I think for "reasonable" molecules we're always treating multiplicity as 1. For radicals other values may be necessary.

I think the CMILES may need to go somewhere else (maybe there's now an actual "cmiles" attribute, but I vaguely recall mention of a properties dict or an identifiers attribute).

To test this without risking problems on the central QCArchive, I'd recommend spinning up a QCFractal "snowflake" (a mini server living in a local process) with a really cheap psi4 method, and then making sure that the outputs look reasonable (especially that the CMILES makes it through and the atom ordering seems sane).

peastman commented 9 months ago

I've tried a few different methods of getting it to store the SMILES, including identifiers = Identifiers(canonical_isomeric_explicit_hydrogen_mapped_smiles=smiles) and extras = {'canonical_isomeric_explicit_hydrogen_mapped_smiles': smiles}. Neither one works. The property gets set on the Molecule object before I call add_entry(). But when I query the dataset for either entries or records, it's not there anymore.

Using a Snowflake I can create the dataset, but no calculations get run. The status of the record just stays as 'waiting'. Do I need to do something else to make it run calculations?

peastman commented 9 months ago

@bennybp do you have any idea what I'm doing wrong that's causing the above problems?

j-wags commented 9 months ago

In QCSubmit's tests, we use the fulltest_client testing fixture from QCFractal, which makes a snowflake with some local compute workers. That method is distributed in the qcarchivetesting conda package. You may be able to use that function verbatim to get a snowflake that can run a few local calculations.
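
If the fixture doesn't work out, a plain snowflake with its built-in local workers may be enough. A rough sketch, with method names taken from the current QCFractal and worth double-checking:

from qcfractal.snowflake import FractalSnowflake

s = FractalSnowflake()   # starts a temporary server plus local compute workers
client = s.client()
# ... add a dataset, entries, and a cheap specification, then submit ...
s.await_results()        # block until the queued calculations finish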

bennybp commented 9 months ago

I haven't looked too deeply, but my hunch is that the molecule already exists on the server without the identifiers, and add_entry won't overwrite that. You can test this by translating the molecule a little and seeing what happens.

If that is the case, you can keep the old molecule and change the identifiers afterwards with the PortalClient. (https://github.com/MolSSI/QCFractal/blob/e8d9cba50b1e59bf8ff85992cd9dc8f94158fe1b/qcportal/qcportal/client.py#L473).
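
Based on the method linked above, that would look roughly like this (a sketch; mol_id and smiles are placeholders, and the argument names are assumed from the linked source):

client.modify_molecule(
    mol_id,
    identifiers={'canonical_isomeric_explicit_hydrogen_mapped_smiles': smiles},
)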

For the snowflake issue, you might not have the QM package installed in the backend. Something like this conda env should work (assuming you are using psi4): https://github.com/MolSSI/QCFractal/blob/main/qcarchivetesting/conda-envs/fulltest_snowflake.yaml

peastman commented 9 months ago

As far as I can tell all the dependencies are there. Is there a way to get an error message saying why it isn't running anything?

my hunch is that the molecule already exists on the server without the identifiers

This is with a snowflake. There's nothing on the server.

peastman commented 9 months ago

Here's an error message:

Process ForkProcess-2:
Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractal/snowflake.py", line 60, in _compute_process
    compute = ComputeManager(compute_config)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/compute_manager.py", line 133, in __init__
    self.app_manager = AppManager(self.manager_config)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/apps/app_manager.py", line 108, in __init__
    qcengine_functions = discover_programs_conda(None)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/apps/app_manager.py", line 34, in discover_programs_conda
    result = subprocess.check_output(cmd, universal_newlines=True, cwd=tmpdir)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python3', '/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/run_scripts/qcengine_list.py']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/run_scripts/qcengine_list.py", line 12, in <module>
    progs = {x: qcengine.get_program(x).get_version() for x in qcengine.list_available_programs()}
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcfractalcompute/run_scripts/qcengine_list.py", line 12, in <dictcomp>
    progs = {x: qcengine.get_program(x).get_version() for x in qcengine.list_available_programs()}
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcengine/programs/psi4.py", line 91, in get_version
    self.version_cache[which_prog] = safe_version(exc["stdout"].split()[-1])
IndexError: list index out of range

The error happens while querying psi4. The code

for x in qcengine.list_available_programs():
    print(x)
    print(qcengine.get_program(x).get_version())

prints

rdkit
2023.3.1
xtb
20.2
openmm
8.0.0
psi4

Here's the psi4 version in my conda environment:

psi4                      1.8.1            py39haabd4ea_2    conda-forge

loriab commented 9 months ago

what is psi4 --version, please? could there be multiple psi4's around? and check if qcengine is at 0.28.1

peastman commented 9 months ago

I think the conda package for psi4 on Mac is broken. Just running psi4 produces a Python exception. I tried a different computer running Linux, and there it's able to correctly enumerate the programs.

Running on that computer, the status is briefly listed as running and top shows that psi4 is running. Then it stops and the status switches to error. How can I find out what the error was? The documentation says, "The error and possibly the stdout/stderr properties may have more details about the error." But the record object has no attribute called error, stdout, or stderr.

What is the correct way to store the SMILES? When I create the Molecule I specify extras = {'canonical_isomeric_explicit_hydrogen_mapped_smiles': smiles}. If I then immediately print out molecule.extras it prints

{'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[Br:1][C:4]([Br:2])([Br:3])[H:5]'}

But when I query the record from the dataset, it lists extras=None.

peastman commented 9 months ago

and check if qcengine is at 0.28.1

It's 0.27.0. Conda doesn't find anything newer.

peastman commented 9 months ago

It's because psi4 pins the version. If I try to force qcengine=0.28.1 I get

Encountered problems while solving:
  - package psi4-1.8.1-py311hedf2024_2 requires qcengine >=0.27.0,<0.28.0a0, but none of the providers can be installed

bennybp commented 9 months ago

As to why a record is waiting: that question inspired me to make it a feature (not available yet, but it will be a nice addition: https://github.com/MolSSI/QCFractal/pull/759)

For the identifiers part: It seems to work for me. The identifiers are attached to the molecule:

from qcfractal.snowflake import FractalSnowflake
from qcportal.molecules import Molecule

s = FractalSnowflake()
c = s.client()

m = Molecule(symbols=['h', 'h'],
             geometry=[0, 0, 0, 0, 0, 1], 
             identifiers={'canonical_isomeric_explicit_hydrogen_mapped_smiles': "abc123"}
)

ds = c.add_dataset('singlepoint', 'test dataset')
ds.add_entry('test_entry', molecule=m)

# Re-get the dataset
ds = c.get_dataset('singlepoint', 'test dataset')
entry = ds.get_entry('test_entry')
print(entry.molecule.identifiers)

ds.add_specification('test_spec', {'program': 'psi4', 'driver': 'energy', 'method': 'b3lyp', 'basis': '6-31g'})
ds.submit()

rec = ds.get_record('test_entry', 'test_spec')
print(rec.molecule.identifiers)

The value for canonical_isomeric_explicit_hydrogen_mapped_smiles appears in both.

You could also add them to entry.attributes as well, but I think attaching them to the molecule makes sense

peastman commented 9 months ago

Perhaps I was making things too complicated by trying to create an Identifiers object. I take it the correct usage is just to pass a dict?

I want to create the new datasets in a way that's consistent with the existing ones. I'm looking at the existing 'SPICE DES Monomers Single Points Dataset v1.1' dataset. For the molecules in that dataset, the canonical SMILES is not present in identifiers:

Identifiers(molecule_hash='e11edc2979035fc70f58366ce13b6bb707adaf18', molecular_formula='C2H3N', smiles=None, inchi=None, inchikey=None, canonical_explicit_hydrogen_smiles=None, canonical_isomeric_explicit_hydrogen_mapped_smiles=None, canonical_isomeric_explicit_hydrogen_smiles=None, canonical_isomeric_smiles=None, canonical_smiles=None, pubchem_cid=None, pubchem_sid=None, pubchem_conformerid=None)

Instead it's in extras:

{'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:4][C:3]([H:5])([H:6])[C:2]#[N:1]'}

For consistency I think we should continue to put it in extras. But we also should presumably put it in identifiers, since that's now the recommended place for it?
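
In that case the entry creation in the submission script would set both fields, e.g. (a sketch reusing names from the script above):

# Store the mapped SMILES both in identifiers (the recommended location)
# and in extras (for consistency with the v1 datasets).
ids = {'canonical_isomeric_explicit_hydrogen_mapped_smiles': smiles}
molecule = Molecule(symbols=symbols, geometry=conformation.flatten(),
                    molecular_charge=total_charge,
                    identifiers=ids, extras=ids)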

I'm not clear on how to connect up records and entries. I try to loop over everything in the dataset like this:

for s in ds.specification_names:
    for e in ds.iterate_entries():
        print(s, e.name)
        print(ds.get_record(e.name, s))

But get_record() never finds anything:

spec_1 cc#n-3
None
spec_1 cnc-2
None
spec_1 nccco-16
None

Am I not calling it correctly? The API documentation says the first argument is the entry name and the second is the specification name.

loriab commented 9 months ago

It's because psi4 pins the version

of course it does: bad, self. I'll step up the v1.8.2 release that releases the pin. I haven't heard other reports of the mac psi4 being broken, though.

bennybp commented 9 months ago

For consistency I think we should continue to put it in extras. But we also should presumably put it in identifiers, since that's now the recommended place for it?

Either is ok, and can be added/modified later (although right now extras on molecules are not modifiable; that is on my to-do list).

Am I not calling it correctly? The API documentation says the first argument is the entry name and the second is the specification name.

You are calling it correctly, but something is up with that dataset. There are no calculations submitted for spec_1, only for spec_2, spec_4, and spec_6.

r = ds.get_record('cc#n-3', 'spec_2')
print(r.id, r.status)
111567635 RecordStatusEnum.complete

peastman commented 9 months ago

This must be something to do with how specifications were created when the data was converted to the new format? There are actually six specifications. Here are their descriptions:

name='spec_1' specification=QCSpecification(program='psi4', driver=<SinglepointDriver.gradient: 'gradient'>, method='b3lyp', basis='dzvp', keywords={'maxiter': 200, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'mbis_charges']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

name='spec_2' specification=QCSpecification(program='psi4', driver=<SinglepointDriver.gradient: 'gradient'>, method='b3lyp', basis='dzvp', keywords={'maxiter': 200, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'mbis_charges']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

name='spec_3' specification=QCSpecification(program='psi4', driver=<SinglepointDriver.gradient: 'gradient'>, method='wb97m-d3bj', basis='def2-tzvppd', keywords={'maxiter': 200, 'wcombine': False, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'mbis_charges']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

name='spec_4' specification=QCSpecification(program='psi4', driver=<SinglepointDriver.gradient: 'gradient'>, method='wb97m-d3bj', basis='def2-tzvppd', keywords={'maxiter': 200, 'wcombine': False, 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'mbis_charges']}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

name='spec_5' specification=QCSpecification(program='dftd3', driver=<SinglepointDriver.gradient: 'gradient'>, method='b3lyp-d3bj', basis=None, keywords={}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.orbitals_and_eigenvalues: 'orbitals_and_eigenvalues'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

name='spec_6' specification=QCSpecification(program='dftd3', driver=<SinglepointDriver.gradient: 'gradient'>, method='b3lyp-d3bj', basis=None, keywords={}, protocols=AtomicResultProtocols(wavefunction=<WavefunctionProtocolEnum.none: 'none'>, stdout=True, error_correction=ErrorCorrectionProtocol(default_policy=True, policies=None), native_files=<NativeFilesProtocolEnum.none: 'none'>)) description=''

They come in pairs that are identical except for the value of wavefunction. One (which has no records) has it set to orbitals_and_eigenvalues, while the other (which has records for all samples) has it set to none.

spec_2 and spec_4 are the OpenFF and SPICE levels of theory, respectively. Those were the only two specifications I expected to be present. I have no idea where spec_6 came from. It has program='dftd3'???

@loriab here is the exception I get when running psi4 on the Mac. I assume it's unrelated to the problem I'm seeing on Linux, where the record status is reported as error with no error message.

Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/qcportal/bin/psi4", line 213, in <module>
    import psi4  # isort:skip
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/psi4/__init__.py", line 90, in <module>
    from .driver import endorsed_plugins
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/psi4/driver/__init__.py", line 56, in <module>
    from psi4.driver import gaussian_n
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/psi4/driver/gaussian_n.py", line 31, in <module>
    from psi4.driver import driver
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/psi4/driver/driver.py", line 49, in <module>
    from psi4.driver import driver_nbody
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/psi4/driver/driver_nbody.py", line 830, in <module>
    class ManyBodyComputer(BaseComputer):
  File "pydantic/main.py", line 204, in pydantic.main.ModelMetaclass.__new__
  File "pydantic/fields.py", line 488, in pydantic.fields.ModelField.infer
  File "pydantic/fields.py", line 419, in pydantic.fields.ModelField.__init__
  File "pydantic/fields.py", line 534, in pydantic.fields.ModelField.prepare
  File "pydantic/fields.py", line 728, in pydantic.fields.ModelField._type_analysis
  File "pydantic/fields.py", line 778, in pydantic.fields.ModelField._create_sub_type
  File "pydantic/fields.py", line 419, in pydantic.fields.ModelField.__init__
  File "pydantic/fields.py", line 534, in pydantic.fields.ModelField.prepare
  File "pydantic/fields.py", line 728, in pydantic.fields.ModelField._type_analysis
  File "pydantic/fields.py", line 778, in pydantic.fields.ModelField._create_sub_type
  File "pydantic/fields.py", line 419, in pydantic.fields.ModelField.__init__
  File "pydantic/fields.py", line 534, in pydantic.fields.ModelField.prepare
  File "pydantic/fields.py", line 633, in pydantic.fields.ModelField._type_analysis
  File "pydantic/fields.py", line 778, in pydantic.fields.ModelField._create_sub_type
  File "pydantic/fields.py", line 419, in pydantic.fields.ModelField.__init__
  File "pydantic/fields.py", line 534, in pydantic.fields.ModelField.prepare
  File "pydantic/fields.py", line 638, in pydantic.fields.ModelField._type_analysis
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/typing.py", line 851, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class
dotsdl commented 9 months ago

This must be something to do with how specifications were created when the data was converted to the new format? There are actually six specifications. Here are their descriptions:

[... six specification dumps quoted from the previous comment ...]

They come in pairs that are identical except for the value of wavefunction. One (which has no records) has it set to orbitals_and_eigenvalues, while the other (which has records for all samples) has it set to none.

@peastman it's not clear to me where you're seeing this. Can you show us what you are running that gives these specs?

I recall that when we first ran SPICE, we used specs that preserved wavefunctions ('orbitals_and_eigenvalues'), and this ended up resulting in a massive amount of data filling up the old server. We later chose to run the same specs without preserving wavefunctions (None), and those calculations we kept.

peastman commented 9 months ago

I generated that output with this code:

client = PortalClient('https://ml.qcarchive.molssi.org')
ds = client.get_dataset('singlepoint', 'SPICE DES Monomers Single Points Dataset v1.1')
for s in ds.specifications:
    print(ds.specifications[s])
    print()

peastman commented 8 months ago

@bennybp I have my script working when using a snowflake. Now I'm trying to test it on the real server by resubmitting the DES monomers dataset. That's one of the smaller ones: 374 molecules, all of them very small, 18,700 conformations total. It initially seems to be working, but creating the entries is really slow, about five per second. At that rate, the larger datasets would take over a day to submit. But after the first 100 molecules (about 20 minutes), it crashes with an exception. I tried twice and got the same exception both times.

Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='ml.qcarchive.molssi.org', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/peastman/workspace/spice-dataset/submission/submit.py", line 33, in <module>
    dataset.add_entry(f'{group}-{i}', molecule)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/singlepoint/dataset_models.py", line 120, in add_entry
    return self.add_entries(ent)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/singlepoint/dataset_models.py", line 88, in add_entries
    ret = self._client.make_request(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 358, in make_request
    r = self._request(method, endpoint, body=serialized_body, url_params=parsed_url_params)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 297, in _request
    r = self._req_session.send(prep_req, verify=self._verify, timeout=self._timeout)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='ml.qcarchive.molssi.org', port=443): Read timed out. (read timeout=60)

bennybp commented 8 months ago

I do see similar behavior (although not quite as bad on my end). I think there's some inefficiency in the add_entry code on the server that only shows up with larger datasets. I am investigating.

An alternative which will likely be much faster is to use the bulk add_entries function. It's a little clunky, but let me see if I can polish it up quick.

peastman commented 8 months ago

Thanks! I didn't realize there was a bulk version. I'll try that.

The above behavior was from running the script at home. I managed to work around it by running on a cluster with a faster internet connection. It was still slow, but at least it didn't time out.

When I resubmitted the DES monomers dataset, it didn't recognize any of the records as duplicates of existing ones. I let it rerun the whole dataset, and the results agree well with the existing ones.

peastman commented 8 months ago

Much better! It successfully submitted the whole dataset in just over a minute. And it recognized all the records as duplicates of the ones computed yesterday.

peastman commented 8 months ago

The script is in #85. I think this means we can finally start running calculations!

peastman commented 8 months ago

When I tried to submit a larger dataset (the PubChem boron silicon set, 174,450 conformations total) it still failed even with the bulk creation. Running from home it fails with the error

Traceback (most recent call last):
  File "/Users/peastman/workspace/spice-dataset/submission/submit.py", line 34, in <module>
    dataset.add_entries(entries)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/singlepoint/dataset_models.py", line 88, in add_entries
    ret = self._client.make_request(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 358, in make_request
    r = self._request(method, endpoint, body=serialized_body, url_params=parsed_url_params)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 311, in _request
    return self._request(method, endpoint, body=body, url_params=url_params, retry=False)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 323, in _request
    raise PortalRequestError(f"Request failed: {details['msg']}", r.status_code, details)
qcportal.client_base.PortalRequestError: Request failed: Token has expired (HTTP status 401)

Running on the cluster it gets a different error:

Traceback (most recent call last):
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/http/client.py", line 1378, in getresponse
    response.begin()
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/ssl.py", line 1311, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/ssl.py", line 1167, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connectionpool.py", line 538, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='ml.qcarchive.molssi.org', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/groups/tem26/peastman/workspace/spice-dataset/submission/submit.py", line 34, in <module>
    dataset.add_entries(entries)
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/singlepoint/dataset_models.py", line 88, in add_entries
    ret = self._client.make_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/client_base.py", line 358, in make_request
    r = self._request(method, endpoint, body=serialized_body, url_params=parsed_url_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/client_base.py", line 297, in _request
    r = self._req_session.send(prep_req, verify=self._verify, timeout=self._timeout)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='ml.qcarchive.molssi.org', port=443): Read timed out. (read timeout=60)

I tried twice on each computer and got the same result both times.

peastman commented 8 months ago

I think it's because qcportal hardcodes the timeout to 60 seconds:

https://github.com/MolSSI/QCPortal/blob/7966f3fdbca84251d5e07bb652af9db1a08ded1f/qcportal/client_base.py#L90

The server is taking longer than 60 seconds to reply, causing it to fail.

bennybp commented 8 months ago

There is a time limit on the client, but that is changeable. I should make it not a private variable, but you can set client._timeout = 120. The server also has a timeout, though, which is around 1-2 minutes, so it might not help.

I did make some changes to the server to make it faster, but that needs a new release (tentatively later this week). I will also add automatic batching to add_entries with a progress bar, which would hopefully stop this particular error once and for all. ds.submit is another story, though.

jchodera commented 8 months ago

@bennybp: What's the plan to deal with the server timeout? Will that be substantially increased? Can calculations be split into separate requests that append?

Even if the client timeout is increased and the server is sped up a bit, it seems we are still ultimately going to run into that timeout.

peastman commented 8 months ago

I tried setting client._timeout = 120, but I still got the same error.

peastman commented 8 months ago

I tried again, this time setting client._timeout = None to completely disable the client side timeout. Still the same result.

The real problem may be something different. The error message is Token has expired (HTTP status 401). 401 is an authentication failure. I see that _request() calls _refresh_JWT_token() to try to automatically renew the token. I gather that isn't working. I don't know how the token lifetime is set.

bennybp commented 8 months ago

@bennybp: What's the plan to deal with the server timeout? Will that be substantially increased? Can calculations be split into separate requests that append?

Two pieces to the plan. Short term is to do batching client side. This can be done already. This is from memory:

from qcportal.utils import chunk_iterable

# Submit in batches of 1000 entries so each request stays well under the timeout.
for entry_batch in chunk_iterable(new_entries, 1000):
    ds.add_entries(entry_batch)

I want to implement this automatically in the client, with progress bars.

Long term we could exploit the server-side job queue for this, but that will take some time.

peastman commented 8 months ago

Thanks, that worked!

peastman commented 8 months ago

The scripts are in #85.