openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Coordinate running calculations #11

Closed (peastman closed this issue 2 years ago)

peastman commented 3 years ago

Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?

The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).

pavankum commented 2 years ago

But I'm not sure that's really true.

Yeah, I agree, digging into the errors, will report back.

jchodera commented 2 years ago

@peastman: The MSK computing resources are heavily underutilized right now because of an issue @dotsdl identified where running more jobs causes havoc for other users by not freeing up scratch space when done. @dotsdl is nearly finished addressing this issue and will then be able to scale up further on the MSK resources.

peastman commented 2 years ago

It looks like Sherlock will let me use eight nodes at a time. Additional jobs beyond that remain pending in the queue with the status "QOSMaxCpuPerUserLimit". That should let it complete about 1000 of the PubChem calculations per day. Once we get to the DES370K dimers, they should go faster since they're smaller.

pavankum commented 2 years ago

I checked the errors on PRP with the dipeptide single points set (259), and most of them are similar to an old (since fixed) bug with py-cpuinfo; I wonder why it is popping up again. I think @dotsdl noticed this before. @dotsdl, when you have a moment can you please check the pods on PRP and whether we need to pin the version of py-cpuinfo?

Edit: Hmm, from the traceback it looks like neither the 'brand_raw' nor the 'brand' keyword check works for this particular compute node:

{"error_type": "unknown_error", "error_message": "QCEngine Unknown Error: Traceback (most recent call last):
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 53, in get_global
    _global_values["cpu_brand"] = _global_values["cpuinfo"]["brand_raw"]
KeyError: 'brand_raw'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 407, in run_qcschema
    ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 554, in run_json_qcschema
    val, wfn = methods_dict_[json_data["driver"]](method, **kwargs)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/driver.py", line 739, in gradient
    wfn = procedures['gradient'][lowername](lowername, molecule=molecule, **kwargs)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2485, in run_scf_gradient
    ref_wfn = run_scf(name, **kwargs)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2390, in run_scf
    scf_wfn = scf_helper(name, post_scf=False, **kwargs)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 1582, in scf_helper
    disp_energy = scf_wfn._disp_functor.compute_energy(scf_wfn.molecule(), scf_wfn)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/empirical_dispersion.py", line 215, in compute_energy
    local_options={"scratch_directory": core.IOManager.shared_object().get_default_path()})
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/compute.py", line 83, in compute
    config = get_config(local_options=local_options)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 282, in get_config
    node = get_node_descriptor(hostname)
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 238, in get_node_descriptor
    hostname = get_global("hostname")
  File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 56, in get_global
    _global_values["cpu_brand"] = _global_values["cpuinfo"]["brand"]
KeyError: 'brand'
", "extras": null}
dotsdl commented 2 years ago

Hey all, I made a huge mistake this week. In submitting pubchem sets 2 - 5 through local infrastructure (our GitHub Action times out after 6 hours; this is our current workaround for such large sets), I failed to pull the latest master of qca-dataset-submission, which included the submittable dataset.json.bz2s without orbitals_and_eigenvalues set for wavefunction storage. This means all records corresponding to these datasets will carry wavefunctions, and this will be too much for the current storage solution on public QCArchive.

I have parked the computations for these submissions under the openff-defunct compute tag; this will keep them from being computed for now while @pavankum, @bennybp, and I devise a solution early next week. It may be possible to delete these records directly in the DB, allowing us to re-submit with wavefunction: None, but we will have to game this and other options out before we proceed.

Serious apologies for this; it was a stupid oversight on my part in trying to get these submissions through. We do have other sets for compute to work on while we solve this problem, and we will try to minimize time lost.

dotsdl commented 2 years ago

RE: cpuinfo, I've created a PR against QCEngine to address this: https://github.com/MolSSI/QCEngine/pull/339

From our PRP pod logs, there appear to be multiple nodes on which this problem can arise. I will try to mark each of these with anti-affinity to avoid pods landing there, but this is more of a compensating control than anything until we get a fix like the above deployed in QCEngine.

pavankum commented 2 years ago
Progress as of 2022-01-18 13:03 UTC

| Dataset name | Complete | Remaining | Comments |
| --- | --- | --- | --- |
| SPICE DES Monomers Single Points Dataset v1.1 | 18700 | - | COMPLETE |
| SPICE Solvated Amino Acids Single Points Dataset v1.1 | 1300 | - | COMPLETE |
| SPICE Dipeptides Single Points Dataset v1.1 | 19746 | 14104 | Running now - 58% done |
| SPICE Dipeptides Optimization Dataset v1.0 | 610 | 32717 | Low priority full optimization set |
| SPICE PubChem Set 1 Single Points Dataset v1.1 | 50954 | 67652 | Running now - 43% done |
| SPICE PubChem Set 2 Single Points Dataset v1.0 | 11165 | 110375 | On hold - sorting out storage issue |
| SPICE PubChem Set 3 Single Points Dataset v1.0 | 3642 | 118584 | On hold - sorting out storage issue |
| SPICE PubChem Set 4 Single Points Dataset v1.0 | 474 | 122276 | On hold - sorting out storage issue |
| SPICE PubChem Set 5 Single Points Dataset v1.0 | 652 | 122498 | On hold - sorting out storage issue |
| SPICE PubChem Set 6 Single Points Dataset v1.0 | 0 | 123800 | In queue |
| SPICE DES370K Single Points Dataset v1.0 | 0 | 345682 | In queue |

@dotsdl is working with Ben from QCA to resolve the pubchem sets 2-5 submission issue, and as of now we still have a lot of work for the compute nodes to chew on, so this won't hinder any progress. Also, @dotsdl is coordinating with the QCEngine team to get in his fix for pre-emptible manager jobs leaving behind temp files, and there will be a minor release as soon as the PR gets reviewed.

peastman commented 2 years ago

Thanks! By my count, it completed 43,783 calculations in the last week. A lot of those were dipeptides, which on average are a bit larger than the PubChem molecules. Once it finishes those, I estimate it should be able to do about 60,000 per week. At that rate, it will take about three months to get through all the PubChem molecules. That leaves the DES370K dimers, but they're much smaller so they should go very quickly.
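
To make that back-of-the-envelope estimate explicit (using the approximate PubChem "Remaining" counts from the progress table above and the assumed rate of 60,000 per week):

    # Rough ETA; the remaining count is the approximate sum of the PubChem
    # "Remaining" column above, so treat the result as a ballpark figure only.
    pubchem_remaining = 67_652 + 110_375 + 118_584 + 122_276 + 122_498 + 123_800
    rate_per_week = 60_000
    print(f"~{pubchem_remaining / rate_per_week:.0f} weeks")  # ~11 weeks, i.e. about three months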

That's just to give us a very vague idea of where we are. We can update those estimates as we get more data in the coming weeks.

dotsdl commented 2 years ago

An update: despite our efforts to mitigate storage increases, @bennybp is observing ~30 GB/d increase in utilization on public QCArchive. At this rate MolSSI will run out of storage in ~2 weeks. As an additional mitigation step, we have shut off OpenFF workers executing openff-spice; this allows us to continue executing datasets critical to FF fitting while we work on near-term solutions to the storage issue.

@peastman you may continue running managers on your resources so we still make forward progress here. I don't think execution rates are high enough to add significantly to the problem on the receiving end.

peastman commented 2 years ago

What about the ones that weren't set to store wavefunctions? Could we start the DES370K dimers going?

dotsdl commented 2 years ago

We should be able to start DES370K without wavefunctions. We can start pubchem set 6 without wavefunctions too.

I also realized that we were executing SPICE Dipeptides Single Points Dataset v1.1 (https://github.com/openforcefield/qca-dataset-submission/pull/259) under the default openff compute tag, and that this is also storing wavefunctions. I have changed the compute tag to openff-defunct to park this for now as well.

peastman commented 2 years ago

Thanks! DES370K is much higher priority than pubchem 6.

peastman commented 2 years ago

Also note that we've hardly run any calculations on PubChem 4 and 5, so if it's easier to just delete them and start over, we won't be losing much. And the amount done on PubChem 2 and 3 isn't that much more.

jchodera commented 2 years ago

@dotsdl : I'm a bit confused here. Storing wavefunctions seems physically incompatible with QCArchive's current hardware situation. The only solution is to halt all datasets that are storing wavefunctions, re-create them without storing wavefunctions, and purge the datasets with wavefunctions from the QCArchive database directly. Is that the current plan?

dotsdl commented 2 years ago

@jchodera we have halted all datasets that are storing wavefunctions, and I got confirmation today from @bennybp that the growth rate is no longer alarming.

Our next step is to submit datasets that we have not yet submitted, but without wavefunctions. We have:

As for pubchem sets 1 - 5 as well as SPICE Dipeptides Single Points Dataset v1.1, which we have submitted with wavefunctions attached, we have parked these for now with the openff-defunct compute tag. We cannot simply resubmit these with the existing dataset.json.bz2 submission artifacts, even with changing store_wavefunction from orbitals_and_eigenvalues to none, as the destination wavefunction key on a single point record is part of the protocols object, and is not included in the deduplication hashing. The server will view the submission as unchanged, and will not create new records with the wavefunction: none protocol we desire here. We also cannot yet delete records and guarantee internal consistency in the DB.
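
For illustration, here is roughly where that protocols setting lives on a single-point task in the QCSchema models (a sketch using qcelemental, with the SPICE level of theory, ωB97M-D3(BJ)/def2-TZVPPD, filled in; this is not the submission pipeline itself):

    from qcelemental.models import AtomicInput, Molecule

    mol = Molecule(symbols=["He"], geometry=[0.0, 0.0, 0.0])
    task = AtomicInput(
        molecule=mol,
        driver="gradient",
        model={"method": "wb97m-d3bj", "basis": "def2-tzvppd"},
        # This is the wavefunction protocol the defunct submissions had set to
        # "orbitals_and_eigenvalues"; it is not part of the deduplication hash.
        protocols={"wavefunction": "none"},
    )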

One way around the deduplication machinery is to translate each molecule by e.g. 1 angstrom, and we can do this if necessary, but we can buy some time with the two submissions above and the still-executing SPICE Dipeptides Optimization Dataset v1.0. The cleaner approach in my view would be to attempt record deletion with the upcoming new server deployment, which does support deletion, then resubmit with the existing artifacts.
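
For the translation workaround, a minimal sketch (assuming qcelemental Molecule objects, with fix_com/fix_orientation set so the constructor does not re-center and undo the shift) might look like:

    import numpy as np
    from qcelemental.models import Molecule

    ANGSTROM_TO_BOHR = 1.88972612463  # QCElemental geometries are stored in Bohr

    def translate_molecule(mol: Molecule, shift_angstrom: float = 1.0) -> Molecule:
        """Return a copy of mol rigidly shifted along x, so its geometry hash changes."""
        geom = np.array(mol.geometry, dtype=float).reshape(-1, 3)
        geom[:, 0] += shift_angstrom * ANGSTROM_TO_BOHR
        data = mol.dict()
        data["geometry"] = geom
        data.pop("id", None)
        data.pop("identifiers", None)  # drop stale hashes tied to the old geometry
        # Keep the shifted coordinates exactly as given instead of re-centering.
        data["fix_com"] = True
        data["fix_orientation"] = True
        return Molecule(**data)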

I recognize there is a lot packed in the above. Happy to discuss live or elaborate further if needed. We are trying to push forward where we can without exacerbating the storage issues on public QCA further.

jchodera commented 2 years ago

@dotsdl : Maybe this is a good time to decouple the infrastructure issues from what the highest-value science goals are, since we're now five months into the project with almost no usable data to show for it.

@peastman : Can you rank all the datasets in order of importance? My guess at priority (and rationale) is:

@dotsdl : I suggest writing off anything that's already been run as dead, excising it in the future when supported by QCArchive, and restarting the new datasets in order of priority (with the 1A shift hack) with settings that will not kill the QCArchive infrastructure.

Sorry for the odyssey in pain, folks. There are always growing pains in scaling any new technology.

peastman commented 2 years ago

The DES monomers and the solvated amino acids are already done, so there's no reason to rerun them. Here is how I would rank the others.

  1. DES370 dimers - this is where most of our information about nonbonded interactions comes from
  2. Dipeptides single points - essential for anything involving proteins
  3. PubChem, in order - they're sorted so the earliest ones have the most diversity
  4. Dipeptides optimization - I don't have any use for this, and I don't believe it will contribute significant information that isn't in the other datasets

peastman commented 2 years ago

Just as a note, in the last few days some crazy expensive calculations have started appearing. I have nodes that have been running for nearly 24 hours and only completed 3 tasks. I don't believe anything in SPICE should take anywhere close to that long. Maybe they're from some other project? But my nodes are supposed to be configured to prioritize SPICE calculations over other ones. It would be really helpful if there were some way I could determine what was going on.

pavankum commented 2 years ago

Ahh, I might be wrong, please wait for @dotsdl to take a look

pavankum commented 2 years ago

My guess is that some of the low-priority tasks from the dipeptide optimization set made it into the queue. I'm checking the managers on completed jobs; meanwhile I've re-tagged the set so it doesn't send any jobs to any queue.

dotsdl commented 2 years ago

Thank you @jchodera and @peastman. I will prepare new versions of the following (all without wavefunctions, with molecule coordinates translated), and submit in order:

SPICE DES370K Single Points Dataset v1.0 is computing now, and this is the highest priority in @peastman's list.

@peastman are you running your managers with the --verbose flag? If so, you should see the full task specs in the logs for your managers. These won't tell you what dataset the task is from, but if the content includes procedure: geometric, it is an optimization rather than a single-point calculation.
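
As a rough way to check, something like this could scan a verbose manager log for those specs (the log file name here is a placeholder; use whatever path your manager actually writes to):

    # Count log lines that look like optimization (geomeTRIC-driven) task specs.
    n_opt = 0
    with open("qcfractal-manager.log") as log:  # placeholder path
        for line in log:
            if "procedure" in line and "geometric" in line.lower():
                n_opt += 1
    print(f"{n_opt} lines mention a geometric procedure")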

pavankum commented 2 years ago

@dotsdl SPICE Dipeptides Single Points Dataset is a small one with 33K calculations and it is 60% done, do you think the storage space we get back from resubmitting pubchem set 1 (and removing current version with wfns) would be more than enough to compensate for the remaining calculations on this set? We can save some compute time if we do so.

dotsdl commented 2 years ago

@pavankum since it's small I'd actually rather proceed with a v1.2 resubmission and halt all compute on the v1.1 submission for the Dipeptides Single Point Set. It appears wavefunction storage really does grow space utilization fast with this large basis set, and I'd rather we eliminate it entirely from the sets above. We also don't plan on attempting deletion soon, since it technically isn't supported by the server code and would in effect be direct DB surgery (with potentially unintended negative consequences) until we have the new QCA server code deployed.

A hard halt on wavefunction storage where we don't need them is the approach I'd like to take here. This saves us the space for cases where we do want it, such as OpenFF datasets for ESP work.

dotsdl commented 2 years ago

@pavankum SPICE Dipeptides Single Points Dataset v1.2 is ready for review! Please do examine it with suspicion; feel free to squash and merge when satisfied!

dotsdl commented 2 years ago

All PRs to openforcefield/qca-dataset-submission for the above have been created. Submission Action is running now for SPICE Dipeptides Single Points Dataset v1.2 and SPICE PubChem Set 1 Single Points Dataset v1.2.

peastman commented 2 years ago

Thanks!

peastman commented 2 years ago

It's been about two weeks since we restarted the computations without wavefunctions. This seems like a good time for another look at how we're doing overall.

The DES370K dimers are about 99% done. At this point it's cycling errors. For whatever reason, some of these structures seem to be very difficult. I see error rates over 50% on some nodes. I don't know if there's anything we can do about that.

The dipeptides are about half done, completing about 2000 computations per day. At that rate, it should take about another week to complete all of them.

Based on experience the first time we ran them, the PubChem molecules tend to go about twice as fast as the dipeptides, though with a higher error rate (around 10%, compared to no errors at all for the dipeptides). If we continue at the current rate, it will probably take about one month each for the PubChem sets, or about six months to get through all of them. We can start experimenting with fitting models before all of them are done, though.

jchodera commented 2 years ago

@dotsdl are we at a point where we can scale up the lilac workers without taking out all the scratch directories? If so, would be great to increase throughput.

dotsdl commented 2 years ago

@jchodera We're almost there. @bennybp and I have prepared QCFractal release 0.15.8.1, and @bennybp is deploying this to public QCA. Apologies, but an outage in conda-forge's Azure build infrastructure delayed the conda package by several days.

I am putting together new prod environments for managers now, and will deploy workers with them to PRP and Lilac once I have word from Ben that public QCA is upgraded. This new release should allow us to fully utilize Lilac and PRP, as well as other resources, for increased throughput.

dotsdl commented 2 years ago

I've deployed QCFractal 0.15.8.1 managers to Lilac and PRP, and have upped the deployment count for each substantially. I haven't observed indications of the issues we had previously, so assuming we get decent allocations of jobs I expect we'll get higher throughput this week.

peastman commented 2 years ago

That's great news, thanks!

peastman commented 2 years ago

I'm not seeing any sign of the extra compute resources being used. It only seems to be completing about 1000 tasks per day.

jchodera commented 2 years ago

We're up to 55 x 16 thread-slots = 880 thread-slots on lilac now.

jchodera commented 2 years ago

I see 1329 calculations completed over the last 24 hours, going by the difference reported on https://github.com/openforcefield/qca-dataset-submission/pull/269.

peastman commented 2 years ago

It has typically been getting through about 2000 per day, so it's down rather than up. I haven't been able to get as many nodes on Sherlock the last few days, which accounts for the decrease. But there's no sign of any new resources being added.

dotsdl commented 2 years ago

I'm not currently getting any pods on PRP despite my requested resources there, and I do have Lilac prioritized for OpenFF biopolymer work over SPICE. We are running with the cpuqueue mostly on Lilac, which keeps each execution limited to no more than 6 hours. Do we anticipate many of these calculations taking longer than that? If so I can switch off cpuqueue requests and stick to preemptible there, which have a 157 hour limit.

peastman commented 2 years ago

That depends on the hardware. On Sherlock it's averaging about 6 core-hours per calculation. Since each node has 32 cores, the wall-clock time for each calculation is far less than that.

jchodera commented 2 years ago

You can still submit cpuqueue jobs with up to 72 hour wall clock limits---it will just exclude some nodes with 6-hour wall clock limits.

dotsdl commented 2 years ago

Ah cool, thanks @jchodera! Do you think upping to the 72-hour wall clock limit for cpuqueue would significantly reduce our footprint of workers on Lilac, with preemptible jobs in the queue as well?

jchodera commented 2 years ago

Do you think upping to the 72-hour wall clock limit for cpuqueue would significantly reduce our footprint of workers on Lilac, with preemptible jobs in the queue as well?

If you can check the logs of the 6h jobs and see how much time is wasted by early termination, that would help us make a rational decision here!

dotsdl commented 2 years ago

Looking at the logs I think we're fine; I see many tasks completed (up to ~50) with largely 100% success rates across our cpuqueue workers. I don't think we're losing much with a 6-hour limit, or at least it's not apparent at this time.

peastman commented 2 years ago

Since PubChem set 1 was submitted four days ago, it has averaged just over 2000 calculations per day. Compare that to the 5000 per day it was averaging in mid-January, and that included dipeptides, which were larger than the PubChem molecules. We still seem to be getting very little compute from most of the clusters.

jchodera commented 2 years ago

@dotsdl : Can we fully dedicate lilac to the OpenMM calculations? Also, any luck with the HPC folks in helping optimize the lilac deployment? Let me know if you need help with that.

jchodera commented 2 years ago

@dotsdl : It looks like you're also asking for one core per thread-slot, which means that we're wasting hyperthreads:

affinity[core(1)*1:distribute=pack]

Was this left over from my previous attempts (which was a mistake!) or the result of new profiling?

If we don't do this, we can fit 2x as many workers on the nodes at once.

If this is not the result of profiling, maybe we could try removing the whole affinity[core(1)*1:distribute=pack] block and see if that helps throughput?

dotsdl commented 2 years ago

@jchodera I have not worked with the Lilac HPC folks on any profiling of these workloads. Part of the challenge is the wide range of memory requirements that can come down the pipe, so we have chosen a memory configuration that is reasonably high (providing 70 GiB to psi4 execution) and a number of cores that gives us reasonable coverage of nodes based on bhosts output. If there are ways to optimize our usage of Lilac further, I'm happy to work with you on it.

Over this last week we have de-prioritized SPICE workloads to give more time to ESP optimizations and protein capped 1-mer sidechain torsiondrives across all OpenFF compute resources, as these are key for FF development right now at OpenFF. I will ask on OpenFF's #qcfractal-compute channel to shift priority on Lilac to SPICE, however.

dotsdl commented 2 years ago

@jchodera on hyperthreading, I wasn't aware that we were wasting hyperthreads, but I'm also not sure how well psi4 works with hyperthreading (as in, depending on what it's doing it may perform less well); @pavankum, do you have insights?

I did not add the affinity... components; I inherited these from the original submission scripts.

pavankum commented 2 years ago

@dotsdl hyperthreading is supported by psi4, and I think it is mostly handled by MKL subroutines, but switching it off is recommended for improved performance; here is one post where Holger Kruse comments on that. I think your current Lilac deployment is good, but you can test it with and without the options John specified to confirm.
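
If it helps with that test, a minimal way to pin psi4's thread counts when benchmarking a single node might be the following (the values are placeholders, not the production Lilac settings, and MKL_NUM_THREADS is ideally exported before Python starts):

    import os
    import psi4

    os.environ["MKL_NUM_THREADS"] = "16"  # cap MKL at the physical-core count
    psi4.set_num_threads(16)              # psi4's own thread pool
    psi4.set_memory("60 GB")              # match whatever the worker is allotted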

peastman commented 2 years ago

I'm trying to inspect the progress so far by using the code from the notebook posted above.

import qcportal as ptl

fc = ptl.FractalClient()  # the public QCArchive server
ds = fc.get_collection('Dataset', 'SPICE PubChem Set 1 Single Points Dataset v1.2')
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])

That returns 118606 records, which is correct. But every one of them has its status set to 'INCOMPLETE'. Is that not the correct way to find tasks that have been finished? I also searched for the ID of a task from the Sherlock logs, and it isn't present. Does that mean it's actually working on some other dataset?

pavankum commented 2 years ago

But every one of them has its status set to 'INCOMPLETE'.

I just checked it, and for this SPICE set I see:

COMPLETE 9621
ERROR 17
INCOMPLETE 108968
NaN 0
RUNNING 0
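
For reference, counts like these can be reproduced from the recs DataFrame returned earlier with something like the following (a sketch assuming, as in legacy qcportal, that the record objects live in a 'record' column):

    from collections import Counter

    import pandas as pd

    # Tally record statuses; missing records show up as NaN entries in the frame.
    statuses = Counter(
        "NaN" if pd.isna(rec) else str(rec.status)
        for rec in recs["record"]
    )
    print(statuses)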

I also searched for the ID of a task from the Sherlock logs, and it isn't present. Does that mean it's actually working on some other dataset?

Task ids are different from record ids so you may not see it here.

@dotsdl bumped the priority on Lilac, so we should see improved throughput this week; apologies for the slow pace.

peastman commented 2 years ago

Task ids are different from record ids so you may not see it here.

How can I determine what dataset a task ID is for?