But I'm not sure that's really true.
Yeah, I agree, digging into the errors, will report back.
@peastman: The MSK computing resources are heavily underutilized right now because running more jobs there causes havoc for other users by not freeing up scratch space when done. @dotsdl is nearly finished addressing this issue and will then be able to scale up further on the MSK resources.
It looks like Sherlock will let me use eight nodes at a time. Additional jobs beyond that remain pending in the queue with the status "QOSMaxCpuPerUserLimit". That should let it complete about 1000 of the PubChem calculations per day. Once we get to the DES370K dimers, they should go faster since they're smaller.
I checked the errors on PRP for the dipeptide single points set (#259), and most of them are similar to an old (already fixed) bug with py-cpuinfo. I wonder why it is popping up again; I think @dotsdl noticed this before. @dotsdl, when you have a moment can you please check the pods on PRP and whether we need to pin the version of py-cpuinfo?
Edit: Hmm, from the traceback it looks like neither the 'brand_raw' nor the 'brand' keyword check works for this particular compute node:
{"error_type": "unknown_error", "error_message": "QCEngine Unknown Error: Traceback (most recent call last):
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 53, in get_global
_global_values["cpu_brand"] = _global_values["cpuinfo"]["brand_raw"]
KeyError: 'brand_raw'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 407, in run_qcschema
ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/schema_wrapper.py", line 554, in run_json_qcschema
val, wfn = methods_dict_[json_data["driver"]](method, **kwargs)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/driver.py", line 739, in gradient
wfn = procedures['gradient'][lowername](lowername, molecule=molecule, **kwargs)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2485, in run_scf_gradient
ref_wfn = run_scf(name, **kwargs)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 2390, in run_scf
scf_wfn = scf_helper(name, post_scf=False, **kwargs)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/proc.py", line 1582, in scf_helper
disp_energy = scf_wfn._disp_functor.compute_energy(scf_wfn.molecule(), scf_wfn)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/psi4/driver/procrouting/empirical_dispersion.py", line 215, in compute_energy
local_options={"scratch_directory": core.IOManager.shared_object().get_default_path()})
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/compute.py", line 83, in compute
config = get_config(local_options=local_options)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 282, in get_config
node = get_node_descriptor(hostname)
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 238, in get_node_descriptor
hostname = get_global("hostname")
File "/opt/conda/envs/qcfractal/lib//python3.7/site-packages/qcengine/config.py", line 56, in get_global
_global_values["cpu_brand"] = _global_values["cpuinfo"]["brand"]
KeyError: 'brand'
", "extras": null}
Hey all, I made a huge mistake this week. In submitting pubchem sets 2 - 5 through local infrastructure (our GitHub Action times out after 6 hours; this is our current workaround for such large sets), I failed to pull the latest master of qca-dataset-submission, which included the submittable dataset.json.bz2 files without orbitals_and_eigenvalues set for wavefunction storage. This means all records corresponding to these datasets will carry wavefunctions, and this will be too much for the current storage solution on public QCArchive.
I have parked the computations for these submissions under the openff-defunct compute tag; this will keep them from being computed for now while @pavankum, @bennybp, and I devise a solution early next week. It may be possible to delete these records directly in the DB, allowing us to re-submit with wavefunction: None, but we will have to game this and other options out before we proceed.
Serious apologies for this; it was a stupid oversight on my part in trying to get these submissions through. We do have other sets for compute to work on while we solve this problem, and we will try to minimize time lost.
RE: cpuinfo, I've created a PR against QCEngine to address this: https://github.com/MolSSI/QCEngine/pull/339
From our PRP pod logs, there appear to be multiple nodes on which this problem can arise. I will try to mark each of these with anti-affinity to avoid pods landing there, but this is more of a compensating control than anything until we get a fix like the above deployed in QCEngine.
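For reference, here is a minimal sketch of the kind of fallback involved (illustrative only, not necessarily the exact code in the PR): newer py-cpuinfo releases expose the CPU name under brand_raw, older ones under brand, and on some nodes neither key is present at all.

import cpuinfo

# Illustrative fallback, assuming py-cpuinfo is installed: try the new key,
# then the old key, then degrade gracefully instead of raising KeyError.
info = cpuinfo.get_cpu_info()
cpu_brand = info.get("brand_raw") or info.get("brand") or "Unknown CPU"
print(cpu_brand)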
Progress as of 2022-01-18 13:03 UTC

| Dataset name | Complete | Remaining | Comments |
|---|---|---|---|
| SPICE DES Monomers Single Points Dataset v1.1 | 18700 | - | COMPLETE |
| SPICE Solvated Amino Acids Single Points Dataset v1.1 | 1300 | - | COMPLETE |
| SPICE Dipeptides Single Points Dataset v1.1 | 19746 | 14104 | Running now - 58% done |
| SPICE Dipeptides Optimization Dataset v1.0 | 610 | 32717 | Low priority full optimization set |
| SPICE PubChem Set 1 Single Points Dataset v1.1 | 50954 | 67652 | Running now - 43% done |
| SPICE PubChem Set 2 Single Points Dataset v1.0 | 11165 | 110375 | On hold - sorting out storage issue |
| SPICE PubChem Set 3 Single Points Dataset v1.0 | 3642 | 118584 | On hold - sorting out storage issue |
| SPICE Pubchem Set 4 Single Points Dataset v1.0 | 474 | 122276 | On hold - sorting out storage issue |
| SPICE Pubchem Set 5 Single Points Dataset v1.0 | 652 | 122498 | On hold - sorting out storage issue |
| SPICE Pubchem Set 6 Single Points Dataset v1.0 | 0 | 123800 | In queue |
| SPICE DES370K Single Points Dataset v1.0 | 0 | 345682 | In queue |
@dotsdl is working with Ben from QCA to resolve the pubchem sets 2-5 submission issue, and as of now we still have a lot of work for the compute nodes to chew on, so this won't hinder progress. Also, @dotsdl is coordinating with the QCEngine team to get in his fix for the issue of pre-emptible manager jobs leaving behind temp files, and there will be a minor release as soon as the PR gets reviewed.
Thanks! By my count, it completed 43,783 calculations in the last week. A lot of those were dipeptides, which on average are a bit larger than the PubChem molecules. Once it finishes those, I estimate it should be able to do about 60,000 per week. At that rate, it will take about three months to get through all the PubChem molecules. That leaves the DES370K dimers, but they're much smaller so they should go very quickly.
That's just to give us a very vague idea of where we are. We can update those estimates as we get more data in the coming weeks.
An update: despite our efforts to mitigate storage increases, @bennybp is observing a ~30 GB/day increase in utilization on public QCArchive. At this rate MolSSI will run out of storage in ~2 weeks. As an additional mitigation step, we have shut off OpenFF workers executing openff-spice; this allows us to continue executing datasets critical to FF fitting while we work on near-term solutions to the storage issue.
@peastman you may continue running managers on your resources so we still make forward progress here. I don't think execution rates are high enough to add significantly to the problem on the receiving end.
What about the ones that weren't set to store wavefunctions? Could we start the DES370K dimers going?
We should be able to start DES370K without wavefunctions. We can also start pubchem set 6 without wavefunctions.
Realized also that we were executing SPICE Dipeptides Single Points Dataset v1.1 (https://github.com/openforcefield/qca-dataset-submission/pull/259) under the default openff compute tag, and this is also storing wavefunctions. I have changed the compute tag to openff-defunct to park this for now as well.
Thanks! DES370K is much higher priority than pubchem 6.
Also note that we've hardly run any calculations on PubChem 4 and 5, so if it's easier to just delete them and start over, we won't be losing much. And the amount done on PubChem 2 and 3 isn't that much more.
@dotsdl : I'm a bit confused here. Storing wavefunctions seems physically incompatible with QCArchive's current hardware situation. The only solution is to halt all datasets that are storing wavefunctions, re-create them without storing wavefunctions, and purge the datasets with wavefunctions from the QCArchive database directly. Is that the current plan?
@jchodera we have halted all datasets that are storing wavefunctions, and I got confirmation today from @bennybp that the growth rate is no longer alarming.
Our next step is to submit datasets that we have not yet submitted, but without wavefunctions. We have:
As for pubchem sets 1 - 5, as well as SPICE Dipeptides Single Points Dataset v1.1, which we submitted with wavefunctions attached, we have parked these for now with the openff-defunct compute tag. We cannot simply resubmit these with the existing dataset.json.bz2 submission artifacts, even after changing store_wavefunction from orbitals_and_eigenvalues to none, because the destination wavefunction key on a single point record is part of the protocols object, which is not included in the deduplication hashing. The server will view the submission as unchanged and will not create new records with the wavefunction: none protocol we desire here. We also cannot yet delete records and guarantee internal consistency in the DB.
One way around the deduplication machinery is to translate each molecule by e.g. 1 angstrom, and we can do this if necessary, but we can buy some time with the two submissions above and the still-executing SPICE Dipeptides Optimization Dataset v1.0. The cleaner approach in my view would be to attempt record deletion with the upcoming new server deployment, which does support deletion, then resubmit with the existing artifacts.
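As a purely illustrative sketch of that workaround (the helper below is hypothetical, not the actual resubmission script): a rigid 1 angstrom translation changes every coordinate, and therefore the molecule hash used for deduplication, without affecting the computed energies or forces.

import numpy as np

# Hypothetical helper: shift an (N, 3) array of coordinates (in bohr, as in
# QCSchema geometries) by a uniform translation of `shift_angstrom`.
def translate_geometry(geometry_bohr, shift_angstrom=1.0):
    bohr_per_angstrom = 1.0 / 0.529177210903
    return np.asarray(geometry_bohr) + shift_angstrom * bohr_per_angstrom

coords = np.zeros((3, 3))          # e.g. a 3-atom molecule at the origin
print(translate_geometry(coords))  # every coordinate shifted by ~1.889 bohr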
I recognize there is a lot packed in the above. Happy to discuss live or elaborate further if needed. We are trying to push forward where we can without exacerbating the storage issues on public QCA further.
@dotsdl : Maybe this is a good time to decouple the infrastructure issues from what the highest-value science goals are, since we're now five months into the project with almost no usable data to show for it.
@peastman : Can you rank all the datasets in order of importance? My guess at priority (and rationale) is:
@dotsdl : I suggest writing off anything that's already been run as dead, excising it in the future when supported by QCArchive, and restarting the new datasets in order of priority (with the 1A shift hack) with settings that will not kill the QCArchive infrastructure.
Sorry for the odyssey in pain, folks. There are always growing pains in scaling any new technology.
The DES monomers and the solvated amino acids are already done, so there's no reason to rerun them. Here is how I would rank the others.
Just as a note, in the last few days some crazy expensive calculations have started appearing. I have nodes that have been running for nearly 24 hours and only completed 3 tasks. I don't believe anything in SPICE should take anywhere close to that long. Maybe they're from some other project? But my nodes are supposed to be configured to prioritize SPICE calculations over other ones. It would be really helpful if there were some way I could determine what was going on.
Ahh, I might be wrong, please wait for @dotsdl to take a look. My guess is that some of the low-priority tasks from the dipeptide optimization set made it to the queue; I'm checking the managers for the completed jobs, and in the meantime I have re-tagged that set so it will not send any jobs to any queue.
Thank you @jchodera and @peastman. I will prepare new versions of the following (all without wavefunctions, with molecule coordinates translated), and submit in order:
SPICE DES370K Single Points Dataset v1.0 is computing now, and this is the highest priority in @peastman's list.
@peastman do you have the --verbose flag running for your managers? If so, you should see the full task specs in the logs for your managers. These won't tell you what dataset the task is from, but if it has procedure: geometric in the content then this means it is an optimization, not a single point, calculation.
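If it helps, here is a small, hypothetical helper (not part of QCFractal) for skimming a verbose manager log for task specs that mention the geomeTRIC procedure; the log path and the exact formatting of the task specs in your logs are assumptions.

def find_optimization_tasks(log_path):
    # Collect any log lines whose task spec mentions the geometric procedure.
    hits = []
    with open(log_path) as handle:
        for line in handle:
            if "procedure" in line and "geometric" in line.lower():
                hits.append(line.strip())
    return hits

# Example usage: print(find_optimization_tasks("manager.log"))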
@dotsdl SPICE Dipeptides Single Points Dataset is a small one with 33K calculations and it is 60% done. Do you think the storage space we get back from resubmitting pubchem set 1 (and removing the current version with wavefunctions) would be more than enough to compensate for the remaining calculations on this set? We can save some compute time if we do so.
@pavankum since it's small I'd actually rather proceed with a v1.2 resubmission and halt all compute on the v1.1 submission for the Dipeptides Single Point Set. It appears wavefunction storage really does grow space utilization fast with this large basis set, and I'd rather we eliminate it entirely from the sets above. We also don't plan on attempting deletion soon, since it technically isn't supported by the server code and would in effect be direct DB surgery (with potentially unintended negative consequences) until we have the new QCA server code deployed.
A hard halt on wavefunction storage where we don't need them is the approach I'd like to take here. This saves us the space for cases where we do want it, such as OpenFF datasets for ESP work.
@pavankum SPICE Dipeptides Single Points Dataset v1.2 is ready for review! Please do examine it with suspicion; feel free to squash and merge when satisfied!
All PRs to openforcefield/qca-dataset-submission for the above have been created. Submission Action is running now for SPICE Dipeptides Single Points Dataset v1.2 and SPICE PubChem Set 1 Single Points Dataset v1.2.
Thanks!
It's been about two weeks since we restarted the computations without wavefunctions. This seems like a good time for another look at how we're doing overall.
The DES370K dimers are about 99% done. At this point it's cycling errors. For whatever reason, some of these structures seem to be very difficult. I see error rates over 50% on some nodes. I don't know if there's anything we can do about that.
The dipeptides are about half done, completing about 2000 computations per day. At that rate, it should take about another week to complete all of them.
Based on experience the first time we ran them, the PubChem molecules tend to go about twice as fast as the dipeptides, though with a higher error rate (around 10%, compared to no errors at all for the dipeptides). If we continue at the current rate, it will probably take about one month each for the PubChem sets, or about six months to get through all of them. We can start experimenting with fitting models before all of them are done, though.
@dotsdl are we at a point where we can scale up the lilac workers without taking out all the scratch directories? If so, would be great to increase throughput.
@jchodera We're almost there. @bennybp and I have prepared QCFractal release 0.15.8.1, and @bennybp is deploying this to public QCA. Apologies, but an outage at conda-forge on Azure builds delayed the conda package by several days.
I am putting together new prod environments for managers now, and will deploy workers with them to PRP and Lilac once I have word from Ben that public QCA is upgraded. This new release should allow us to fully utilize Lilac and PRP, as well as other resources, for increased throughput.
I've deployed QCFractal 0.15.8.1 managers to Lilac and PRP, and have upped the deployment count for each substantially. I haven't observed indications of the issues we had previously, so assuming we get decent allocations of jobs I expect we'll get higher throughput this week.
That's great news, thanks!
I'm not seeing any sign of the extra compute resources being used. It only seems to be completing about 1000 tasks per day.
We're up to 55 x 16 thread-slots = 880 thread-slots on lilac now.
I see 1329 calculations completed in the last 24 hours in the difference reported for https://github.com/openforcefield/qca-dataset-submission/pull/269.
It has typically been getting through about 2000 per day, so it's down rather than up. I haven't been able to get as many nodes on Sherlock the last few days, which accounts for the decrease. But there's no sign of any new resources being added.
I'm not currently getting any pods on PRP despite my requested resources there, and I do have Lilac prioritized for OpenFF biopolymer work over SPICE. We are mostly running with the cpuqueue on Lilac, which keeps each execution limited to no more than 6 hours. Do we anticipate many of these calculations taking longer than that? If so I can switch off cpuqueue requests and stick to preemptible there, which has a 157 hour limit.
That depends on the hardware. On Sherlock it's averaging about 6 core-hours per calculation. Since each node has 32 cores, the clock time for each calculation is far less than that.
You can still submit cpuqueue jobs with up to 72 hour wall clock limits---it will just exclude some nodes with 6-hour wall clock limits.
Ah cool, thanks @jchodera! Do you think upping to the 72 hour wall clock limit for cpuqueue would significantly reduce our footprint of workers on Lilac, with preemptible jobs in the queue as well?
> Do you think upping to the 72 hour wall clock limit for cpuqueue would significantly reduce our footprint of workers on Lilac, with preemptible jobs in the queue as well?
If you can check the logs of the 6h jobs and see how much time is wasted by early termination, that would help us make a rational decision here!
Looking at the logs I think we're fine; I see many multiples (up to ~50) completed with largely 100% success rates across our cpuqueue workers. I don't think we're losing much with a 6-hour limit, or at least it's not apparent at this time.
Since PubChem set 1 was submitted four days ago, it has averaged just over 2000 calculations per day. Compare that to the 5000 per day it was averaging in mid-January, and that included dipeptides, which were larger than the PubChem molecules. We still seem to be getting very few resources from most of the clusters.
@dotsdl : Can we fully dedicate lilac to the OpenMM calculations? Also, any luck with the HPC folks in helping optimize the lilac deployment? Let me know if you need help with that.
@dotsdl : It looks like you're also asking for one core per thread-slot, which means that we're wasting hyperthreads:
affinity[core(1)*1:distribute=pack]
Was this left over from my previous attempts (which was a mistake!) or the result of new profiling? If we don't do this, we can fit 2x as many workers on the nodes at once. If this is not the result of profiling, maybe we could try removing the whole affinity[core(1)*1:distribute=pack] block and see if that helps throughput?
@jchodera I have not worked with the Lilac HPC folks on any profiling of these workloads. Part of the challenge is the wide range of memory requirements that can come down the pipe, so we have chosen a memory configuration that is reasonably high (providing 70 GiB to psi4 execution) and a number of cores that gives us reasonable coverage of nodes based on bhosts output. If there are ways to optimize our usage of Lilac further, I'm happy to work with you on it.
Over this last week we have de-prioritized SPICE workloads to give more time to ESP optimizations and protein capped 1-mer sidechain torsiondrives across all OpenFF compute resources, as these are key for FF development right now at OpenFF. I will ask on OpenFF#qcfractal-compute to shift priority on Lilac back to SPICE, however.
@jchodera on hyperthreading, I wasn't aware that we were wasting hyperthreads, but I'm also not sure how well psi4 works with hyperthreading (as in, depending on what it's doing it may perform less well); @pavankum, do you have insights? I did not add the affinity... components; I inherited these from the original submission scripts.
@dotsdl hyperthreading is supported by psi4, and I think it is mostly handled by the MKL subroutines, but switching it off is recommended for improved performance; there is a forum post where Holger Kruse comments on that. I think your current Lilac deployment is good, but you can test it with and without the options John specified to confirm.
I'm trying to inspect the progress so far by using the code from the notebook posted above.
import qcportal as ptl

fc = ptl.FractalClient()  # connect to the public QCArchive server
ds = fc.get_collection('Dataset', 'SPICE PubChem Set 1 Single Points Dataset v1.2')
spec = ds.list_records().iloc[0].to_dict()  # the dataset's (single) specification
recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
That returns 118606 records, which is correct. But every one of them has its status set to 'INCOMPLETE'. Is that not the correct way to find tasks that have been finished? I also searched for the ID of a task from the Sherlock logs, and it isn't present. Does that mean it's actually working on some other dataset?
> But every one of them has its status set to 'INCOMPLETE'.
I just checked it. I could see:

| Status | spice |
|---|---|
| COMPLETE | 9621 |
| ERROR | 17 |
| INCOMPLETE | 108968 |
| NaN | 0 |
| RUNNING | 0 |
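For what it's worth, a tally like that can be produced from the recs DataFrame above with something along these lines (a sketch, assuming the records come back in a single column, with any missing records as NaN):

import pandas as pd

# Sketch: count record statuses from the DataFrame returned by ds.get_records(...).
records = recs.iloc[:, 0]  # the single column of ResultRecord objects
statuses = records.apply(lambda rec: "NaN" if pd.isna(rec) else rec.status.value.upper())
print(statuses.value_counts())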
> I also searched for the ID of a task from the Sherlock logs, and it isn't present. Does that mean it's actually working on some other dataset?
Task ids are different from record ids so you may not see it here.
@dotsdl bumped the priority on Lilac, so we should see improved throughput this week; apologies for the slow pace.
> Task ids are different from record ids so you may not see it here.
How can I determine what dataset a task ID is for?
Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?
The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).