CCing @pavankum @dotsdl @simonboothroyd @trevorgokey to see who might have available bandwidth to submit these to QCFractal.
I can help with preparing the submissions; @jthorton already made a template, so I can start from that. Just to clarify, these are all optimization datasets, right? Or single point calculations?
Another question is about the name to prepend to every dataset, such as ML Dipeptides, ML Solvated Amino Acids, etc. @peastman already posed this in #9; it would be great to add that acronym to the dataset names.
These are single point calculations. For each one we want to compute the energy and forces, as well as the quantities listed in https://github.com/openmm/qmdataset/issues/7#issuecomment-926209812. Also the orbital coefficients and eigenvalues, if the storage requirements aren't prohibitive.
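For concreteness, here is a rough sketch of what one such single-point task looks like as a QCSchema input. This is illustrative rather than the actual submission code: the water geometry is a placeholder, the level of theory is the one used elsewhere in this thread, and the wavefunction protocol only applies if we do keep the orbitals.

import qcelemental as qcel
import qcengine

# placeholder molecule; the real inputs come from the prepared conformer sets
mol = qcel.models.Molecule(
    symbols=["O", "H", "H"],
    geometry=[0.0, 0.0, 0.0, 0.0, 1.4, 1.1, 0.0, -1.4, 1.1],  # Bohr
)

inp = qcel.models.AtomicInput(
    molecule=mol,
    driver="gradient",  # energy plus forces (forces are the negative gradient)
    model={"method": "wb97m-d3bj", "basis": "def2-tzvppd"},
    # only if storage permits: keep orbital coefficients and eigenvalues
    protocols={"wavefunction": "orbitals_and_eigenvalues"},
)
result = qcengine.compute(inp, program="psi4")
print(result.properties.return_energy)  # Hartree
print(result.return_result)             # gradient, Hartree/Bohr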
Maybe we can compute some with GPUGRID as well.
Is there anything we can do to get this going? @pavankum you said you were waiting on a new release of QCFractal. Is that release expected soon? I notice they seem to do infrequent releases. The last one was in June, and the one before that was November of last year. If we're just waiting for the next regularly scheduled release, this project could be stuck on hold for months.
I already pinged Ben Pritchard from MolSSI ten days ago on the OpenFF QCFractal channel and offered help; he said he would push to make a release last week. Maybe @jchodera can inquire about the current status.
Great, thanks!
MolSSI just had a major advisory board meeting in Washington DC, which concluded yesterday. Now is probably a good time to inquire about status.
@pavankum : It looks like @bennybp released QCFractal 0.15.7 and QCPortal 0.15.7, and the QCArchive server was upgraded yesterday, so you may be clear to proceed now.
When submitting these conformer datasets, can we also submit a standard OptimizationDataset alongside them, starting from the list of molecule SMILES? That would help us establish which conformer generation scheme(s) are actually useful, since it's not clear at this point which approach makes the most efficient use of computational resources.
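Roughly, such a submission could be put together with openff-qcsubmit along these lines (a sketch only: the molecule list is a placeholder, the dataset name is illustrative, and real submissions go through qca-dataset-submission review):

from openff.toolkit.topology import Molecule
from openff.qcsubmit.factories import OptimizationDatasetFactory

# placeholder SMILES; the real list would be the dipeptide/amino-acid molecules
molecules = [Molecule.from_smiles(s) for s in ["CCO", "CC(=O)NC"]]

factory = OptimizationDatasetFactory()  # default spec; the SPICE level of theory would be configured here
dataset = factory.create_dataset(
    dataset_name="SPICE Dipeptides Optimization Dataset v1.0",
    molecules=molecules,
    description="Geometry optimizations starting from the molecule SMILES list.",
    tagline="SPICE dipeptide optimizations",
)
# dataset.submit(client)  # in practice, submitted via a qca-dataset-submission PR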
@jchodera Sure, will make the submission PRs today.
@peastman I got the ball rolling on submissions, made the PRs to qca-dataset-submission repo (pubchem sets - 243, 245, 246, 247, 248, 249, des370k - 244, solvated amino acids - 239, dipeptides - 251) and will push to compute once they get reviewed by David Dotson/Josh Horton.
I used SPICE as a placeholder name, is there a consensus on the naming convention? For example, a subset of pubchem molecules (2501-5000) is named as "SPICE PubChem Set 2 Single Points Dataset", does this look okay?
@jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets?
@pavankum: Thank you so much!
I used SPICE as a placeholder name, is there a consensus on the naming convention? For example, a subset of pubchem molecules (2501-5000) is named as "SPICE PubChem Set 2 Single Points Dataset", does this look okay?
I think this means you officially get to name this dataset since you did the submission work. :)
@jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets?
Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden.
@peastman : I notice we skipped a dataset of monomers extracted from des370k---could we add those in with higher priority than the dimers? All you have to do is extract the unique set of monomers and conformations.
Also, I notice that many of those PubChem sets look completely nuts:
I'll prioritize trying to get approval to redistribute the other datasets we discussed under nonrestrictive licenses.
Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden.
@jchodera Thank you for the feedback, will add that to the submission list
I think this means you officially get to name this dataset since you did the submission work. :)
Congratulations on your excellent choice of name!
Also, I notice that many of those PubChem sets look completely nuts:
There are definitely some odd ones in there. This is partly due to how I ordered the molecules: it tries to choose ones that are maximally different from anything that has come before, so if a molecule is really unusual, it gets put very early in the dataset. If you only look at the first few pages of the first set, you'll get the impression this collection is full of things that don't look much like drugs, but it quickly settles down into much more ordinary ones.
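For reference, the maximally-different ordering described above can be sketched with RDKit's MaxMinPicker over Morgan fingerprints (illustrative only, not necessarily the exact script used to build these sets):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1"]  # placeholder pool
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
       for s in smiles]

# order the pool so each pick is maximally dissimilar (by Tanimoto on the
# fingerprints) from everything picked before it; unusual molecules surface early
picker = MaxMinPicker()
order = list(picker.LazyBitVectorPick(fps, len(fps), len(fps)))
print([smiles[i] for i in order])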
I notice we skipped a dataset of monomers extracted from des370k
Is there any reason to think they'll be useful? Back when I was trying to train models on DES370K and nothing else, I found that training just on dimers was difficult and adding in a few monomer conformations helped it to learn. But in this case it already has tons of data for single molecules. A few hundred extra molecules isn't likely to make much difference.
Hey all, we are currently executing SPICE PubChem Set 1 Single Points Dataset v1.1. Based on the growth rate of storage usage on QCArchive, at about 5 MiB per calculation with wavefunction stored, we will go beyond QCArchive's storage capacity if we proceed in this way with the other 5 PubChem sets.
Is it known now whether or not wavefunctions (orbitals and eigenvalues) will be needed for the downstream use case of these datasets? If this is not known, can we begin using set 1 for that downstream case to arrive at a decision?
If wavefunctions are not needed at all, we can switch off wavefunction storage for the remaining 5 and proceed immediately. If wavefunctions are or may be required, we will need at least a 5 TiB storage expansion of some kind on QCArchive.
Wavefunctions aren't needed for any of the applications I'm interested in. I think the argument for saving them was that it could save time if we later decided we wanted to compute additional quantities, or redo the computation with a more accurate method (https://github.com/openmm/qmdataset/issues/7#issuecomment-924970376). But if it causes storage problems, I don't think it's necessary.
Thanks for clarifying, @peastman!
@dotsdl : Since the PubChem Set 1 is only 10% done after 4-5 days of compute, maybe it makes sense to purge the dataset and start over without wavefunctions? 5 MiB x 11K calculations is 55 GiB of data that is probably unnecessary, and the dataset would eventually consume 595 GiB for no reason.
@dotsdl : One other quick question: Are the molecules sorted in order of increasing size? It may make sense to do so if you are regenerating the datasets, since this would allow the highest throughput initially, enabling us to catch other issues earlier (rather than later) in dataset generation.
@peastman @jchodera A naive question: I am wondering whether Orbnet used any orbital information in their model, and whether you think that might be relevant to your ML models as well, apart from forces and energies.
@pavankum : It's a great question! I don't doubt that information derived from wavefunction/orbital data would be valuable in training advanced machine learning potentials, but I don't believe any of the architectures we are considering now would make use of this information.
@jeherr: This is a really interesting idea---something to think about!
I am wondering whether Orbnet used any orbital information in their model
Not from the training data, if that's what you mean. Their model begins with a semi-empirical calculation, and the outputs of it become the input to the model. So in that sense, it likely does involve orbital information (I haven't looked at the details to see exactly what values they use). But the dataset just has energies, no orbitals or even forces. That's what they fit to.
Thank you very much for the clarifications!!
Thanks all! I just spoke to @bennybp, and I propose we proceed as follows:
Please let me know if you object to any of this.
If that works for you, go for it!
Where can I find the completed datasets?
Looks like even the SPICE PubChem Set 1 Single Points Dataset v1.1 has a long way to go.
The data is all available in real time through the QCPortal API---see the example usage here, though I think this dataset is a new BasicDataset type that is not yet shown in the examples. You can use the example code in that notebook to browse and retrieve the available data.
The QCPortal API is not very performant for bulk downloads yet---@dotsdl has been working with @bennybp on speeding this up, and MolSSI is recruiting a new postdoc and OpenFF is hiring a contractor (we're still searching) to make improvements to this infrastructure. Another goal is to make the data available via monolithic HDF5 files on the machine learning datasets dashboard---this currently has to be prepared by hand.
What about https://github.com/openforcefield/qca-dataset-submission/pull/254? It claims to be complete. I just want to look it over to make sure all the data content and organization looks right.
I launched a MyBinder notebook using the OptimizationDataset example and tweaked it a bit:
Browsing the SPICE datasets notebook
Takeaways:
SPICE DES Monomers Single Points Dataset v1.0 appears as a Dataset, which I thought was the base class type and not a specific single-point dataset type.
It's possible these issues are caused by the mybinder image being out of date (it has QCPortal v0.15.7), but we're going to need some help from @pavankum @bennybp here.
EDIT: It looks like this is the latest QCPortal version available on conda-forge (~1 month old).
@peastman and @jchodera, putting together a code snippet for how to access the data elements in each dataset. Will post here today.
@dotsdl: If you want to do this within a Jupyter notebook, it can probably be added as a PR directly into the QCPortal examples folder so that others can benefit from it immediately!
Can do, and agreed. Access of Dataset records is woefully undocumented, and those dataset types in particular are very user-unfriendly. @bennybp and I are actively working to remedy this.
Sorry for the delay in response. Here's one small snippet to look at data from a basic dataset:
import qcportal

# connect to the public QCArchive server
client = qcportal.FractalClient()
ds = client.get_collection(collection_type='Dataset',
                           name='SPICE PubChem Set 1 Single Points Dataset v1.1')
print(ds.list_values())

method = 'wb97m-d3bj'
basis = 'def2-tzvppd'

# dataframe of records; this may take a while depending on the size of the dataset
df = ds.get_records(method=method, basis=basis)
print(len(df))

# grab the first completed record (missing records appear as NaN floats)
for i in range(len(df)):
    if not isinstance(df.iloc[i].record, float):
        if df.iloc[i].record.status == 'COMPLETE':
            rec = df.iloc[i].record
            break
print(rec.dict())
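Continuing the snippet above, a sketch of pulling the main quantities out of a completed record (this assumes the standard QCPortal 0.15 record fields and that the dataset was run with the gradient driver):

print(rec.properties.return_energy)   # total energy in Hartree
print(rec.return_result)              # the gradient (forces are its negative), if driver='gradient'
mol = rec.get_molecule()              # geometry the record was computed on
print(mol.symbols, mol.geometry)      # coordinates are in Bohr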
@dotsdl may add more
@peastman you can use this jupyter notebook as a starting point: Using single point Datasets.zip
Please start with SPICE PubChem Set 1 Single Points Dataset v1.1 for your experiments; where present, you'll want to use the v1.1 version of datasets on QCArchive, since these include the wcombine=False fix.
@pavankum from experience, you should specify program and keywords like I do in the notebook; otherwise it is possible to get back records that aren't part of the dataset submission.
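For example (the keyword-set alias used here is an assumption; check the dataset's keyword aliases for the real name):

df = ds.get_records(
    method='wb97m-d3bj',
    basis='def2-tzvppd',
    program='psi4',      # pin the program used for the submission
    keywords='default',  # keyword-set alias; assumed name, look it up on the dataset
)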
I've added this example as a PR to MolSSI/QCFractal#703. I'll capture any feedback on this there to expand its usefulness before merge.
Also, I notice that many of those PubChem sets look completely nuts:
Odd molecules like this are good for regularization of trained models. In general, it's good to avoid targeting anything too specific, and sometimes these molecules can be really information-rich when you consider how much the rest of the data repeats the same common substructures ad infinitum.
This is a really interesting idea---something to think about!
I feel like I have seen some work along these lines before, but I can't recall anything right now. I'll try to look around and see what I can find.
Should've done this earlier; here is a list of the datasets with their updated names and the completion status of each. Right now the only bottleneck is compute resources, since OpenFF datasets are prioritized over these. If you would like to bump the priority of any single set, or order them by priority, please let us know. If you're not satisfied with the pace of computation, it would be great if you could contribute compute resources; we would be happy to help you set up QCFractal compute managers specific to these datasets. I can also keep updating this table (here or in a different issue) every few days so that you can catch up on the progress easily.
Dataset name | Complete | Remaining | Comments |
---|---|---|---|
SPICE DES Monomers Single Points Dataset v1.1 | 18700 | - | COMPLETE |
SPICE Solvated Amino Acids Single Points Dataset v1.1 | 1291 | 9 | Almost there! |
SPICE Dipeptides Single Points Dataset v1.1 | 57 | 33793 | Running now |
SPICE Dipeptides Optimization Dataset v1.0 | 610 | 32717 | Low priority full optimization set |
SPICE PubChem Set 1 Single Points Dataset v1.1 | 42802 | 75804 | Running now |
SPICE PubChem Set 2 Single Points Dataset v1.0 | 0 | 121540 | In queue |
SPICE PubChem Set 3 Single Points Dataset v1.0 | 0 | 122226 | In queue |
SPICE Pubchem Set 4 Single Points Dataset v1.0 | 0 | 122750 | In queue |
SPICE Pubchem Set 5 Single Points Dataset v1.0 | 0 | 123150 | In queue |
SPICE Pubchem Set 6 Single Points Dataset v1.0 | 0 | 123800 | In queue |
SPICE DES370K Single Points Dataset v1.0 | 0 | 345682 | In PR review |
cc: @dotsdl @jthorton @peastman @jchodera
Thanks for the great summary, @pavankum!
If you're not satisfied with the pace of computation, it would be great if you could contribute compute resources; we would be happy to help you set up QCFractal compute managers specific to these datasets.
We should be using all available MSK CPU resources for this. I'll investigate whether something could be set up at Stanford as well.
Thanks! We'll look into what other resources might be available.
We also might be underutilizing MSK resources at the moment. I'm investigating.
Reposting here from OpenFF slack:
@jchodera we are currently limiting our usage of Lilac to avoid filling disks on the compute nodes. I worked with @bennybp on Friday to devise a solution for manager cleanup on termination that is both reliable and acceptable in what it requires from QCEngine. Apologies this has taken so long, but there are several process boundaries and software layers to cross here. There are many poor solutions that fail to fully solve the problem, and so it has taken time to converge on one that does.
I am implementing what we devised on Friday today in MolSSI/QCFractal#700.
This issue on Lilac wasn't apparent in the past because the basis sets we used before didn't occasionally require up to 250 GiB of memory and/or ~70 GiB of scratch space on the compute nodes they landed on. The SPICE datasets have challenged us in new ways.
Thanks so much for the detailed update, @dotsdl, and glad to hear that progress is being made so we can make use of all available resources soon!
I've gotten an account on a cluster at Stanford. Can you provide instructions on what I need to do to start running calculations on it?
@peastman I think we should meet up for a working session to do this. The best approach depends on the configuration of the cluster, and we'll be able to arrive at this quickly in an interactive call.
It's now running on five nodes, each with 32 cores and 256 GB memory. According to the logs, it has completed about 30 tasks so far. Is there a way to confirm you're receiving the results?
@peastman with the information you just shared, that is sufficient. The manager would complain loudly if it wasn't able to send results back to public QCArchive.
Thanks again for your help with this!
That's great! As long as the logs show the task submission and success rate, you're good. It's difficult to monitor which jobs came from which resource (fractal managers of a certain kind: lilac, pacific-research-platform, sherlock, ...). A somewhat tedious way is to go dataset by dataset and grab the manager_name metadata from each result record. Since these are huge datasets, querying for the records takes quite a while (half an hour to an hour or more for each). Among the currently computing sets, I checked the jobs associated with a manager for PubChem sets 1, 2 & 3 and the dipeptide single points, and I can see PubChem set 2 has jobs on sherlock, assuming that's the Stanford cluster:
PubChem set 1: {'LilacQM': 13082, 'PacificResearchPlatformQM': 30707, 'vulkan': 73, 'NewCastlePsi4': 290, 'tscc': 28}
PubChem set 2: {'PacificResearchPlatformQM': 92096, 'LilacQM': 5490, 'tscc': 14317, 'NewCastlePsi4': 51, 'sherlock': 141}
PubChem set 3: {'PacificResearchPlatformQM': 26503, 'LilacQM': 2113, 'tscc': 1490}
Dipeptides single points: {'PacificResearchPlatformQM': 31360, 'tscc': 1348, 'UCI': 1136}
Edit: Was writing this while David made a comment.
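For reference, a sketch of the kind of query behind those tallies, reusing the Dataset/get_records pattern from the earlier snippet (manager names on QCArchive may carry host or uuid suffixes, so grouping by prefix may be needed):

from collections import Counter

df = ds.get_records(method='wb97m-d3bj', basis='def2-tzvppd', program='psi4')
counts = Counter(
    rec.manager_name
    for rec in df['record']
    if not isinstance(rec, float) and rec.status == 'COMPLETE'
)
print(counts)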
Thanks! That's really helpful information.
Based on those numbers, it looks like PacificResearchPlatformQM does far more than everything else put together. But I'm not sure that's really true. For example, the above says it has done 92,096 calculations for pubchem 2. But looking up the latest results for that dataset, only 10,449 have actually been completed. It looks like those numbers mostly reflect it producing a lot of errors very quickly.
It looks like Sherlock is currently completing about 600 calculations a day. At that rate, it would take it about three years to get through the whole dataset. I'll see if I can get a few more nodes, but it's still going to be a pretty minor contribution to what needs to be done. I guess every bit helps though.
Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?
The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).