openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Coordinate running calculations #11

Closed: peastman closed this issue 2 years ago

peastman commented 3 years ago

Are we ready to start submitting calculations? We have three molecule collections ready (solvated amino acids, dipeptides, and DES370K). We have agreement on the level of theory to use and what quantities to compute. I think that means we're ready to put together a draft of the first submission?

The solvated amino acids are probably the best one to start with. It's a very small collection (only 1300 conformations), although they're individually on the large side (about 75-95 atoms).

jchodera commented 3 years ago

CCing @pavankum @dotsdl @simonboothroyd @trevorgokey to see who might have available bandwidth to submit these to QCFractal.

pavankum commented 3 years ago

I can help with preparing the submissions; @jthorton already made a template, so I can start from that. Just to clarify, these are all optimization datasets, right? Or single-point calculations? Another question is about the name to prepend to every dataset, such as ML Dipeptides, ML Solvated Amino Acids, etc. @peastman already raised this in #9; it would be great to include that acronym in the dataset names.

peastman commented 3 years ago

These are single point calculations. For each one we want to compute the energy and forces, as well as the quantities listed in https://github.com/openmm/qmdataset/issues/7#issuecomment-926209812. Also the orbital coefficients and eigenvalues, if the storage requirements aren't prohibitive.
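
For concreteness, here is a minimal Psi4 sketch (not the actual QCFractal submission machinery) of the kind of single-point calculation being discussed, returning the energy and forces. It assumes the level of theory used elsewhere in this thread (wB97M-D3BJ/def2-TZVPPD) and that your Psi4 build exposes that functional name.

import psi4

# A small test molecule; the real inputs are the dataset conformations.
mol = psi4.geometry("""
0 1
O  0.000  0.000  0.000
H  0.757  0.586  0.000
H -0.757  0.586  0.000
""")

psi4.set_options({"basis": "def2-tzvppd"})

# "wb97m-d3bj" is assumed to be available as a functional name in this Psi4 build.
grad, wfn = psi4.gradient("wb97m-d3bj", return_wfn=True)
energy = wfn.energy()   # total energy in Hartree
forces = -grad.np       # forces are the negative of the energy gradient
print(energy, forces.shape)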

giadefa commented 3 years ago

Maybe we can compute some with GPUGRID as well.

peastman commented 3 years ago

Is there anything we can do to get this going? @pavankum you said you were waiting on a new release of QCFractal. Is that release expected soon? I notice they seem to do infrequent releases. The last one was in June, and the one before that was November of last year. If we're just waiting for the next regularly scheduled release, this project could be stuck on hold for months.

pavankum commented 3 years ago

I pinged Ben Pritchard from MolSSI ten days ago on the OpenFF QCFractal channel and offered help; he said he would push to make a release last week. Maybe @jchodera can inquire about the current status.

peastman commented 3 years ago

Great, thanks!

jchodera commented 3 years ago

MolSSI just had a major advisory board meeting in Washington, DC, which concluded yesterday. Now is probably a good time to inquire about status.

jchodera commented 3 years ago

@pavankum : It looks like @bennybp released QCFractal 0.15.7 and QCPortal 0.15.7, and the QCArchive server was upgraded yesterday, so you may be clear to proceed now.

When submitting these conformer datasets, can we also submit a standard OptimizationDataset alongside them as well starting from the list of molecule SMILES? That will help us establish which conformer generation scheme(s) are actually useful, since it's not clear which approach provides the most efficient use of computational resources at this point.

pavankum commented 3 years ago

@jchodera Sure, will make the submission PRs today.

pavankum commented 3 years ago

@peastman I got the ball rolling on submissions: I made PRs to the qca-dataset-submission repo (PubChem sets: 243, 245, 246, 247, 248, 249; DES370K: 244; solvated amino acids: 239; dipeptides: 251) and will push them to compute once they are reviewed by David Dotson/Josh Horton.

I used SPICE as a placeholder name, is there a consensus on the naming convention? For example, a subset of pubchem molecules (2501-5000) is named as "SPICE PubChem Set 2 Single Points Dataset", does this look okay?

@jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets?

jchodera commented 3 years ago

@pavankum: Thank you so much!

I used SPICE as a placeholder name, is there a consensus on the naming convention? For example, a subset of pubchem molecules (2501-5000) is named as "SPICE PubChem Set 2 Single Points Dataset", does this look okay?

I think this means you officially get to name this dataset since you did the submission work. :)

@jchodera do you want optimization datasets for a particular subset for comparison, or for all of these sets?

Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden.

@peastman : I notice we skipped a dataset of monomers extracted from des370k---could we add those in with higher priority than the dimers? All you have to do is extract the unique set of monomers and conformations.
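
As a quick illustration, something along these lines would do it, assuming the DES370K table provides one SMILES column per monomer (the column names "smiles0"/"smiles1" and the file path are hypothetical; check them against the actual file):

import pandas as pd

des = pd.read_csv("DES370K.csv")   # hypothetical path to the DES370K table
monomer_smiles = sorted(set(des["smiles0"]).union(des["smiles1"]))
print(len(monomer_smiles), "unique monomers")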

Also, I notice that many of those PubChem sets look completely nuts: [four screenshots of unusual-looking PubChem molecules]

I'll prioritize trying to get approval to redistribute the other datasets we discussed under nonrestrictive licenses.

pavankum commented 3 years ago

Let's just do dipeptides (251) for now---that should provide an excellent comparison set without adding much to the compute burden.

@jchodera Thank you for the feedback, will add that to the submission list

peastman commented 3 years ago

I think this means you officially get to name this dataset since you did the submission work. :)

Congratulations on your excellent choice of name!

Also, I notice that many of those PubChem sets look completely nuts:

There are definitely some odd ones in there. This is partly due to how I ordered the molecules: it tries to choose ones that are maximally different from anything that has come before, so if a molecule is really unusual, it gets put very early in the dataset. If you only look at the first few pages of the first set, you'll get the impression this collection is full of things that don't look much like drugs, but it quickly settles down into much more ordinary ones.
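
As an illustration of that kind of ordering (a minimal sketch, not the actual selection code used for SPICE), a greedy max-min pick over fingerprint similarity behaves this way: each step chooses the molecule whose nearest already-selected neighbor is least similar, so unusual molecules surface first.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "C1CC1", "CCN"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048) for s in smiles]

def nearest_sim(i):
    # similarity of candidate i to its closest already-chosen molecule
    return max(DataStructs.TanimotoSimilarity(fps[i], fps[j]) for j in ordered)

ordered = [0]                        # start from an arbitrary molecule
remaining = set(range(1, len(fps)))
while remaining:
    # pick the candidate least similar to everything chosen so far
    pick = min(remaining, key=nearest_sim)
    ordered.append(pick)
    remaining.remove(pick)

print([smiles[i] for i in ordered])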

I notice we skipped a dataset of monomers extracted from des370k

Is there any reason to think they'll be useful? Back when I was trying to train models on DES370K and nothing else, I found that training just on dimers was difficult and adding in a few monomer conformations helped it to learn. But in this case it already has tons of data for single molecules. A few hundred extra molecules isn't likely to make much difference.

jchodera commented 3 years ago

Is there any reason to think they'll be useful? Back when I was trying to train models on DES370K and nothing else, I found that training just on dimers was difficult and adding in a few monomer conformations helped it to learn. But in this case it already has tons of data for single molecules. A few hundred extra molecules isn't likely to make much difference.

  1. It's very quick to run
  2. It's a minimal set covering a lot of biomolecular chemical space
  3. Many tools aren't yet set up to run on multiple molecules (e.g. espaloma, for building MM force fields)
  4. There are issues like BSSE that mean the monomer set is actually a bit different from well-separated molecules
  5. We'll easily be able to compare different strategies for generating conformers on this small standard dataset
  6. As you point out, it really helps learning to separate intermolecular from intramolecular interactions to include monomers

dotsdl commented 2 years ago

Hey all, we are currently executing SPICE PubChem Set 1 Single Points Dataset v1.1. Based on the growth rate of storage usage on QCArchive (about 5 MiB per calculation with the wavefunction stored), we will exceed QCArchive's storage capacity if we proceed this way with the other 5 PubChem sets.

Is it known now whether or not wavefunctions (orbitals and eigenvalues) will be needed for the downstream use case of these datasets? If this is not known, can we begin using set 1 for that downstream case to arrive at a decision?

If wavefunctions are not needed at all, we can switch off wavefunction storage for the remaining 5 and proceed immediately. If wavefunctions are or may be required, we will need at least a 5 TiB storage expansion of some kind on QCArchive.
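
For scale, a rough back-of-the-envelope check (illustrative only; the per-set record counts are taken from the status table posted later in this thread):

mib_per_record = 5
pubchem_remaining = 5 * 123_000         # sets 2 through 6, roughly 123k records each
des370k_records = 345_682
total_tib = (pubchem_remaining + des370k_records) * mib_per_record / 1024**2
print(f"{total_tib:.1f} TiB")           # about 4.6 TiB, consistent with the ~5 TiB figure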

peastman commented 2 years ago

Wavefunctions aren't needed for any of the applications I'm interested in. I think the argument for saving them was that it could save time if we later decided we wanted to compute additional quantities, or redo the computation with a more accurate method (https://github.com/openmm/qmdataset/issues/7#issuecomment-924970376). But if it causes storage problems, I don't think it's necessary.

jchodera commented 2 years ago

Thanks for clarifying, @peastman!

@dotsdl : Since the PubChem Set 1 is only 10% done after 4-5 days of compute, maybe it makes sense to purge the dataset and start over without wavefunctions? 5 MiB x 11K calculations is 55 GiB of data that is probably unnecessary, and the dataset would eventually consume 595 GiB for no reason.

jchodera commented 2 years ago

@dotsdl : One other quick question: Are the molecules sorted in order of increasing size? It may make sense to do so if you are regenerating the datasets, since this would allow the highest throughput initially, enabling us to catch other issues earlier (rather than later) in dataset generation.
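
A minimal sketch of that reordering (assuming the molecules are available as SMILES; RDKit is used here only to count atoms):

from rdkit import Chem

smiles = ["CC(=O)NC1=CC=C(O)C=C1", "C", "CCO", "c1ccccc1O"]
# Smallest molecules first, so the cheapest calculations complete (and surface problems) early
smiles_by_size = sorted(smiles, key=lambda s: Chem.MolFromSmiles(s).GetNumHeavyAtoms())
print(smiles_by_size)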

pavankum commented 2 years ago

@peastman @jchodera A naive question: I am wondering whether OrbNet used any orbital information in their model, and whether you think that might be relevant to your ML models as well, apart from forces and energies.

jchodera commented 2 years ago

@pavankum : It's a great question! I don't doubt that information derived from wavefunction/orbital data would be valuable in training advanced machine learning potentials, but I don't believe any of the architectures we are considering now would make use of this information.

@jeherr: This is a really interesting idea---something to think about!

peastman commented 2 years ago

I am wondering whether OrbNet used any orbital information in their model

Not from the training data, if that's what you mean. Their model begins with a semi-empirical calculation, and the outputs of it become the input to the model. So in that sense, it likely does involve orbital information (I haven't looked at the details to see exactly what values they use). But the dataset just has energies, no orbitals or even forces. That's what they fit to.

pavankum commented 2 years ago

Thank you very much for the clarifications!!

dotsdl commented 2 years ago

Thanks all! I just spoke to @bennybp, and I propose we proceed as follows:

  1. We will let SPICE PubChem Set 1 Single Points Dataset v1.1 run as is, with wavefunctions attached. This will allow experimentation with wavefunctions if you are inclined in the near future.
  2. We will submit sets 2 through 6 without wavefunctions attached.

Please let me know if you object to any of this.

jchodera commented 2 years ago

If that works for you, go for it!

peastman commented 2 years ago

Where can I find the completed datasets?

jchodera commented 2 years ago

Looks like even the SPICE PubChem Set 1 Single Points Dataset v1.1 has a long way to go: [progress screenshot]

The data is all available in real time through the QCPortal API---see the example usage here, though I think this dataset is a new BasicDataset type that is not yet shown in the examples. You can use the example code in that notebook to browse and retrieve the available data.

The QCPortal API is not very performant for bulk downloads yet. @dotsdl has been working with @bennybp on speeding this up, and MolSSI is recruiting a new postdoc and OpenFF is hiring a contractor (we're still searching) to make improvements to this infrastructure. Another goal is to make the data available via monolithic HDF5 files on the machine learning datasets dashboard; currently these have to be prepared by hand.

peastman commented 2 years ago

What about https://github.com/openforcefield/qca-dataset-submission/pull/254? It claims to be complete. I just want to look it over to make sure all the data content and organization looks right.

jchodera commented 2 years ago

I launched a MyBinder notebook using the OptimizationDataset example and tweaked it a bit:

Browsing the SPICE datasets notebook

Takeaways:

It's possible these issues are caused by the mybinder image being out of date (it has QCPortal v0.15.7), but we're going to need some help from @pavankum @bennybp here.

EDIT: It looks like this is the latest QCPortal version available on conda-forge (~1 month old).

dotsdl commented 2 years ago

@peastman and @jchodera, putting together a code snippet for how to access the data elements in each dataset. Will post here today.

jchodera commented 2 years ago

@dotsdl: If you want to do this within a Jupyter notebook, it can probably be added as a PR directly into the QCPortal examples folder so that others can benefit from it immediately!

dotsdl commented 2 years ago

Can do, and agreed. Access to Dataset records is woefully undocumented, and those dataset types in particular are very user-unfriendly. @bennybp and I are actively working to remedy this.

pavankum commented 2 years ago

Sorry for the delay in response; here's one small snippet to look at data from a basic dataset:

import qcportal

# Connect to the public QCArchive server (no credentials needed for read access)
client = qcportal.FractalClient()
ds = client.get_collection(collection_type='Dataset', name='SPICE PubChem Set 1 Single Points Dataset v1.1')
print(ds.list_values())

method = 'wb97m-d3bj'
basis = 'def2-tzvppd'

# Dataframe of records; this may take a while depending on the size of the dataset
df = ds.get_records(method=method, basis=basis)
print(len(df))

# Find the first completed record (missing entries come back as NaN floats)
for i in range(len(df)):
    record = df.iloc[i].record
    if not isinstance(record, float) and record.status == 'COMPLETE':
        rec = record
        break

print(rec.dict())

@dotsdl may add more

dotsdl commented 2 years ago

@peastman you can use this jupyter notebook as a starting point: Using single point Datasets.zip

Please start with SPICE PubChem Set 1 Single Points Dataset v1.1 for your experiments; where present, you'll want to use the v1.1 version of datasets on QCArchive, since these include the wcombine=False fix.

dotsdl commented 2 years ago

@pavankum from experience, you should specify program and keywords like I do in the notebook; otherwise it is possible to get back records that aren't part of the dataset submission.
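
For example, something like this (continuing from the snippet above; the keyword alias "default" is an assumption and should be checked against the aliases actually used for the submission):

df = ds.get_records(
    method='wb97m-d3bj',
    basis='def2-tzvppd',
    program='psi4',
    keywords='default',   # assumed alias; verify against the dataset's keyword sets
)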

dotsdl commented 2 years ago

I've added this example as a PR to MolSSI/QCFractal#703. I'll capture any feedback on this there to expand its usefulness before merge.

jeherr commented 2 years ago

Also, I notice that many of those PubChem sets look completely nuts:

Odd molecules like this are good for regularization of trained models. In general, it's good to try not to target anything specific too much, and sometimes these molecules can be really information-rich when you consider how much the rest of the data shares the same common substructures over and over.

This is a really interesting idea---something to think about!

I feel like I have seen some work along these lines before, but I can't recall anything right now. I'll try to look around and see what I can find.

pavankum commented 2 years ago

Should've done this earlier; here is a list of the datasets with their updated names and the completion status of each. Right now the only bottleneck is compute resources, since OpenFF datasets are prioritized over these. If you would like to bump the priority of any single set, or order them by priority, please let us know. If you're not satisfied with the pace of computation, it would be great if you could contribute any compute resources; we would be happy to help you set up QCFractal compute managers specific to these datasets. I can also keep updating this table (here or in a different issue) every few days so that you can catch up on the progress easily.

Dataset name | Complete | Remaining | Comments
SPICE DES Monomers Single Points Dataset v1.1 | 18700 | - | COMPLETE
SPICE Solvated Amino Acids Single Points Dataset v1.1 | 1291 | 9 | Almost there!
SPICE Dipeptides Single Points Dataset v1.1 | 57 | 33793 | Running now
SPICE Dipeptides Optimization Dataset v1.0 | 610 | 32717 | Low priority full optimization set
SPICE PubChem Set 1 Single Points Dataset v1.1 | 42802 | 75804 | Running now
SPICE PubChem Set 2 Single Points Dataset v1.0 | 0 | 121540 | In queue
SPICE PubChem Set 3 Single Points Dataset v1.0 | 0 | 122226 | In queue
SPICE Pubchem Set 4 Single Points Dataset v1.0 | 0 | 122750 | In queue
SPICE Pubchem Set 5 Single Points Dataset v1.0 | 0 | 123150 | In queue
SPICE Pubchem Set 6 Single Points Dataset v1.0 | 0 | 123800 | In queue
SPICE DES370K Single Points Dataset v1.0 | 0 | 345682 | In PR review

cc: @dotsdl @jthorton @peastman @jchodera

jchodera commented 2 years ago

Thanks for the great summary, @pavankum!

jchodera commented 2 years ago

If you're not satisfied with the pace of computation, it would be great if you could contribute any compute resources; we would be happy to help you set up QCFractal compute managers specific to these datasets.

We should be using all available MSK CPU resources for this. I'll investigate whether something could be set up at Stanford as well.

peastman commented 2 years ago

Thanks! We'll look into what other resources might be available.

jchodera commented 2 years ago

We also might be underutilizing MSK resources at the moment. I'm investigating.

dotsdl commented 2 years ago

Reposting here from OpenFF slack:

@jchodera we are currently limiting our usage of Lilac to avoid filling disks on the compute nodes. I worked with @bennybp on Friday to devise a solution for manager cleanup on termination that is both reliable and acceptable in what it requires from QCEngine. Apologies this has taken so long, but there are several process boundaries and software layers to cross here. There are many poor solutions that fail to fully solve the problem, and so it has taken time to converge on one that does.

I am implementing what we devised on Friday today in MolSSI/QCFractal#700.

This issue on Lilac wasn't apparent in the past because the basis sets we used previously never required up to 250 GiB of memory and/or ~70 GiB of scratch space on the compute nodes the tasks landed on. The SPICE datasets have challenged us in ways that are new.

jchodera commented 2 years ago

Thanks so much for the detailed update, @dotsdl, and glad to hear that progress is being made so we can make use of all available resources soon!

peastman commented 2 years ago

I've gotten an account on a cluster at Stanford. Can you provide instructions on what I need to do to start running calculations on it?

dotsdl commented 2 years ago

@peastman I think we should meet up for a working session to do this. The best approach depends on the configuration of the cluster, and we'll be able to arrive at this quickly in an interactive call.

peastman commented 2 years ago

It's now running on five nodes, each with 32 cores and 256 GB memory. According to the logs, it has completed about 30 tasks so far. Is there a way to confirm you're receiving the results?

dotsdl commented 2 years ago

@peastman with the information you just shared, that is sufficient. The manager would complain loudly if it wasn't able to send results back to public QCArchive.

Thanks again for your help with this!

pavankum commented 2 years ago

That's great! As long as the logs show task submission and a good success rate, you're fine. It's difficult to monitor jobs across the different servers (fractal managers of various kinds: lilac, pacific-research-platform, sherlock, ...). A somewhat tedious way is to go dataset by dataset and grab the manager_name metadata from each result record. Since these are huge datasets, querying the records takes quite a bit of time (half an hour to an hour or more each). Among the currently computing sets, I checked the jobs associated with each manager for PubChem sets 1, 2 & 3 and the dipeptide single points, and I can see PubChem set 2 has jobs on sherlock, assuming that's the Stanford cluster:

Pubchem set 1: {'LilacQM': 13082, 'PacificResearchPlatformQM': 30707, 'vulkan': 73, 'NewCastlePsi4': 290, 'tscc': 28}
Pubchem set 2: {'PacificResearchPlatformQM': 92096, 'LilacQM': 5490, 'tscc': 14317, 'NewCastlePsi4': 51, 'sherlock': 141}
Pubchem set 3: {'PacificResearchPlatformQM': 26503, 'LilacQM': 2113, 'tscc': 1490}
Dipeptides single points: {'PacificResearchPlatformQM': 31360, 'tscc': 1348, 'UCI': 1136}
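
A hedged sketch of how such counts can be tallied, reusing the dataframe of records from the earlier snippet; it assumes each result record exposes a manager_name field, which is worth verifying against your QCPortal version:

from collections import Counter

records = df.record.dropna()   # drop entries that have no record attached
manager_counts = Counter(r.manager_name for r in records if r.status == 'COMPLETE')
print(manager_counts)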

Edit: Was writing this while David made a comment.

peastman commented 2 years ago

Thanks! That's really helpful information.

Based on those numbers, it looks like PacificResearchPlatformQM does far more than everything else put together. But I'm not sure that's really true. For example, the above says it has done 92,096 calculations for pubchem 2. But looking up the latest results for that dataset, only 10,449 have actually been completed. It looks like those numbers mostly reflect it producing a lot of errors very quickly.

It looks like Sherlock is currently completing about 600 calculations a day. At that rate, it would take it about three years to get through the whole dataset. I'll see if I can get a few more nodes, but it's still going to be a pretty minor contribution to what needs to be done. I guess every bit helps though.