openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Also add QM energies with 'default' OpenFF compute spec? #39

Closed jchodera closed 2 years ago

jchodera commented 2 years ago

@peastman: I just realized that the dataset we generated only used the QM level of theory used for OpenMM SPICE, which would mean the data is not useful to the OpenFF folks because it is not compatible with the default OpenFF compute spec (B3LYP-D3BJ/DZVP). We included both levels of theory for this recent RNA dataset so the dataset would be compatible with both OpenMM SPICE and OpenFF datasets, and it looks like the OpenFF level of theory is much less expensive.

Would it be OK to have @pavankum add the OpenFF compute spec to the SPICE QCArchive dataset so we end up with both sets of QM data on QCArchive? We can still primarily distribute the more expensive QM data in our HDF5 distributions, but having both would enable multiple applications:

peastman commented 2 years ago

It's fine if you want to compute the same conformations at a lower level of theory. But let's be careful about calling the result "SPICE". We don't want to do anything that might create confusion, or lead someone to get the low accuracy results thinking they're getting the high accuracy ones.

At some point you might want to consider updating to a better level of theory for OpenFF. B3LYP is pretty dated at this point. There are newer functionals that provide better accuracy at the same cost.

tmarkland commented 2 years ago

There is probably a good naming convention one could use for SPICE configurations but at a different level of theory (maybe OpenFF already has adopted a particular one) e.g. SPICE(B3LYP-D3BJ/DZVP) or SPICE@B3LYP-D3BJ/DZVP etc. where SPICE would refer to the current level of theory and the ones with brackets or @ would denote the same configurations computed a different way.

peastman commented 2 years ago

The risk with that is that someone would see a reference to it somewhere and come away thinking, "SPICE uses a cheap, inaccurate level of theory." It would have a high risk of causing confusion.

jchodera commented 2 years ago

I definitely agree we want to avoid confusion!

We can give the other levels of theory a less prominent role in the manuscript (or even name them SPICE-lite, etc), and control what we put in the HDF5 files we make available for download and how we name them, which will be the primary way people interact with the dataset.

If they access it through the QCPortal, they will see there are multiple levels of theory attached---it would be impossible for them to conclude there is only one low level of theory present.

Practically, it would also be a huge pain, a significant waste of space, and rather awkward to try to correlate data between datasets if other levels of theory were generated as entirely separate groups of datasets in QCArchive.

Does this make sense? Or am I missing some other failure mode of concern?

peastman commented 2 years ago

I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

pavankum commented 2 years ago

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

Yeah, access is through explicit specification of the theory level, as in the line here in the downloader script. We can completely avoid mentioning other QC specs if we choose to, and whoever wants to work with the other spec can download it of their own volition.

If models from the second spec are much closer in accuracy to spice_default, then it would be helpful to do much larger molecules with the second spec, as John mentioned.

> In practice I expect very few people to access it directly through the API.

I agree.
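
For illustration, the access pattern discussed above (several levels of theory attached to each entry, all of them enumerable, with the higher-accuracy one returned unless another is requested explicitly) can be sketched in plain Python. This is not the QCPortal API and not the actual downloader script: the dictionary layout, the function names `list_specs` and `get_energies`, the entry names, and the energies are all hypothetical placeholders.

```python
from typing import Dict, List, Optional

# Spec labels taken from this thread; how they are actually named on
# QCArchive is not shown here and is an assumption.
DEFAULT_SPEC = "ωB97M-D3BJ/def2-TZVPPD"  # the SPICE level of theory
OPENFF_SPEC = "B3LYP-D3BJ/DZVP"          # the default OpenFF compute spec

# Toy stand-in for a multi-spec dataset: entry name -> {spec label -> energy}.
# Entry names and numbers are placeholders, not real SPICE data.
TOY_DATASET: Dict[str, Dict[str, float]] = {
    "mol-a/conf-0": {DEFAULT_SPEC: -1.0, OPENFF_SPEC: -0.9},
    "mol-a/conf-1": {DEFAULT_SPEC: -2.0, OPENFF_SPEC: -1.8},
    "mol-b/conf-0": {DEFAULT_SPEC: -3.0},  # second spec missing (e.g. errored)
}


def list_specs(dataset: Dict[str, Dict[str, float]]) -> List[str]:
    """Return every spec label present on at least one entry, sorted."""
    return sorted({spec for results in dataset.values() for spec in results})


def get_energies(
    dataset: Dict[str, Dict[str, float]], spec: Optional[str] = None
) -> Dict[str, float]:
    """Return {entry name: energy} for a single level of theory.

    If no spec is given, the high-accuracy default is used, so a caller
    never receives the cheaper data without asking for it by name.
    Entries that lack the requested spec are skipped.
    """
    chosen = DEFAULT_SPEC if spec is None else spec
    if chosen not in list_specs(dataset):
        raise KeyError(f"unknown spec: {chosen!r}")
    return {
        name: results[chosen]
        for name, results in dataset.items()
        if chosen in results
    }


if __name__ == "__main__":
    print(list_specs(TOY_DATASET))
    print(get_energies(TOY_DATASET))               # default spec
    print(get_energies(TOY_DATASET, OPENFF_SPEC))  # explicit opt-in
```

Keeping both specs on the same entries is what makes it straightforward to compare them per conformation; with entirely separate datasets, the entry names would have to be matched up by hand.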

peastman commented 2 years ago

That sounds like a good plan.

jchodera commented 2 years ago

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

I think our primary user group will be downloading the HDF5 files we control, or via the downloader we provide.

But QCArchive has a great QCPortal API that is improving its support for bulk downloads. Even now, it's a great way to explore datasets. Check out this example, which shows how to access a reaction dataset and browse which levels of theory and molecules are available.

jchodera commented 2 years ago

@pavankum is running this for us now! https://github.com/openforcefield/qca-dataset-submission/pulls?q=is%3Apr+label%3Acompute-openff-spice+

It looks like essentially everything is complete (except for some errored calculations).

jchodera commented 1 year ago

I definitely agree we want to avoid confusion!

Practically, we can give the other levels of theory a less prominent role in the manuscript (or even name them SPICE-lite, etc), and control what we put in the HDF5 files we make available for download and how we name them, which will be the primary way people interact with the dataset.

If they access it through the QCPortal, they will see there are multiple levels of theory attached---it would be impossible for them to conclude there is only one low level of theory present.

Practically, it would also be a huge pain, a big waste of space, and very awkward to try to correlate data between datasets if other levels of theory were generated as entirely separate groups of datasets in QCArchive.

Does this make sense? Or am I missing some other failure mode of concern?

peastman commented 1 year ago

Let me emphasize once again: SPICE is computed at ωB97M-D3BJ/def2-TZVPPD. Any computations performed at any other level of theory are not SPICE. They are a different dataset that needs to have a different name and must never be referred to as "SPICE", "SPICE-lite", or anything similar. Anything else will create confusion. If the current organization of the data on QCArchive creates confusion, then the data organization needs to be fixed.

giadefa commented 1 year ago

There is a similar situation with MD17. It has been computed at two different levels of theory and it is often confusing in papers at what level a benchmark is done.

On Tue, Oct 11, 2022 at 5:42 PM Peter Eastman wrote:

> Let me emphasize once again: SPICE is computed at ωB97M-D3BJ/def2-TZVPPD. Any computations performed at any other level of theory are not SPICE. They are a different dataset that needs to have a different name and must never be referred to as "SPICE", "SPICE-lite", or anything similar. Anything else will create confusion. If the current organization of the data on QCArchive creates confusion, then the data organization needs to be fixed.
