One-click QCArchive data

dgasmith commented 5 years ago

QCArchive is getting crowded with other datasets, and the output of something like client.list_collections is becoming quite long, so it is getting difficult to tell OpenFF data from other datasets. To enumerate a few issues in this regard:

Many collections come from other groups; it would be good if only OpenFF datasets could be listed from a client.
Able to mark datasets that are still in progress and perhaps not show them by default.
How can we download just the relevant (or most useful?) OpenFF data in one click?

We are currently overhauling Collections and have the chance to alleviate these issues. From discussion with @davidlmobley I wanted to open the floor for suggestions and feature requests here.

davidlmobley commented 5 years ago

Seems like, on our end, we should rename the datasets so they all have consistent and helpful names that clearly indicate they are related to OpenFF (e.g. OpenFF at the beginning of their name?).

On the infrastructure end, would be nice to have some kind of tag so someone can just ask for all of the OpenFF-tagged datasets. So, yes to this:

How can we download just the relevant (or most useful?) OpenFF data in one click?

On this:

Able to mark datasets that are still in progress and perhaps not show them by default.

That'd be helpful

@jchodera @yudongqiu @trevorgokey may want to weigh in. Maybe also @j-wags .

yudongqiu commented 5 years ago

I think it would be helpful to have tags, which support intersections. For example, to list collections related to OpenFF:

client.list_collections(tags=['OpenFF'])

And to only list the collections that are used in release-1 fitting:

client.list_collections(tags=['OpenFF', 'release-1'])

which is a subset of the previous.

To avoid displaying too many collections by default, we can assign a few collections with the "default" tag, so calling

client.list_collections()

will only show those datasets.
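A minimal sketch of the tag-intersection behaviour proposed here, written against an in-memory list and not the real client. Everything in it is an assumption for illustration: the `CollectionInfo` class, the standalone `list_collections` function, the `tags` and `default_tag` parameters, and the collection names are all made up, and none of them are existing QCPortal API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CollectionInfo:
    """Hypothetical stand-in for one collection entry; not a QCPortal class."""
    name: str
    tags: frozenset = field(default_factory=frozenset)


def list_collections(collections, tags=None, default_tag="default"):
    """Return names of collections that carry every requested tag.

    With no tags given, fall back to the assumed "default" tag.
    """
    wanted = set(tags) if tags else {default_tag}
    # Subset test: a collection matches only if it has all wanted tags.
    return [c.name for c in collections if wanted <= c.tags]


# Made-up catalog entries, for illustration only.
catalog = [
    CollectionInfo("OpenFF Optimization Set A", frozenset({"OpenFF", "release-1", "default"})),
    CollectionInfo("OpenFF Torsion Set B", frozenset({"OpenFF"})),
    CollectionInfo("Other Group Set C", frozenset({"default"})),
]

openff = list_collections(catalog, tags=["OpenFF"])
release_1 = list_collections(catalog, tags=["OpenFF", "release-1"])
shown_by_default = list_collections(catalog)
```

Because matching is a subset test, adding a tag to the query can only narrow the result, which is why the `release-1` listing is a subset of the `OpenFF` one.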

yudongqiu commented 5 years ago

To make the use of tags more convenient, I also agree that we can assign a different default tag when creating the client, such as:

client = ptl.FractalClient('https://api.qcarchive.molssi.org:443/', default_tag='OpenFF')

This will then make client.list_collections() show only the OpenFF collections, equivalent to calling client.list_collections(tags=['OpenFF']).
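To make the intended equivalence concrete, here is a rough sketch using a toy wrapper class. `TaggedClient`, its constructor arguments, and the name-to-tags mapping are hypothetical; `default_tag` is the argument being proposed in this thread, not something FractalClient is known to accept.

```python
class TaggedClient:
    """Hypothetical wrapper illustrating a client-level default tag."""

    def __init__(self, collections, default_tag=None):
        # `collections` maps a collection name to its set of tags (made-up data model).
        self._collections = dict(collections)
        self._default_tag = default_tag

    def list_collections(self, tags=None):
        # Explicit tags win; otherwise fall back to the client's default tag, if any.
        if tags is None and self._default_tag is not None:
            tags = [self._default_tag]
        wanted = set(tags or [])
        return sorted(
            name for name, have in self._collections.items() if wanted <= set(have)
        )


client = TaggedClient(
    {"OpenFF Set A": {"OpenFF"}, "Other Set B": {"misc"}},
    default_tag="OpenFF",
)

default_view = client.list_collections()
explicit_view = client.list_collections(tags=["OpenFF"])
everything = client.list_collections(tags=[])  # empty list bypasses the default tag
```

One design question this raises is how a user opts out of the default; in this sketch an explicit empty tag list does that, but that is only one possible choice.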

trevorgokey commented 5 years ago

I think the type/name of the collections can just be searched? For example client.list_collections("torsiondrive", "openff"), which searches the strings of the collection type/name? client.list_collections("", "openff") gives all OpenFF collections. Tags would be really good for things like release-1, so something like client.list_collections("", "openff", tags=["release-1.1.0"]) will give me all collections used to produce the corresponding OpenFF force field.

Thinking further, tags could be synonymous with specification searching, e.g. `client.list_collections("torsiondrive", "", tags=["ani"])` gives me any TD collections that contain ANI specs.
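A sketch of how that combined search could behave, assuming case-insensitive substring matching on type and name plus an all-tags-required filter. The `search_collections` function, the tuple layout, and the catalog entries are hypothetical; treating "ani" as a plain tag is also a simplification of the specification-search idea.

```python
def search_collections(collections, collection_type="", name="", tags=()):
    """Filter (type, name, tags) tuples by substring and tag match.

    An empty string matches everything, since "" is a substring of any string.
    """
    results = []
    for ctype, cname, ctags in collections:
        if collection_type.lower() not in ctype.lower():
            continue
        if name.lower() not in cname.lower():
            continue
        if not set(tags) <= set(ctags):
            continue
        results.append((ctype, cname))
    return results


# Made-up catalog entries, for illustration only.
catalog = [
    ("TorsionDriveDataset", "OpenFF Torsion Set A", {"release-1.1.0"}),
    ("OptimizationDataset", "OpenFF Optimization Set B", {"release-1.1.0", "ani"}),
    ("TorsionDriveDataset", "Other Group Torsions", {"ani"}),
]

openff_torsions = search_collections(catalog, "torsiondrive", "openff")
all_openff = search_collections(catalog, "", "openff")
release_sets = search_collections(catalog, "", "openff", tags=["release-1.1.0"])
ani_torsions = search_collections(catalog, "torsiondrive", "", tags=["ani"])
```

Note that name searching only works as a filter for OpenFF data if the naming convention is followed consistently, whereas tags do not depend on how a collection was named.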

dgasmith commented 5 years ago

At a high level we have selection 1) at client initialization time and 2) at collection search time via tags. Both are quite straightforward to do. Are there other types of limitation that we would like to provide?

davidlmobley commented 5 years ago

What about a way for US to tag data that we used in a release even if the dataset was not complete? e.g. suppose we'd run a dataset in which some of the calculations were not complete when we did the fits for release-1, so that the dataset itself might expand later, but not the compounds which were used in release-1?

dgasmith commented 5 years ago

It would be best to build new collections (since they are just pointers) to that data.
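As a rough illustration of why a new collection is cheap, the sketch below models a collection as a plain mapping from entry name to record ID, so a release collection is just a second set of pointers to the same records. The data model, the `build_release_collection` helper, and the entry names and IDs are all assumptions for illustration, not how QCFractal stores collections.

```python
def build_release_collection(source, used_entries):
    """Return a new {entry name: record id} mapping limited to `used_entries`.

    No record data is copied; the new mapping points at the same record IDs.
    """
    missing = [entry for entry in used_entries if entry not in source]
    if missing:
        raise KeyError(f"entries not in source collection: {missing}")
    return {entry: source[entry] for entry in used_entries}


# Made-up source collection: entry name -> record ID.
source = {"mol-1": 101, "mol-2": 102, "mol-3": 103}

# Freeze the entries that were complete and used at fitting time.
release_1 = build_release_collection(source, ["mol-1", "mol-3"])

# The source dataset can keep growing without changing the release collection.
source["mol-4"] = 104
```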

davidlmobley commented 5 years ago

OK, perfect. @yudongqiu can we make sure to pull all the data actually used for fitting for release-1 into a collection?

yudongqiu commented 5 years ago

It is possible. Several concerns:

There are several types of fitting data that were used in release-1, so there will need to be several collections, each holding one type.
The new collections will be largely the same as the existing ones.
The definition of “actually used for fitting” is not exactly clear. There is post-QM filtering, and ForceBalance also chooses which data points contribute to the objective function based on the input settings.
The details of the post-QM filtering of the data will be lost.

Therefore, if the purpose of building such collections is to allow people to reproduce the fitting, this will not work well. The best reference is still the release tarball, which can be reproduced by running a few scripts.

davidlmobley commented 5 years ago

@yudongqiu -- I was thinking mainly of an easy way to pull the molecules (and only those molecules) utilized in fitting, however much they contributed to the fit (but not those molecules NOT utilized in fitting). From my perspective it'd be convenient to have a way to pull these directly from QCArchive to be able to have a "one-stop shop" for figuring out what chemistry our fitting process has seen and what it hasn't.

I think this will be a common question: Which molecules was this trained on? If one wants full details of exactly what was used from exactly which molecules, how much weight it carried, etc., the best place is the release tarball. But if you just want to know "which molecules" (AFTER post-QM filtering), this would be a good place to identify them.

Not saying we need to do it necessarily, just explaining my thought process.

jchodera commented 5 years ago

Do we have any clarity on naming conventions right now? Is it critical to prefix them with OpenFF?

davidlmobley commented 4 years ago

I'd like to see them prefixed with OpenFF because it makes it easy to know what they are. But I am not aware of other conventions yet, though we should have some!

openforcefield / qca-dataset-submission

One-click QCArchive data #43