openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

What molecules to include in version 1 #1

Closed · peastman closed this 2 years ago

peastman commented 3 years ago

We need to decide what molecules to include in the first version of the dataset. Here is the current proposal based on the most recent meeting.

peastman commented 2 years ago

Enamine building blocks: The full set is much too big to include. @jchodera will propose a subset.

Just a reminder that we still need this. Can you suggest the subset to use?

peastman commented 2 years ago

I'm looking into DrugBank. It appears to me that chemical structures are only available as part of a commercial product. That means first that we would have to pay for them, and second that we presumably wouldn't be allowed to redistribute them. Am I missing something?

peastman commented 2 years ago

Enamine also seems unclear. Do they provide redistributable, machine readable structures for all their molecules? They want you to create an account and sign in just to download their catalog, which suggests it's proprietary.

peastman commented 2 years ago

If we can't use the above, what other databases could we consider using? ChEMBL is probably one of the most obvious choices. They have about 2.1 million "bioactive molecules with drug-like properties". Over 1.9 million of them are tagged as "small molecule".

jchodera commented 2 years ago

There's a huge difference between "compounds that researchers are going to be able to buy and be able to use for drug discovery" and "compounds in ChEMBL that may be in private screening libraries and inaccessible to most researchers."

Let's start with public building block sets, but I am confident we can persuade Enamine of the importance of building good models on their building blocks library, or a key subset of it. I'll work on getting us approval.

peastman commented 2 years ago

I'll work on getting us approval.

That would be great, thanks!

What about DrugBank? Am I correct in concluding it's off limits to us? If so, is ChEMBL a good alternative?

jchodera commented 2 years ago

We could try pulling subsets from ZINC, e.g. https://zinc.docking.org/catalogs/dbfda/substances/

peastman commented 2 years ago

What are the ideal properties we want in these molecules? In the near term, the number of molecules we can process will be in the tens of thousands at most, which is only a tiny fraction of what's in ZINC or ChEMBL. Given a large database, we can filter the molecules by size and to maximize diversity. Beyond that, what should we look for in selecting them?

jchodera commented 2 years ago

ZINC15 contains curated downloadable subsets, like "all FDA approved compounds", that we could start with. That's why I pointed to that page.

In order of priority (high to low), we want to do a good job modeling molecules that

peastman commented 2 years ago

I think this is the page you intended to link to?

https://zinc.docking.org/substances/subsets/

Our goal here is to sample a wide cross section of chemical space, such that a model trained on it will do a decent job of extrapolating to novel molecules. We want to enable general purpose potential functions that can accurately model any molecule someone throws at them, not just molecules they were specifically trained on. Limiting the dataset to FDA approved drugs wouldn't serve that goal. In fact, that's probably one of the least diverse subsets. In version 2 of the dataset it might be worth adding them anyway to try to get better accuracy on them, if we think they're especially important and if the accuracy isn't already good enough. (The whole FDA subset is only 1379 molecules, so not very much.) But for the initial core dataset, the key goal is to have broad coverage of as much chemical space as possible.

Here's my suggestion:

jchodera commented 2 years ago

OK, let's take your suggestion and run with it a bit: We don't want diverse molecules, we want diverse chemical environments. Is there a way we can instead try to select a set of molecules that fills out the diversity of environments instead of just trying to pick dissimilar molecules?

Could we use a fingerprint that is sensitive to local differences in chemical environment and try to pick the smallest set of small molecules that covers the most diverse environments? If we're going for ANI or SchNet, perhaps we could even use ANI/SchNet chemical environment vectors to cluster/measure dissimilarity in environments?

peastman commented 2 years ago

I think that's what fingerprints will give you. Each bit in the fingerprint corresponds to a local structure. If we were instead interested in global structure, I would suggest categorizing the molecules by scaffold, but I agree that would be less useful.

jchodera commented 2 years ago

I think this is only true if you use the bit-vector representation, rather than hashed or folded representations. However, it will not tell you how "similar" different local environments are; they just get binned into separate bits. That's why using the atomic environment vectors may be a better way to directly determine whether we are covering the right distribution of atomic environments.

peastman commented 2 years ago

In a sense the bits in a fingerprint and the AEVs in an ANI model are conceptually very similar to each other. Each describes the local environment around an atom. The main difference is that the former is based on covalent bonding, while the latter is based on distance in some particular conformation.

In another sense, though, they're quite different. Fingerprints are specifically designed for comparing molecules. You compute the similarity between two fingerprints and you directly have a measure of whether those molecules have similar local environments. With AEVs it's less clear. Given two molecules with different numbers of atoms, and a high dimensional vector for each atom in each molecule, how do you turn it into a single number describing the similarity of the molecules? We'll have to invent our own method, explain it to users, and perform experiments to justify that it works. Whereas if we use fingerprints, that's a standard method that's widely used in the literature and is already known to work well.

There's also a huge difference in computational efficiency. Any method we come up with for computing distances between molecules based on the AEVs of all their atoms will be orders of magnitude slower than computing a Tanimoto similarity between two fingerprints. That's important, since we'll be sorting a few hundred thousand molecules with an O(N^2) algorithm. Using fingerprints it will take a few hours. (I've done roughly the same thing before in other contexts.) With AEVs it will take days to weeks. And remember that the size of an AEV scales as O(N^2) in the number of elements, and that we want to include a lot of elements.
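
To make the fingerprint route concrete, here is a minimal RDKit sketch (the SMILES list, fingerprint parameters, and pick size are placeholders, not the actual selection script): compute Morgan fingerprints, measure Tanimoto similarity, and let a MaxMin picker pull out a diverse subset.

```python
# Minimal sketch of fingerprint-based diversity selection with RDKit.
# The input SMILES and all parameters below are placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]  # hypothetical input set
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Tanimoto similarity between two bit-vector fingerprints is a cheap bit operation.
print(DataStructs.TanimotoSimilarity(fps[0], fps[1]))

# MaxMin picking selects a maximally diverse subset, using 1 - Tanimoto as the distance.
picker = MaxMinPicker()
picks = picker.LazyBitVectorPick(fps, len(fps), 2)
print(list(picks))
```

The lazy picker only computes distances as it needs them, which is part of what keeps the fingerprint approach tractable at the few-hundred-thousand-molecule scale.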

jchodera commented 2 years ago

This approach has swung so far away from the "this is the task we want to perform well on" territory and into the "this is the thing very easy for me to do" territory that I feel compelled to ask "how can you at least monitor if we're doing a good job on what we intend to do, which is accurately model the first two use cases (FDA and PDB) listed above?" If we can at least have a plan to assess whether we're totally failing at those tasks, it seems fine to proceed with the "art of the possible" plan above.

peastman commented 2 years ago

what we intend to do, which is accurately model the first two use cases (FDA and PDB) listed above?

For you, FDA approved drugs may be especially important. For a lot of other people it will be much more important to simulate whatever novel drugs they're developing. Here is the specific goal we agreed on at the meeting last month.

Initial goal: Simulate drug-like small molecules interacting with proteins (including water and ions).

I'm trying to design a dataset that is well suited for that purpose. The word "initial" is critical, because this dataset is supposed to grow with time. Right now we're designing version 1, which will serve as the initial core dataset to bootstrap the process. Once we have it, we can train models on it and see how well they do. That will inform the creation of version 2, in which we add more data to improve performance in the cases where they work badly. Then repeat.

jchodera commented 2 years ago

For you, FDA approved drugs may be especially important. For a lot of other people it will be much more important to simulate whatever novel drugs they're developing. Here is the specific goal we agreed on at the meeting last month.

No, they're not important for me. I'm stating the obvious: Compounds with structural data available are going to be orders of magnitude easier for our users to model than compounds without structural data. Many more users will simulate compounds with structural data available than those without, because it is orders of magnitude more difficult to model complexes involving compounds without structural data. I don't know why this is a difficult or contentious topic.

jchodera commented 2 years ago

I'm trying to design a dataset that is well suited for that purpose. The word "initial" is critical, because this dataset is supposed to grow with time. Right now we're designing version 1, which will serve as the initial core dataset to bootstrap the process. Once we have it, we can train models on it and see how well they do. That will inform the creation of version 2, in which we add more data to improve performance in the cases where they work badly. Then repeat.

And how will we measure "improvement"? Of what metric? On what dataset?

peastman commented 2 years ago

And how will we measure "improvement"? Of what metric? On what dataset?

Compute forces and energies for other molecules and/or conformations not in the dataset and see how accurate the model is. We also discussed active learning approaches where you train an ensemble of models and use the variation between them to identify molecules/conformations where the existing training data isn't sufficient.
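
As a toy illustration of the ensemble idea (the array shapes and names are made up, not an actual SPICE workflow): score each conformation by how much the ensemble members disagree on the predicted forces, then send the highest-scoring conformations back for QM labeling.

```python
# Toy sketch of query-by-committee style active learning; all data here is random.
import numpy as np

# Hypothetical force predictions: (n_models, n_conformations, n_atoms, 3)
predictions = np.random.normal(size=(5, 100, 30, 3))

# Standard deviation across the ensemble, averaged over atoms and components,
# gives one disagreement score per conformation.
disagreement = predictions.std(axis=0).mean(axis=(1, 2))

# Conformations where the models disagree most are candidates for new QM data.
worst = np.argsort(disagreement)[::-1][:10]
print(worst)
```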

jchodera commented 2 years ago

Compute forces and energies for other molecules and/or conformations not in the dataset and see how accurate the model is. We also discussed active learning approaches where you train an ensemble of models and use the variation between them to identify molecules/conformations where the existing training data isn't sufficient.

Great. How about we at least grab the <= 50 atom molecules from the FDA and PDB datasets into two separate datasets we can monitor as "molecules not in the dataset we'd like to do well on". These sets should be small, after all. Then we can have the best of both worlds: Diverse molecules from ZINC and subsets of FDA and PDB to check our assumptions?
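
Something like this quick RDKit check would do for the atom-count cut (the example SMILES and the choice to count hydrogens are assumptions, not a settled spec):

```python
# Rough sketch of a "<= 50 atoms" filter; cutoff and hydrogen handling are assumptions.
from rdkit import Chem

def small_enough(smiles, max_atoms=50):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # skip anything RDKit cannot parse
    return Chem.AddHs(mol).GetNumAtoms() <= max_atoms  # count hydrogens too

held_out = [s for s in ["CCO", "c1ccccc1O"] if small_enough(s)]
print(held_out)
```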

jchodera commented 2 years ago

Hm, it seems that Enamine is providing their REALSpace sets freely downloadable online (no restrictions noted): https://enamine.net/compound-collections/real-compounds/real-database#

Perhaps the sets under 50 atoms would also be good to cluster?

peastman commented 2 years ago

How about we at least grab the <= 50 atom molecules from the FDA and PDB datasets into two separate datasets we can monitor as "molecules not in the dataset we'd like to do well on".

Sure, sounds like a good plan.

it seems that Enamine is providing their REALSpace sets freely downloadable online

I couldn't find any licensing information, and when I clicked to download part of it, the site asked me to log in. A bit more searching led me to the paper describing the database. Here's what it says about availability.

The complete lists of reagents used to construct the chemical space supporting the current study have not been deposited in a public repository owing to the company's policy but are available from the corresponding author on request. There are restrictions on the availability of the in-house code and the synthon lists with the reactivity features that have been used to generate the chemical space owing to commercial confidentiality reasons.

peastman commented 2 years ago

I couldn't find any licensing information on the ZINC website, but I found a page for it at AWS, which says,

ZINC is free as in beer. You may not redistribute without the written permission of John Irwin

That clearly won't work unless we can get permission.

I did find licensing information for ChEMBL. They use CC BY-SA. The wording is a bit ambiguous, but it could reasonably be interpreted as saying that our own dataset, any model trained on that dataset, and any software containing one of those models must itself be CC BY-SA licensed. That would also be problematic.

peastman commented 2 years ago

I think the best plan is to write to Enamine, ChEMBL, and ZINC and ask for explicit permission to include molecules from their databases in our dataset (and distribute it under the BSD license used by QCArchive). Since you have contacts at Enamine, it's probably best for you to talk to them. I can write to ChEMBL and ZINC, unless you have contacts there too.

I suppose we could also try DrugBank, though they may be less open to the idea. Unlike the others, they make money by charging for access to their database.

peastman commented 2 years ago

A couple of other sources to consider.

PubChem is run by NIH, so everything they do is in the public domain. Here's what their policies say about it:

Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution. However, some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted). NCBI is not in a position to assess the validity of such claims and since there is no transfer of rights from submitters to NCBI, NCBI has no rights to transfer to a third party.

That's probably about the best we can ask for. The main question would be figuring out how to assemble an appropriate collection of molecules. They have 111 million compounds total, but the organization of them isn't great.

ChemSpider is a similarly sized database, but its terms are more restrictive. From its FAQ answer to the question "Can I download the complete dataset?":

You can assemble a database of 5000 structures or less, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. If you want to do more than this, please contact us for help – we don’t make the entire set available for free download.

If nothing else works out, we can certainly do something reasonable with PubChem. I wrote to ChEMBL and ZINC yesterday, but I haven't received any reply yet from either.

peastman commented 2 years ago

I've been digging into PubChem a bit more. They pull data from many sources. For each molecule, they track where it came from and what license terms are imposed by each source. Possibly the easiest approach would be to just identify a list of sources that have no restrictions and tend to provide drugs or drug-like molecules, and retrieve everything that came from those sources. Here are some likely candidates.

That's several million compounds right there.

peastman commented 2 years ago

It looks like it would be pretty easy to do what I described above. You can download a SDF file of all the molecules from a given source. From there, it's easy to filter them and extract the SMILES strings and IDs, which is all I need.
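
For illustration, roughly this kind of RDKit loop would do it (the file name and the ID property tag are assumptions, not the actual script):

```python
# Rough sketch of extracting IDs and SMILES from a downloaded PubChem SDF file.
# "substances.sdf" and the PUBCHEM_SUBSTANCE_ID tag name are assumptions.
from rdkit import Chem

records = []
for mol in Chem.SDMolSupplier("substances.sdf", removeHs=False):
    if mol is None:
        continue  # skip records RDKit cannot parse
    sid = mol.GetProp("PUBCHEM_SUBSTANCE_ID") if mol.HasProp("PUBCHEM_SUBSTANCE_ID") else ""
    records.append((sid, Chem.MolToSmiles(mol)))

print(len(records))
```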

Shall I go ahead and start on it? Any suggestions on which sources are most valuable?

peastman commented 2 years ago

I downloaded all the substances for BindingDB and ChemIDplus (a bit over 1.4 million) and applied the following filters:

That left 588,999 molecules (400,816 from BindingDB, 188,183 from ChemIDplus), which is more than enough for our purposes. We'll only be able to actually use a tiny fraction of them. ChemIDplus covers anything cited in the National Library of Medicine's databases, so this should include lots of approved and experimental drugs.

Unless someone speaks up, I'm going to move forward with these molecules.

peastman commented 2 years ago

Have you had a chance to talk to Enamine? Any word about what they'll give us permission to use?

jchodera commented 2 years ago

I think the best plan is to write to Enamine, ChEMBL, and ZINC and ask for explicit permission to include molecules from their databases in our dataset (and distribute it under the BSD license used by QCArchive).

@peastman : I'm in contact with Enamine right now. Which license do we want to use? You mention "BSD", but that's a license for software, right? Don't we want something like CC0 or CC-BY for the text associated with a SMILES archive?

peastman commented 2 years ago

According to https://qcarchive.molssi.org/privacy/, each "QCArchive project" is covered by the BSD 3-clause license. But perhaps that's supposed to refer to the software they write, not the datasets in the archive? Anyway, I'm happy to go with something as permissive as possible. CC0 or CC-BY would both be fine.

jchodera commented 2 years ago

@bennybp: We wanted to clarify the licensing policy for data hosted on QCArchive. See the comments above. What license(s) should we secure for molecule identities fed into the QCFractal workflow to make sure all licenses are compatible?

I've been following the Reproducible Research Standard, which recommends CC0 or the Science Commons Open Data Protocol for datasets.

peastman commented 2 years ago

Closing since version 1 is now released.