Citation of protein fragment dataset

openmm / spice-dataset

A collection of QM data for training potential functions

MIT License

Citation of protein fragment dataset #58

Closed: davkovacs closed this issue 1 year ago

davkovacs commented 1 year ago

I think this dataset from PhysNet is worth at least a mention in the paper, and perhaps (part of) it could be recomputed and added to SPICE:

https://zenodo.org/record/2605372#.Y3uiNS8w0Q0

peastman commented 1 year ago

Does that dataset provide any information beyond what we already have in the dipeptides and solvated amino acids subsets? It looks like they mainly designed it to keep the molecules as small as possible so the DFT calculations would be fast. Their fragments have a maximum of eight heavy atoms, and even with solvent there's a maximum of 21 heavy atoms. That makes it fast, but at the cost of providing a less realistic representation of proteins than what you get with complete amino acids and dipeptides.

davkovacs commented 1 year ago

It contains many different things, including fragment-fragment interactions, different protonation states (total charge up to ±2), protonated and deprotonated water clusters of up to 40 water molecules, and, importantly, 2.7 million structures overall.

I will recompute a small sample of it with PSI4 and test a SPICE-trained model on it to see whether the errors are significantly greater than on a SPICE test set. If that is the case, then we can be sure it is sampling parts of configuration space not currently covered by SPICE.
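A rough sketch of what that comparison could look like, assuming the recomputed reference energies and the model predictions are already available as arrays in the same units. The function names and the numbers in the usage lines are hypothetical placeholders, not results from either dataset.

```python
import numpy as np


def mean_absolute_error(predicted, reference):
    """Mean absolute error between predicted and reference values."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(predicted - reference)))


def error_ratio(pred_new, ref_new, pred_test, ref_test):
    """Ratio of the MAE on the new sample to the MAE on the existing test set.

    A ratio well above 1 would suggest the new sample lies outside the
    region of configuration space the model was trained on.
    """
    return mean_absolute_error(pred_new, ref_new) / mean_absolute_error(
        pred_test, ref_test
    )


# Hypothetical placeholder values, for illustration only.
ratio = error_ratio(
    pred_new=[0.0, 0.0], ref_new=[2.0, 4.0],
    pred_test=[0.0, 0.0], ref_test=[1.0, 1.0],
)
print(f"MAE ratio (new sample / SPICE test set): {ratio:.2f}")
```

The comparison is only meaningful if both sets of reference values come from the same level of theory, which is why the sample would be recomputed with PSI4 first.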

Anyway, I thought you might not know about this dataset, since it was not mentioned in the paper and I think it is relevant. Happy to move this into GitHub Discussions if you enable that feature; I am not sure an issue is the best place to discuss it.

davkovacs commented 1 year ago

FYI the paper contains a detailed discussion of the dataset and how it was generated. https://pubs.acs.org/doi/pdf/10.1021/acs.jctc.9b00181

peastman commented 1 year ago

They generated their conformations by simulating each molecule for 100 fs and saving a conformation every 1 fs. That's far too short to sample much beyond bond length and angle oscillations. And they started each simulation from an energy-minimized conformation rather than an equilibrium one. They have a lot of conformations, but they're all clustered in a small region around the energy minimum.
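To put the 100 fs figure in perspective, here is a back-of-the-envelope sketch that converts a vibrational wavenumber into a period and counts how many periods fit in one trajectory. The two wavenumbers are illustrative assumptions (roughly a C-H stretch and a soft torsion-like mode), not values taken from the dataset or its paper.

```python
SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def vibrational_period_fs(wavenumber_cm):
    """Period in femtoseconds of a vibration with the given wavenumber (cm^-1)."""
    return 1e15 / (SPEED_OF_LIGHT_CM_PER_S * wavenumber_cm)


def periods_covered(total_time_fs, wavenumber_cm):
    """Number of full vibrational periods that fit in the simulated time."""
    return total_time_fs / vibrational_period_fs(wavenumber_cm)


TRAJECTORY_FS = 100.0  # simulation length quoted in the thread

# Assumed, illustrative wavenumbers (not from the dataset):
for label, wavenumber in [("stiff stretch, ~3000 cm^-1", 3000.0),
                          ("soft torsion-like mode, ~100 cm^-1", 100.0)]:
    period = vibrational_period_fs(wavenumber)
    count = periods_covered(TRAJECTORY_FS, wavenumber)
    print(f"{label}: period {period:.1f} fs, {count:.2f} periods in 100 fs")
```

Under these assumptions, a stiff bond stretch completes several oscillations within 100 fs, while a soft mode does not finish even one, which is consistent with the point that such short trajectories mostly sample bond and angle vibrations near the starting geometry.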