isayev / ASE_ANI

ANI-1 neural net potential with python interface (ASE)
MIT License
220 stars 56 forks

Full training set? #8

Closed - proteneer closed this 7 years ago

proteneer commented 7 years ago

We're waiting on the full dataset (with augmentation via normal mode sampling) to do a complete validation of the ANI model - @lilleswing recommended using Git LFS for distribution.

Thoughts?

ghutchis commented 7 years ago

For one, you probably need a separate repository. How much data is it?

Jussmith01 commented 7 years ago

Hello,

We are currently in the process of publishing the data descriptor and data set. We will be submitting before the weekend, and the data should be available shortly after. We will add a link to this repo's README when we have one to share. Thanks!

isayev commented 7 years ago

Hey, @proteneer and @ghutchis: this is a multi-gigabyte data set. We will provide a simple Python package to read and slice the data.

ghutchis commented 7 years ago

@isayev - I was guessing. But GitHub (even with LFS) may not be the ideal place to store it.

hlwoodcock commented 7 years ago

Hi all - if anyone's school uses Google for email services, then Google Drive should offer unlimited storage and an easy option for sharing.

andersx commented 7 years ago

For a permanent solution, I suggest storing the dataset somewhere that allows for a DOI. Not sure if you can get this using Google Drive. Maybe datadryad.org? I guess it also depends on how much data we are talking about, and whether you are willing to spend money on hosting at all.

proteneer commented 7 years ago

What type of information is in the training set? Atomic coordinates, types, and predicted QM energies? Bond orders? Topologies? SMILES?

We could also consider hosting a mirror of it ourselves.

isayev commented 7 years ago

@andersx yup, we will host it with a DOI! @proteneer this data is XYZ-file-like: for each molecule we have a 3D array containing the Cartesian coordinates of every conformer, a vector of atom species, and a vector of energies. We don't use bond orders or topologies.
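
For concreteness, a minimal NumPy sketch of that layout (the shapes and names here are illustrative assumptions, not the published format):

```python
import numpy as np

# Hypothetical sizes: one molecule with 5 atoms and 300 sampled conformers.
n_conformers, n_atoms = 300, 5

# Cartesian coordinates for every conformer: (n_conformers, n_atoms, 3).
coordinates = np.zeros((n_conformers, n_atoms, 3))

# One species label per atom, shared by all conformers of the molecule.
species = np.array(["C", "H", "H", "H", "O"])

# One total energy per conformer.
energies = np.zeros(n_conformers)
```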

proteneer commented 7 years ago

I see - I presume you're tossing out formal charges as well, then? If you're also throwing away bond orders/topologies it might be a little difficult to do reconstruction/debugging, but we can live with xyz-like for now. You can probably get away with 16 bits of precision + compression to reduce the file sizes (internally we use something like the GROMACS XTC format + gzip with 16 bits to drastically reduce sizes).
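
For what it's worth, a rough Python sketch of that idea (illustrative only; the real XTC codec is more involved):

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(scale=5.0, size=(1000, 20, 3))  # float64 coordinates, Angstrom

# XTC-style fixed-point encoding: scale to milli-Angstrom and round to int16.
# (Values beyond ~32.7 Angstrom would overflow; real code should check the range.)
scale = 1000.0
quantized = np.round(coords * scale).astype(np.int16)

raw_bytes = coords.nbytes
packed_bytes = len(gzip.compress(quantized.tobytes()))
print(f"float64: {raw_bytes} B -> int16+gzip: {packed_bytes} B")

# Round-trip error is bounded by half a quantization step (~0.0005 Angstrom).
restored = quantized.astype(np.float64) / scale
print("max abs error:", np.abs(restored - coords).max())
```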

isayev commented 7 years ago

The whole point of this approach is to be QM-like! It does not rely on anything but element species and coordinates. We ran and successfully converged all systems in DFT. All of them are neutral. I think we have SMILES strings too; however, for every SMILES string there will be an ensemble (~100-1000) of 3D conformations.

The dataset will be available as a (lossless) HDF5 file with a Python wrapper class.
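
Something like the following, perhaps (a minimal sketch with h5py; the group/dataset names are assumptions, not the final schema):

```python
import h5py

class ANIDataLoader:
    """Minimal reader for an HDF5 file laid out as one group per molecule,
    each group holding 'coordinates', 'species', and 'energies' datasets.
    This layout is a hypothetical example, not the published schema."""

    def __init__(self, path):
        self._f = h5py.File(path, "r")

    def __iter__(self):
        for name, group in self._f.items():
            yield {
                "name": name,
                "coordinates": group["coordinates"][()],  # (n_conf, n_atoms, 3)
                "species": [s.decode() for s in group["species"][()]],
                "energies": group["energies"][()],        # (n_conf,)
            }

    def close(self):
        self._f.close()

# Usage sketch:
# loader = ANIDataLoader("ani_dataset.h5")
# for mol in loader:
#     print(mol["name"], mol["coordinates"].shape)
# loader.close()
```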

proteneer commented 7 years ago

Okay - works for us.

proteneer commented 7 years ago

Closing - thanks guys!