openmm / spice-dataset

A collection of QM data for training potential functions
MIT License
153 stars 9 forks

How to organize and distribute data #21

Closed peastman closed 2 years ago

peastman commented 2 years ago

Because of limitations in qcfractal, we had to break up the dataset into many pieces and generate them separately. It's still intended to be a single dataset, though. How can we make it easy for someone to obtain the entire dataset all at once?

A related issue is that we'll continue generating data and producing regular versioned releases. A paper might say, "The model was trained on SPICE version 2." A person who wants to reproduce the work needs an easy way to download that version and be sure they're getting exactly the data used in the paper.

How can we do this?

jchodera commented 2 years ago

@pavankum @dotsdl @jthorton may know more about how we can do this.

pavankum commented 2 years ago

A paper might say, "The model was trained on SPICE version 2."

Yeah, we agree with this. We discussed the same thing in the last QCF meeting, and the consensus was to download the data and upload it to Zenodo with a "date accessed" field. That would decouple it from any changes to the living database, from updates to the software used to access those datasets, and from the addition of more completed calculations. Any other suggestions are welcome.

It's still intended to be a single dataset, though.

Once we figure out the HDF5 output, it should be easy to patch together data from multiple datasets if that's the intent. @dotsdl / @jthorton may add more.
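Stitching the per-subset exports back into a single file could be as simple as copying every top-level group into one combined HDF5 file. This is only a sketch, assuming each subset is exported to its own file and group names are unique across subsets; the filenames are hypothetical.

```python
import h5py

def merge_hdf5(inputs, output):
    """Copy every top-level group from each input file into a single
    combined file. Assumes group names are unique across inputs."""
    with h5py.File(output, "w") as out:
        for path in inputs:
            with h5py.File(path, "r") as src:
                for name in src:
                    src.copy(name, out)

# Hypothetical per-subset exports combined into one dataset:
# merge_hdf5(["dipeptides.hdf5", "des-monomers.hdf5"], "SPICE.hdf5")
```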

peastman commented 2 years ago

There's been some relevant discussion over at #11, particularly @jthorton's suggestion that we handle this with an API in qcsubmit. I'm replying here since it's more on topic for this issue.

The main thing is to make sure it's easy for users. Anything that requires them to write code just to download the dataset creates a huge barrier. If it requires them to learn the API for qcsubmit or qcportal, that's an even bigger barrier.

One option is to provide the data in multiple forms. We can provide a downloader application that internally uses qcsubmit and/or qcportal, but doesn't require the user to write any code. They just edit a configuration file to specify what data they want and run it. Then we can use that program to generate a file containing only the most important information. Most users can just download that file, which should be only a few GB. People who want less common data fields can run the downloader.

giadefa commented 2 years ago

What about a version variable in the HDF5 file?
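A minimal sketch of that idea: stamp the release version as an attribute on the file root when exporting, so a consumer can verify which release they have before training. The value "2" below is just a placeholder.

```python
import h5py

# When writing the export, stamp the release version on the file root
# ("2" is a placeholder for the actual release number).
with h5py.File("SPICE.hdf5", "w") as f:
    f.attrs["version"] = "2"

# A consumer can then check which release they have before training:
with h5py.File("SPICE.hdf5", "r") as f:
    print(f.attrs["version"])
```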


peastman commented 2 years ago

Here's an updated version of the downloader that uses a configuration file to specify what to include.

from collections import defaultdict

import h5py
import numpy as np
import yaml
from qcportal import FractalClient

# Read the list of subsets and QC variables to export.
with open('config.yaml') as f:
    config = yaml.safe_load(f.read())

client = FractalClient()
outputfile = h5py.File('SPICE.hdf5', 'w')
for subset in config['subsets']:
    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    # All records in a subset share one specification; take it from the first record.
    spec = ds.list_records().iloc[0].to_dict()
    recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
    # Group conformations of the same molecule: indices look like '<name>-<conformer>'.
    recs_by_name = defaultdict(list)
    mols_by_name = defaultdict(list)
    for i in range(len(recs)):
        rec = recs.iloc[i]
        index = recs.index[i]
        name = index[:index.rfind('-')]
        recs_by_name[name].append(rec.record)
        mols_by_name[name].append(all_molecules.loc[index][0])
    # Write one HDF5 group per molecule, with one row per conformation.
    for name in recs_by_name:
        group = outputfile.create_group(name)
        group_recs = recs_by_name[name]
        molecules = mols_by_name[name]
        qcvars = [r.extras['qcvars'] for r in group_recs]
        group.create_dataset('subset', data=[subset], dtype=h5py.string_dtype())
        group.create_dataset('smiles', data=[molecules[0].extras['canonical_isomeric_explicit_hydrogen_mapped_smiles']], dtype=h5py.string_dtype())
        group.create_dataset('atomic_numbers', data=molecules[0].atomic_numbers, dtype=np.int16)
        group.create_dataset('conformations', data=np.array([m.geometry for m in molecules]), dtype=np.float32)
        for value in config['values']:
            # 'DFT TOTAL ENERGY' becomes the dataset name 'dft_total_energy'.
            key = value.lower().replace(' ', '_')
            group.create_dataset(key, data=np.array([v[value] for v in qcvars]), dtype=np.float32)

Here is the configuration file for it.

subsets:
  - 'SPICE Solvated Amino Acids Single Points Dataset v1.1'
  - 'SPICE Dipeptides Single Points Dataset v1.2'
  - 'SPICE DES Monomers Single Points Dataset v1.1'
  - 'SPICE DES370K Single Points Dataset v1.0'
  - 'SPICE DES370K Single Points Dataset Supplement v1.0'
  - 'SPICE PubChem Set 1 Single Points Dataset v1.2'
  - 'SPICE PubChem Set 2 Single Points Dataset v1.2'
  - 'SPICE PubChem Set 3 Single Points Dataset v1.2'
  - 'SPICE PubChem Set 4 Single Points Dataset v1.0'
  - 'SPICE PubChem Set 5 Single Points Dataset v1.0'
  - 'SPICE PubChem Set 6 Single Points Dataset v1.0'
values:
  - 'DFT TOTAL ENERGY'
  - 'DFT TOTAL GRADIENT'
  - 'MBIS CHARGES'
  - 'MBIS DIPOLES'
  - 'MBIS QUADRUPOLES'
  - 'MBIS OCTUPOLES'
  - 'SCF DIPOLE'
  - 'SCF QUADRUPOLE'
  - 'WIBERG LOWDIN INDICES'
  - 'MAYER INDICES'

It's a lot faster than the previous version. I estimate that downloading the entire dataset will take less than a day. Most of the time is now spent in ds.get_records(). Is there any way to speed that up?

If we remove all the values except 'DFT TOTAL ENERGY' and 'DFT TOTAL GRADIENT', the complete dataset should only be 1-2 GB. We can easily create that version and post it somewhere for people to download.
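For whoever ends up consuming that reduced file, reading it back is straightforward. This is a sketch assuming the layout the export script above produces: one group per molecule, with dataset names derived by lowercasing the QC variable names ('DFT TOTAL ENERGY' becomes 'dft_total_energy'), and values in the atomic units QCArchive stores.

```python
import h5py
import numpy as np

def load_energies(path):
    """Return {molecule name: (atomic numbers, conformations, energies)}
    from a file with the layout written by the export script above."""
    out = {}
    with h5py.File(path, "r") as f:
        for name, group in f.items():
            out[name] = (
                np.asarray(group["atomic_numbers"]),    # (n_atoms,)
                np.asarray(group["conformations"]),     # (n_conf, n_atoms, 3)
                np.asarray(group["dft_total_energy"]),  # (n_conf,)
            )
    return out
```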

jchodera commented 2 years ago

@peastman: Instead of posting the script to retrieve the data from QCPortal and produce a standardized HDF5 file for distribution in an issue, can you open a PR to add this script? We can discuss the format there, document the script, ensure that the format you generate is also documented, and ensure that everything is consistent when we cut a release.

@jeherr @yuanqing-wang : Can you take a look at this format and see if this will be useful to you for applications like espaloma?

@bennybp: Would we be able to host the HDF5 file on the QCArchive Machine Learning Dashboard with the other QCArchive datasets?

peastman commented 2 years ago

It's just a proof of concept. First we need to agree on whether this is even the approach we want to take.

jchodera commented 2 years ago

@peastman : Can you put this script into a PR so we can get it pulled into the repository? Folks like @yuanqing-wang could also use this to retrieve the data needed for generating figures for the paper.

peastman commented 2 years ago

What do people think of this approach? Is this script and the approach described above in https://github.com/openmm/spice-dataset/issues/21#issuecomment-1105417534 the way we want to go?

pavankum commented 2 years ago

Yeah, your script looks great. @bennybp offered to run it server-side for faster access; @dotsdl may add more.

dotsdl commented 2 years ago

@peastman and @jchodera: I agree with the approach of using the script above to export data in a form most useful to this project. I think we should proceed with it. And @bennybp is happy to run a final export server-side if speed is an issue.

@pavankum, @bennybp, and I are aware of the desire for a recommended path to citeable datasets. For dataset types beyond the single-point calculations we are exporting here, this becomes much thornier, and getting a general solution right will be a project that can likely be taken on at MolSSI later this year.

Any solution for the above will be based on the next QCFractal iteration, which we are currently testing for deployment, not on the currently-deployed QCFractal codebase that is running on public QCArchive now.

peastman commented 2 years ago

Thanks. I'll clean it up a bit and check it in.

peastman commented 2 years ago

Closing since version 1 is now released. The downloader folder contains a script for downloading, and there's a file containing the most commonly used data fields attached to the release.

jchodera commented 2 years ago

Why wouldn't we? Many other quantum chemical datasets for machine learning are also distributed via the QCArchive ML dashboard, and downloading a 2 GB file is orders of magnitude faster than the 12-hour API-based retrieval from QCPortal. Is further discussion of the idea needed?

We just need to finalize all the details of storage format, documentation, precision, compression, which data is included, how usable the result is, etc.

peastman commented 2 years ago

What are you talking about? We do already provide a file for people to download.