leeping / forcebalance

Systematic force field optimization.

Experimental Data File Specification #59

Open leeping opened 10 years ago

leeping commented 10 years ago

Starting a new topic because this focuses on how the experimental data files are stored on disk (these files are created / edited by the user), rather than how they are parsed or stored in memory. Note that we can't easily go the other way (i.e. convert from the internal representation to data tables on disk), but that's presently not an intended use case.

If you are thinking of a nontrivial data type, please mention it here and we can discuss how it can fit into this framework.

Specification for experimental data file format in ForceBalance (updated 4/1/2014)

Context and Purpose: In ForceBalance, I currently have a fairly rigid format for storing liquid and lipid bilayer properties (the Liquid and Lipid targets), and a simple format for specifying general thermodynamic properties (the Thermo target, contributed by @ebran).

The following spec attempts to encompass more types of physical / chemical experimental data for force field parameterization, so that we can incorporate a larger number of properties into the Thermo target. The format is intended to be highly flexible for different data types without compromising on simplicity.

File Formats: Three formats will be available: comma-separated (.csv), tab-delimited, and fixed width.
leeping commented 10 years ago

@ebran, @jchodera and @kyleabeauchamp are especially asked to chime in. :)

leeping commented 10 years ago

Well, one experiment that wouldn't fit into this framework is a 2DIR spectrum since it's not a single column but rather an entire table. I probably won't be fitting 2DIR data in the near future, but is there anything else similar to this?

ebran commented 10 years ago

Hi Lee-Ping,

I like the overall structure of the suggested format of the experimental data file. Most of the things we discuss are covered. A couple of comments:

Best, Erik

jchodera commented 10 years ago

Have you seen standard experimental data file specifications, like ISATAB?

Also, many of our instruments generate XML formats natively, but I suppose there will always be a need for an intermediate processing step.

leeping commented 10 years ago

Erik: Thanks. Regarding metadata: It has to go somewhere. Are you suggesting this can go in the ForceBalance input file or in a separate file in targets? The experimental uncertainties are a property of the data so I would prefer to keep it in the same file if possible. The quantum corrections are more a property of the force field, so we can move those to the input file.

Regarding column width: The fixed width formatting helps when there is missing data, or if we want to include array data in the same table. Otherwise the data could be read into the wrong column.


Regarding ensembles: The only potential case for ambiguity in the near future is running NVT simulations and measuring the pressure tensor. Constant chemical potential simulations are still a ways off. The volume / density cannot be specified in the data table anyway, so it's really just the pressure. I think we can introduce the observable name pressure_obs and this won't add any layers of complexity.

John: I should have known there's a more general format out there. :) But I looked at the ISATAB example files and it looks like most of the measurements are biological, and somewhat different from the "fittable" properties that we'd like to include.

Our measurements are more physical / chemical in nature and I think they're a lot simpler than many biological experiments. By specifying our own format, one advantage is that we can have it "make sense" for ourselves and still keep it simple.

Now that I've looked at their tables, it's clear our format cannot encompass all experiments; thus I should probably just restrict the "purpose" statement above.

The XML format is great for computers to read and write, but not as easy for people. Since the simplest applications involve typing in a few entries by hand, I agree an intermediate processing stage for XML would be good (we could bundle the converters in bin or extras if desired).

leeping commented 10 years ago

More notes regarding fixed width: It shouldn't be a hard-coded fixed width like the PDB. We should determine the column width from the header line.

I think both left-justified and right-justified text are okay since the parser could figure this out, but right-justified text is more natural for the parser (the end of each word in the header line determines the column width).
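To make this concrete, here is a minimal sketch of header-driven fixed-width parsing under the right-justified convention; the column names, sample data, and function name are invented for illustration and are not part of any ForceBalance format:

```python
# Toy sketch: the end of each word in the header line determines the right
# edge of its (right-justified) column; blank fields become None (missing).
def parse_fixed_width(lines):
    header = lines[0]
    edges, pos = [], 0
    for word in header.split():
        pos = header.index(word, pos) + len(word)
        edges.append(pos)                  # right edge of each column
    names = header.split()
    rows = []
    for line in lines[1:]:
        start, fields = 0, []
        for end in edges:
            fields.append(line[start:end].strip() or None)
            start = end
        rows.append(dict(zip(names, fields)))
    return rows

table = [
    "     T(K)   P(atm)  density",
    "   298.15      1.0   0.9970",
    "   313.15            0.9922",  # missing pressure stays in its column
]
for row in parse_fixed_width(table):
    print(row)
# {'T(K)': '298.15', 'P(atm)': '1.0', 'density': '0.9970'}
# {'T(K)': '313.15', 'P(atm)': None, 'density': '0.9922'}
```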

ebran commented 10 years ago

I agree with that. Clarification on metadata: I meant that I don't think metadata should be allowed inside a table, only before or after it.

-- Erik

kyleabeauchamp commented 10 years ago

IMHO tab- or whitespace-delimited files are a good compromise between easy parsing and easy reading. CSV files are nearly foolproof against formatting errors, but are a bit harder to read.

Avoid anything that looks like PDB, as PDB is complete garbage.

leeping commented 10 years ago

Erik: My comment was more from a parsing perspective; if we explicitly disallow metadata inside the table, the parser needs to know "am I before the table, inside the table, or finished reading the table?" Otherwise, the parser only needs to know "have I read the header yet?", which is simpler. We can explicitly disallow it, though.
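For illustration, here is a toy sketch of the simpler single-flag parser (assuming, as an invented convention, that metadata/comment lines start with #):

```python
# Toy sketch: metadata lines are allowed anywhere, so the only state the
# parser tracks is whether the header has been read yet.
def parse(lines):
    header, rows = None, []
    for line in lines:
        if line.lstrip().startswith("#"):  # metadata/comment, allowed anywhere
            continue
        if header is None:                 # first non-comment line is the header
            header = line.split()
        else:
            rows.append(line.split())
    return header, rows
```

Disallowing metadata inside the table would mean additionally tracking whether the table has ended (e.g. at a blank line).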

Kyle: Okay. Since my proposed fixed width format is different from the PDB, I am assuming you're okay with it. :)

kyleabeauchamp commented 10 years ago

yes

jchodera commented 10 years ago

I think metadata, especially provenance data, is absolutely essential. It's critical that the source of the data is traceable.

Do you envision having a separate data file for each source, or could one data file have data from multiple sources?

I think tab-, CSV-, or whitespace-delimited files are garbage. They are too inflexible to represent the fact that the data may be pretty sparse over the conditions you are interested in, they are not really machine-readable in any complex way (e.g. how do you specify units?!), and they barely qualify as human-readable.

jchodera commented 10 years ago

I think your best approach would be to have an XML file where you can easily specify sparse measurements of the same property potentially culled from multiple data sources, with as much additional metadata as possible about the conditions, units, error, and provenance in a computer-readable format.
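As a rough sketch of what one such record might look like (all tag and attribute names here are invented for illustration, not an agreed format), using Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record for one sparse measurement.
m = ET.Element("measurement", observable="density")
ET.SubElement(m, "conditions", temperature="298.15 K", pressure="1.0 atm")
v = ET.SubElement(m, "value", units="g/mL", uncertainty="0.0001")
v.text = "0.9970"
ET.SubElement(m, "provenance", citation="primary reference for this measurement")
print(ET.tostring(m, encoding="unicode"))
```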

jchodera commented 10 years ago

Weights are entirely divorced from the data, and reflect your own decisions about how to balance different classes of data or data sources. Those don't belong in a file describing experimental data.

jchodera commented 10 years ago

Also, True/False switches for QM corrections, and in fact anything related to the model you use for the experimental observables, should be separate.

You probably really want several distinct sets of files:

  • Experimental data, uncertainties, provenance, observable classes
  • Observable models and parameters
  • Weights and other things that control how you do the optimization, including priors
By separating things this way, you can mix and match different schemes and observable models. If we later use a Bayesian optimization scheme, we just swap out the part that controls how the optimization is carried out.
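A hypothetical layout of that separation (all file and directory names invented for illustration):

```
targets/my_target/expt_data.csv    # measurements, uncertainties, provenance
targets/my_target/obs_model.txt    # observable model choices and parameters
optimize.in                        # weights, priors, optimization controls
```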

kyleabeauchamp commented 10 years ago
I agree that a huge issue with flat text approaches is the curation of metadata.

I agree that XML is nice, but I do think there is value in allowing people to edit data in non-XML form, e.g. with Excel. I think this helps eliminate errors during data entry. Also for convenience.

Regarding units in text-based formats: it might be possible to create an automated string-based system of column names that encode units.

jchodera commented 10 years ago

I agree that XML is nice, but I do think there is value in allowing people to edit data in non-XML form, e.g. with Excel. I think this helps eliminate errors during data entry. Also for convenience.

I think this was exactly the reasoning behind ISATAB.

Regarding units in text-based formats: it might be possible to create an automated string-based system of column names that encode units.

You could still have a tool that goes back and forth between xml and text formats, or allows you to generate text summaries to check against the original data.

Really, some sort of validation scheme is required to guard against data entry errors. Just having a text file or Excel format doesn't do that.

leeping commented 10 years ago

Hi John,

You probably really want several distinct sets of files:

  • Experimental data, uncertainties, provenance, observable classes
  • Observable models and parameters
  • Weights and other things that control how you do the optimization, including priors

We need to have formal correctness without sacrificing the ability to keep things simple. I think an acceptable solution is to support multiple files but still allow single files. This can be done by specifying one or multiple "source data" file names in the ForceBalance input file.

I need to think more about observable models and parameters. Many observables I've looked at are "measured" directly from the simulation. The QM corrections are simply an extra column that corresponds to each set of experimental conditions, while others (such as the Karplus relation) require a model which is coded into ForceBalance or MDTraj, and have adjustable parameters. This needs to be accounted for in such a way that it doesn't encumber users who don't use these models.

One possible solution is to have the choice of observable model in the ForceBalance input file, and store model parameters in the target folder. We may hard-code default values into ForceBalance as I think some of these models are standardized.
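As one concrete example of a standardized observable model with adjustable parameters, here is a minimal sketch of a Karplus-type relation (the default coefficients below are placeholders for illustration, not an endorsed parameter set):

```python
import numpy as np

def karplus_coupling(phi, A=6.5, B=-1.8, C=1.6):
    """3J scalar coupling (Hz) from a dihedral angle phi (radians), via the
    Karplus relation J = A*cos^2(phi) + B*cos(phi) + C. The default A, B, C
    values are illustrative placeholders only."""
    return A * np.cos(phi) ** 2 + B * np.cos(phi) + C
```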

I think metadata, especially provenance data, is absolutely essential. It's critical that the source of the data is traceable.

Definitely agree, we can use the comments for this. The comments may contain the primary citation, which describes how the experiments were done.

Do you envision having a separate data file for each source, or could one data file have data from multiple sources?

In the simple optimization runs, we have a single file that contains all of the data. For more complicated runs, ForceBalance can read from multiple files, and the main data file can reference other files.

I think your best approach would be to have an XML file where you can easily specify sparse measurements of the same property potentially culled from multiple data sources, with as much additional metadata as possible about the conditions, units, error, and provenance in a computer-readable format.

We can work on XML support as a fourth proposed format, but I will support the three formats above (.csv, tab-delimited, fixed width) first, as they are the most relevant for currently running projects. The XML format could be richer in provenance information than the others, but it won't make a difference for the optimization.

Weights are entirely divorced from the data, and reflect your own decisions about how to balance different classes of data or data sources. Those don't belong in a file describing experimental data.

Yes, we should support reading from two separate files, one with the data itself and one with corresponding rows for the weights (row indices are needed, as they are global for the whole Target).

how do you specify units?

This was described at the top. I don't want full unit names in the table because they would make the columns unnecessarily wide, so we can keep abbreviated units in the table (like kJ/mol) and map them to SimTK quantities (kilojoule_per_mole).
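A minimal sketch of such a mapping, assuming the simtk.unit package (the dictionary contents are illustrative, not a finalized table):

```python
import simtk.unit as u

# Map abbreviated table units to full SimTK quantities (illustrative).
UNIT_ABBREVIATIONS = {
    "kJ/mol":   u.kilojoule_per_mole,
    "kcal/mol": u.kilocalorie_per_mole,
    "K":        u.kelvin,
    "atm":      u.atmosphere,
    "g/mL":     u.gram / u.milliliter,
}

def attach_units(value, abbrev):
    """Attach the full unit to a bare number read from the table."""
    return value * UNIT_ABBREVIATIONS[abbrev]

print(attach_units(0.9970, "g/mL"))  # a simtk Quantity with units g/mL
```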

jchodera commented 10 years ago

I need to think more about observable models and parameters. Many observables I've looked at are "measured" directly from the simulation. The QM corrections are simply an extra column that corresponds to each set of experimental conditions, while others (such as the Karplus relation) require a model which is coded into ForceBalance or MDTraj, and have adjustable parameters. This needs to be accounted for in such a way that it doesn't encumber users who don't use these models.

An "experimental observable model" specifies everything needed to produce a number that can be compared directly with compiled experimental measurements for a given set of conditions. This would include a mechanical observable A(x) to average over configurations x or properties that are derived from free energy differences between multiple thermodynamic states, any QM corrections that are required, and the like. One might choose between several different models for the same experimentally measured property, but this shouldn't change the value of the experimental property---that's why we would want to separate this model from the data.

QM-derived quantities that are not experimental observables are something else entirely. For example, you probably have lots of water dimer geometries with different associated QM-computed energies. Specifying these requires something like mol2 files for the individual components, plus information on the geometries and the QM energies to be compared with MM energies. Torsion drives require something similar, but for a single molecule.

jchodera commented 10 years ago

Definitely agree, we can use the comments for this. The comments may contain the primary citation, which describes how the experiments were done.

Have you seen my slides from the CADD GRC on the DDT report?

If you haven't seen it, this report is worth looking through: USGS Water-Resources Investigations Report 01-4201, 2001. http://pubs.usgs.gov/wri/wri014201/pdf/wri01-4201.pdf

Be sure to read the abstract and look through the figures.

Since it's so prevalent and incredibly toxic, the aqueous solubility of DDT is one of the more important experimental measurements that humans have conducted, and yet, see how hard it is to actually find sensible literature values?

leeping commented 10 years ago

An "experimental observable model" specifies everything needed to produce a number that can be compared directly with compiled experimental measurements for a given set of conditions. This would include a mechanical observable A(x) to average over configurations x or properties that are derived from free energy differences between multiple thermodynamic states, any QM corrections that are required, and the like. One might choose between several different models for the same experimentally measured property, but this shouldn't change the value of the experimental property---that's why we would want to separate this model from the data.

That's fair. The model cannot be contained entirely in these files because it includes the code to compute the observable from the simulation results. Also, for many observables the user doesn't really need to consider the model because it is so straightforward (e.g. the density).

We can separate the model parameters from the data by having optional multiple files, and the user may choose which model to use in the ForceBalance input file, but there should exist default models and parameters (in the code) that are invoked when the user doesn't specify anything.

Have you seen my slides from the CADD GRC on the DDT report? If you haven't seen it, this report is worth looking through: USGS Water-Resources Investigations Report 01-4201, 2001. http://pubs.usgs.gov/wri/wri014201/pdf/wri01-4201.pdf

I just read it. The differences between experimental values over the years are astounding, as is the number of multi-level references. It means the later authors didn't examine their references carefully enough. I agree we should choose our sources carefully and document them in the table, so at least it's possible for someone to double-check later on.

leeping commented 10 years ago

QM data is somewhat of a separate topic. I have been storing QM data using multiple files - one file that contains a sequence of structures (using standardized formats) and another space-delimited plain text file called qdata.txt containing energies, forces, interactions, etc. These files tend to be much larger (e.g. 10,000 structures).

However, this cannot encompass all types of QM data (e.g. binding energies, which are a function of multiple structures, or vibrational frequencies / eigenvalues). I have some specific formats for these target types, with examples and accompanying documentation. They have served their purpose so far.

I think the formats for storing QM data could be improved following what we learn in specifying experimental data, but for now we can keep it as is. The qdata.txt format is from my grad school days - it has some deficiencies but it is also quite simple, which I think is a good thing.

ebran commented 10 years ago

Perhaps it's best to stick to simplicity and allow it (metadata inside the table), then.