leeping / forcebalance

Systematic force field optimization.

Experimental Data File Specification #59

Open leeping opened 10 years ago

leeping commented 10 years ago

Starting a new topic because this focuses on how the experimental data files are stored on disk (these files are created / edited by the user), rather than how they are parsed or stored in memory. Note that we can't easily go the other way (i.e. convert from the internal representation to data tables on disk), but that's presently not an intended use case.

If you are thinking of a nontrivial data type, please mention it here and we can discuss how it can fit into this framework.

Specification for experimental data file format in ForceBalance (updated 4/1/2014)

Context and Purpose: In ForceBalance, I currently have a fairly rigid format for storing liquid and lipid bilayer properties (the Liquid and Lipid targets), and a simple format for specifying general thermodynamic properties (the Thermo target, contributed by @ebran).

The following spec attempts to encompass more types of physical / chemical experimental data for force field parameterization, so that we can incorporate a larger number of properties into the Thermo target. The format is intended to be highly flexible for different data types without compromising on simplicity.

File Formats: Three formats will be available: comma-separated (.csv), tab-delimited, and fixed width.
leeping commented 10 years ago

@ebran, @jchodera and @kyleabeauchamp are especially asked to chime in. :)

leeping commented 10 years ago

Well, one experiment that wouldn't fit into this framework is a 2DIR spectrum since it's not a single column but rather an entire table. I probably won't be fitting 2DIR data in the near future, but is there anything else similar to this?

ebran commented 10 years ago

Hi Lee-Ping,

I like the overall structure of the suggested format of the experimental data file. Most of the things we discuss are covered. A couple of comments:

Best, Erik

jchodera commented 10 years ago

Have you seen standard experimental data file specifications, like ISATAB?

Also, many of our instruments generate XML formats natively, but I suppose there will always be a need for an intermediate processing step.

leeping commented 10 years ago

Erik: Thanks. Regarding metadata: It has to go somewhere. Are you suggesting this can go in the ForceBalance input file or in a separate file in targets? The experimental uncertainties are a property of the data so I would prefer to keep it in the same file if possible. The quantum corrections are more a property of the force field, so we can move those to the input file.

Regarding column width: The fixed width formatting helps when there is missing data, or if we want to include array data in the same table. Otherwise the data could be read into the wrong column.


Regarding ensembles: The only potential case for ambiguity in the near future is running NVT simulations and measuring the pressure tensor. Constant chemical potential simulations are still a ways off. The volume / density cannot be specified in the data table anyway, so it's really just the pressure. I think we can introduce the observable name pressure_obs and this won't add any layers of complexity.

John: I should have known there's a more general format out there. :) But I looked at the ISATAB example files and it looks like most of the measurements are biological, and somewhat different from the "fittable" properties that we'd like to include.

Our measurements are more physical / chemical in nature and I think they're a lot simpler than many biological experiments. By specifying our own format, one advantage is that we can have it "make sense" for ourselves and still keep it simple.

Now that I've looked at their tables, it's clear our format cannot encompass all experiments; thus I should probably just restrict the "purpose" statement above.

The XML format is great for computers to read and write, but not as easy for people. Since the simplest applications involve typing in a few entries by hand, I agree an intermediate processing stage for XML would be good (we could bundle the converters in bin or extras if desired).

leeping commented 10 years ago

More notes regarding fixed width: It shouldn't be a hard-coded fixed width like the PDB. We should determine the column width from the header line.

I think both left-justified and right-justified text are okay since the parser could figure this out, but right-justified text is more natural for the parser (the end of each word in the header line determines the column width).
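To make this concrete, here is a minimal sketch of header-driven fixed-width parsing under the right-justified convention; the column names, sample data, and function name are invented for illustration and are not part of any ForceBalance format:

```python
# Toy sketch: the end of each word in the header line determines the right
# edge of its (right-justified) column; blank fields become None (missing).
def parse_fixed_width(lines):
    header = lines[0]
    edges, pos = [], 0
    for word in header.split():
        pos = header.index(word, pos) + len(word)
        edges.append(pos)                  # right edge of each column
    names = header.split()
    rows = []
    for line in lines[1:]:
        start, fields = 0, []
        for end in edges:
            fields.append(line[start:end].strip() or None)
            start = end
        rows.append(dict(zip(names, fields)))
    return rows

table = [
    "     T(K)   P(atm)  density",
    "   298.15      1.0   0.9970",
    "   313.15            0.9922",  # missing pressure stays in its column
]
for row in parse_fixed_width(table):
    print(row)
# {'T(K)': '298.15', 'P(atm)': '1.0', 'density': '0.9970'}
# {'T(K)': '313.15', 'P(atm)': None, 'density': '0.9922'}
```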

ebran commented 10 years ago

I agree with that. Clarification on metadata: I meant that I don't think metadata should be allowed inside a table, only before or after it.

-- Erik

kyleabeauchamp commented 10 years ago

IMHO tab- or whitespace-delimited files are a good compromise between easy parsing and easy reading. CSV files are nearly foolproof against formatting errors, but are a bit harder to read.

Avoid anything that looks like PDB, as PDB is complete garbage.

leeping commented 10 years ago

Erik: My comment was more from a parsing perspective; if we explicitly disallow metadata inside the table, the parser needs to know "am I before the table, inside the table, or finished reading the table?" Otherwise, the parser only needs to know "have I read the header yet?", which is simpler. We can explicitly disallow it, though.
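For illustration, here is a toy sketch of the simpler single-flag parser (assuming, as an invented convention, that metadata/comment lines start with #):

```python
# Toy sketch: metadata lines are allowed anywhere, so the only state the
# parser tracks is whether the header has been read yet.
def parse(lines):
    header, rows = None, []
    for line in lines:
        if line.lstrip().startswith("#"):  # metadata/comment, allowed anywhere
            continue
        if header is None:                 # first non-comment line is the header
            header = line.split()
        else:
            rows.append(line.split())
    return header, rows
```

Disallowing metadata inside the table would mean additionally tracking whether the table has ended (e.g. at a blank line).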

Kyle: Okay. Since my proposed fixed width format is different from the PDB, I am assuming you're okay with it. :)

kyleabeauchamp commented 10 years ago

yes

jchodera commented 10 years ago

I think metadata, especially provenance data, is absolutely essential. It's critical that the source of the data is traceable.

Do you envision having a separate data file for each source, or could one data file have data from multiple sources?

I think tab-, CSV-, or whitespace-delimited files are garbage. They are too inflexible to represent the fact that the data may be pretty sparse over the conditions you are interested in, they are not really machine-readable in any complex way (e.g. how do you specify units?!), and they barely qualify as human-readable.

jchodera commented 10 years ago

I think your best approach would be to have an XML file where you can easily specify sparse measurements of the same property potentially culled from multiple data sources, with as much additional metadata as possible about the conditions, units, error, and provenance in a computer-readable format.
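As a rough sketch of what one such record might look like (all tag and attribute names here are invented for illustration, not an agreed format), using Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record for one sparse measurement.
m = ET.Element("measurement", observable="density")
ET.SubElement(m, "conditions", temperature="298.15 K", pressure="1.0 atm")
v = ET.SubElement(m, "value", units="g/mL", uncertainty="0.0001")
v.text = "0.9970"
ET.SubElement(m, "provenance", citation="primary reference for this measurement")
print(ET.tostring(m, encoding="unicode"))
```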

jchodera commented 10 years ago

Weights are entirely divorced from the data, and reflect your own decisions about how to balance different classes of data or data sources. Those don't belong in a file describing experimental data.

jchodera commented 10 years ago

Also, True/False switches for QM corrections, and in fact anything related to the model you use for the experimental observables, should be separate.

You probably really want several distinct sets of files:

  • Experimental data, uncertainties, provenance, observable classes
  • Observable models and parameters
  • Weights and other things that control how you do the optimization, including priors
By separating things this way, you can mix and match different schemes and observable models. If we later use a Bayesian optimization scheme, we just swap out the part that controls how the optimization is carried out.
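A hypothetical layout of that separation (all file and directory names invented for illustration):

```
targets/my_target/expt_data.csv    # measurements, uncertainties, provenance
targets/my_target/obs_model.txt    # observable model choices and parameters
optimize.in                        # weights, priors, optimization controls
```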

kyleabeauchamp commented 10 years ago
I agree that a huge issue with flat text approaches is the curation of metadata.

I agree that XML is nice, but I do think there is value in allowing people to edit data in non-XML form, e.g. with Excel. I think this helps eliminate errors during data entry. Also for convenience.

Regarding units in text-based formats: it might be possible to create an automated string-based system of column names that encode units.

jchodera commented 10 years ago

I agree that XML is nice, but I do think there is value in allowing people to edit data in non-XML form, e.g. with Excel. I think this helps eliminate errors during data entry. Also for convenience.

I think this was exactly the reasoning behind ISATAB.

Regarding units in text-based formats: it might be possible to create an automated string-based system of column names that encode units.

You could still have a tool that goes back and forth between xml and text formats, or allows you to generate text summaries to check against the original data.

Really, some sort of validation scheme is required to guard against data entry errors. Just having a text file or Excel format doesn't do that.

leeping commented 10 years ago

Hi John,

You probably really want several distinct sets of files:

  • Experimental data, uncertainties, provenance, observable classes
  • Observable models and parameters
  • Weights and other things that control how you do the optimization, including priors

We need to have formal correctness without sacrificing the ability to keep things simple. I think an acceptable solution is to support multiple files but still allow single files. This can be done by specifying one or multiple "source data" file names in the ForceBalance input file.

I need to think more about observable models and parameters. Many observables I've looked at are "measured" directly from the simulation. The QM corrections are simply an extra column that corresponds to each set of experimental conditions, while others (such as the Karplus relation) require a model which is coded into ForceBalance or MDTraj, and have adjustable parameters. This needs to be accounted for in such a way that it doesn't encumber users who don't use these models.

One possible solution is to have the choice of observable model in the ForceBalance input file, and store model parameters in the target folder. We may hard-code default values into ForceBalance as I think some of these models are standardized.
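As one concrete example of a standardized observable model with adjustable parameters, here is a minimal sketch of a Karplus-type relation (the default coefficients below are placeholders for illustration, not an endorsed parameter set):

```python
import numpy as np

def karplus_coupling(phi, A=6.5, B=-1.8, C=1.6):
    """3J scalar coupling (Hz) from a dihedral angle phi (radians), via the
    Karplus relation J = A*cos^2(phi) + B*cos(phi) + C. The default A, B, C
    values are illustrative placeholders only."""
    return A * np.cos(phi) ** 2 + B * np.cos(phi) + C
```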

I think metadata, especially provenance data, is absolutely essential. It's critical that the source of the data is traceable.

Definitely agree, we can use the comments for this. The comments may contain the primary citation, which describes how the experiments were done.

Do you envision having a separate data file for each source, or could one data file have data from multiple sources?

In the simple optimization runs, we have a single file that contains all of the data. For more complicated runs, ForceBalance can read from multiple files, and the main data file can reference other files.

I think your best approach would be to have an XML file where you can easily specify sparse measurements of the same property potentially culled from multiple data sources, with as much additional metadata as possible about the conditions, units, error, and provenance in a computer-readable format.

We can work on XML support as a fourth proposed format, but I will support the three formats above (.csv, tab-delimited, fixed width) first, as they are the most relevant for currently running projects. The XML format could be richer in provenance information than the others, but it won't make a difference for the optimization.

Weights are entirely divorced from the data, and reflect your own decisions about how to balance different classes of data or data sources. Those don't belong in a file describing experimental data.

Yes, we should support reading from two separate files, one with the data itself and one with corresponding rows for the weights (row indices are needed, as they are global for the whole Target).

how do you specify units?

This was described at the top. I don't want full unit names in the table because they would make the columns unnecessarily wide, so we can keep abbreviated units in the table (like kJ/mol) and map them to SimTK quantities (kilojoule_per_mole).
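A minimal sketch of such a mapping, assuming the simtk.unit package (the dictionary contents are illustrative, not a finalized table):

```python
import simtk.unit as u

# Map abbreviated table units to full SimTK quantities (illustrative).
UNIT_ABBREVIATIONS = {
    "kJ/mol":   u.kilojoule_per_mole,
    "kcal/mol": u.kilocalorie_per_mole,
    "K":        u.kelvin,
    "atm":      u.atmosphere,
    "g/mL":     u.gram / u.milliliter,
}

def attach_units(value, abbrev):
    """Attach the full unit to a bare number read from the table."""
    return value * UNIT_ABBREVIATIONS[abbrev]

print(attach_units(0.9970, "g/mL"))  # a simtk Quantity with units g/mL
```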

jchodera commented 10 years ago

I need to think more about observable models and parameters. Many observables I've looked at are "measured" directly from the simulation. The QM corrections are simply an extra column that corresponds to each set of experimental conditions, while others (such as the Karplus relation) require a model which is coded into ForceBalance or MDTraj, and have adjustable parameters. This needs to be accounted for in such a way that it doesn't encumber users who don't use these models.

An "experimental observable model" specifies everything needed to produce a number that can be compared directly with compiled experimental measurements for a given set of conditions. This would include a mechanical observable A(x) to average over configurations x or properties that are derived from free energy differences between multiple thermodynamic states, any QM corrections that are required, and the like. One might choose between several different models for the same experimentally measured property, but this shouldn't change the value of the experimental property---that's why we would want to separate this model from the data.

QM-derived quantities that are not experimental observables are something else entirely. For example, you probably have lots of water dimer geometries with different associated QM-computed energies. Specifying these requires something like mol2 files for the individual components, plus information on the geometries and the QM energies to be compared with MM energies. Torsion drives require something similar, but for a single molecule.

jchodera commented 10 years ago

Definitely agree, we can use the comments for this. The comments may contain the primary citation, which describes how the experiments were done.

Have you seen my slides from the CADD GRC on the DDT report?

If you haven't seen it, this report is worth looking through: USGS Water-Resources Investigations Report 01-4201, 2001. http://pubs.usgs.gov/wri/wri014201/pdf/wri01-4201.pdf

Be sure to read the abstract and look through the figures.

Since it's so prevalent and incredibly toxic, the aqueous solubility of DDT is one of the more important experimental measurements that humans have conducted, and yet, see how hard it is to actually find sensible literature values?

leeping commented 10 years ago

An "experimental observable model" specifies everything needed to produce a number that can be compared directly with compiled experimental measurements for a given set of conditions. This would include a mechanical observable A(x) to average over configurations x or properties that are derived from free energy differences between multiple thermodynamic states, any QM corrections that are required, and the like. One might choose between several different models for the same experimentally measured property, but this shouldn't change the value of the experimental property---that's why we would want to separate this model from the data.

That's fair. The model cannot be contained entirely in these files because it includes the code to compute the observable from the simulation results. Also, for many observables the user doesn't really need to consider the model because it is so straightforward (e.g. the density).

We can separate the model parameters from the data by having optional multiple files, and the user may choose which model to use in the ForceBalance input file, but there should exist default models and parameters (in the code) that are invoked when the user doesn't specify anything.

Have you seen my slides from the CADD GRC on the DDT report? If you haven't seen it, this report is worth looking through: USGS Water-Resources Investigations Report 01-4201, 2001. http://pubs.usgs.gov/wri/wri014201/pdf/wri01-4201.pdf

I just read it. The differences between experimental values over the years are astounding, as is the number of multi-level references. It means the later authors didn't examine their references carefully enough. I agree we should choose our sources carefully and document them in the table, so at least it's possible for someone to double-check later on.

leeping commented 10 years ago

QM data is somewhat of a separate topic. I have been storing QM data using multiple files - one file that contains a sequence of structures (using standardized formats) and another space-delimited plain text file called qdata.txt containing energies, forces, interactions, etc. These files tend to be much larger (e.g. 10,000 structures).

However, this cannot encompass all types of QM data (e.g. binding energies, which are a function of multiple structures, or vibrational frequencies / eigenvalues). I have some specific formats for these target types, with examples and accompanying documentation. They have served their purpose so far.

I think the formats for storing QM data could be improved following what we learn in specifying experimental data, but for now we can keep it as is. The qdata.txt format is from my grad school days - it has some deficiencies but it is also quite simple, which I think is a good thing.

ebran commented 10 years ago

Perhaps it's best to stick to simplicity and allow it (metadata inside the table), then.