peastman / MolecularFileFormatSpecs

A repository of specs for file formats used in molecular modelling

How to organize the repository #1

Open peastman opened 4 years ago

peastman commented 4 years ago

Here's an initial proposal on how we can organize this repository.

As an example, OpenMM's forcefield format is described in the manual. So I'll create an "OpenMM" directory containing a PDF of the most recent manual. The accompanying README will give the URL it was downloaded from and reference the sections that describe the format.

Thoughts?

bdice commented 4 years ago

I can work with @joaander to include specifications and design docs for GSD (most commonly used with the HOOMD-blue simulation engine). The GSD docs already have much of the requested info, so we may cross-reference that and fill in whatever information is missing: https://gsd.readthedocs.io

j-wags commented 4 years ago

I think that @peastman's suggestion is a great start, since it will give us authoritative references to resolve conflicts/uncertainty.

I think we'll ultimately want to distill the information from the spec docs down into a table, or set of keywords, describing the "information content" of each file. A quick example might be:

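| Format | Elements | Atom types | Coordinates | Bond orders | Harmonic bond parameters | Nonharmonic bond parameters | Atom formal charges | Atom partial charges |
|---|---|---|---|---|---|---|---|---|
| SMILES | Y | N | N | Y | N | N | Y | N |
| AMBER Prmtop | N | Y | N | N | Y | N | | |
| OpenMM XML | N | Y | Y | N | Y | Y | N | Y |
| SYBYL/Corina mol2 | N | Y | Y | Y | N | N | N | Y |
| TRIPOS mol2 | Y | N | Y | Y | N | N | ? | Y |
| ... | | | | | | | | |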

(the above is probably incorrect, I just quickly jotted some names and categories down)

So, I'd propose that each format could be defined by a set of keywords that it must or may contain. As we look to include more formats/details, it's likely that we'll find that our current keywords aren't fully descriptive, and we'll want to add, split, or merge some. So, we could consider each set of keywords to be one version of a specification, and have rules for automatically updating from an old specification to a new one (which may include completely automated transformation, or identifying cases requiring human review).
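To make the idea concrete, here is a minimal Python sketch of a versioned keyword set with one migration rule. Everything in it is an assumption for illustration: the `FormatSpec` class, the `migrate_v1_to_v2` function, the keyword names, and the rule that splits a hypothetical `bond_parameters` keyword into harmonic and nonharmonic variants. A "must" keyword that gets split cannot be resolved automatically, so the sketch demotes both replacements to "may" and reports the keyword for human review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatSpec:
    """Keywords a format must or may contain, under one keyword-set version."""
    name: str
    keyword_version: int
    must: frozenset = frozenset()
    may: frozenset = frozenset()

# Hypothetical v1 -> v2 rule: one old keyword is split into two new ones.
SPLIT_RULES = {
    "bond_parameters": ("harmonic_bond_parameters", "nonharmonic_bond_parameters"),
}

def migrate_v1_to_v2(spec):
    """Return (migrated_spec, keywords_needing_human_review)."""
    if spec.keyword_version != 1:
        raise ValueError("expected a version 1 spec")
    must, may, review = set(), set(), []
    for kw in sorted(spec.must):
        if kw in SPLIT_RULES:
            # We cannot tell which replacement is actually required,
            # so both become optional and a human has to decide.
            may.update(SPLIT_RULES[kw])
            review.append(kw)
        else:
            must.add(kw)
    for kw in spec.may:
        # Optional keywords can be split fully automatically.
        may.update(SPLIT_RULES.get(kw, (kw,)))
    return FormatSpec(spec.name, 2, frozenset(must), frozenset(may)), review
```

For example, a made-up v1 spec with `must={"atom_types", "bond_parameters"}` would come out as a v2 spec that still requires `atom_types`, lists both bond parameter keywords under `may`, and returns `["bond_parameters"]` as the review list.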

peastman commented 4 years ago

Good idea. We also might group them into a few broad categories:

Of course some formats can store more than one of those. A PDB includes both chemical information and conformations, and it can be used to store trajectories (although it's not a very good format for that).
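Since one format can fall into several categories, the categories probably want to be combinable flags rather than a single label. A rough Python sketch, assuming three category names taken only from the PDB example above (`Content` and `stores` are made-up names, and the real category list may well differ):

```python
from enum import Flag, auto

class Content(Flag):
    """Hypothetical broad categories; names come from the PDB example only."""
    CHEMICAL_INFORMATION = auto()
    CONFORMATIONS = auto()
    TRAJECTORIES = auto()

# A PDB file can carry all three kinds of content.
PDB = Content.CHEMICAL_INFORMATION | Content.CONFORMATIONS | Content.TRAJECTORIES

# Assumption: a pure trajectory format such as DCD carries only one.
DCD = Content.TRAJECTORIES

def stores(fmt, category):
    """True if the format's flag set includes the given category."""
    return category in fmt
```

With this, `stores(PDB, Content.TRAJECTORIES)` is true while `stores(DCD, Content.CHEMICAL_INFORMATION)` is false.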

peastman commented 4 years ago

Let's come up with a list of formats we want to document. Here's a start.

Standard (not application specific) structure formats:

PDB, PDBx/mmCIF, MOL2, SDF

Trajectory formats:

DCD, XTC, NetCDF, TRR, BINPOS, DTR, XYZ

Application specific MD input and output formats:

CHARMM/NAMD, Amber, Gromacs, OpenMM, LAMMPS, HOOMD-blue, Desmond

Formats used by QC codes:

Not my field! Can someone else fill this in?