rcsb / mmtf

The specification of the MMTF format for biological structures
http://mmtf.rcsb.org/

Remark/comments field #32

Closed danpf closed 6 years ago

danpf commented 6 years ago

When working on modeling/prediction/design problems, I know a lot of people add comments/remarks about various things to their PDB files. For structures from the PDB itself, I think it would be best if this field were always empty.

Possible use cases:

It would be very useful to add a field dedicated to this, probably named extras or comments, and it would just be a string field.

The alternative is just to use title or structureId for this kind of stuff, since in most modeling they don't exist. I'm not against that either, but the spec documentation should note which one applications should use so it's standardized. ~Dan

gtauriello commented 6 years ago

For many use cases I agree with the possibility of a generic string field. Sounds light-weight and generic enough.

For quantities attached to residues and atoms on the other hand (e.g. model quality numbers), it might be nicer to have a standardized way to attach a list of numbers into the mmtf file so that any viewer could color the structure according to one of those quantities...

danpf commented 6 years ago

That would be nice too...

I guess 3 quick ideas:

  1. Pack as raw-string-json. let application handle json parsing
  2. Pack as dictionary of strings. let application handle going from string to int/double
  3. Pack via msgpack, let user handle msgpack obj decoding.

Option 3 gets a little complicated with statically typed languages, but is probably the better option.
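To make the trade-off concrete, here is a minimal sketch of options 1 and 2 using only the Python standard library. The `extra` payload and its keys are made up for illustration, and the comma-joined string in option 2 is just one possible encoding, not something proposed in this thread. Option 3 is not shown because it needs a third-party msgpack library.

```python
import json

# Hypothetical annotation an application might want to store.
extra = {"score": 1.25, "chargeList": [0.1, -0.4, 0.7]}

# Option 1: one raw JSON string; the reading application parses it.
packed_json = json.dumps(extra)
restored_json = json.loads(packed_json)

# Option 2: dictionary of strings; the reader converts each value itself.
# The comma-separated list encoding is an assumption of this sketch.
packed_strings = {
    "score": str(extra["score"]),
    "chargeList": ",".join(str(c) for c in extra["chargeList"]),
}
restored_strings = {
    "score": float(packed_strings["score"]),
    "chargeList": [float(c) for c in packed_strings["chargeList"].split(",")],
}
```

With option 2 the reader has to know the intended type of every key in advance, which is the main reason a natively typed container such as msgpack looks more attractive.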

Some keys could be standardized keys like color or atom_color or residue_color for molecular viewers? should probably ask a few mol-viewer people their thoughts on that.

speleo3 commented 6 years ago

+1 for option 3 +1 for standardized keys like atomColorList - also chargeList (or partialChargeList) and radiusList to replace formats like PQR

A convention for non-standard keys would also be useful, this could prevent name clashes with future standard keys. E.g. if standard keys never use underscores, then an <appname>_ or <organization>_ prefix for custom keys could never lead to a naming conflict.
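A rough sketch of how a reader could apply that convention, assuming standard keys never contain an underscore and custom keys have the form `<prefix>_<name>`. The function name `split_custom_key` is hypothetical and not part of any MMTF library.

```python
def split_custom_key(key):
    """Return (prefix, name) for a custom key, or (None, key) for a standard one.

    Assumes the convention suggested above: standard keys contain no
    underscore, custom keys look like "<appname>_<name>".
    """
    prefix, sep, name = key.partition("_")
    if not sep:
        # No underscore at all: treat it as a standard key.
        return None, key
    if not prefix or not name:
        raise ValueError(f"malformed custom key: {key!r}")
    return prefix, name
```

Only the first underscore is treated as the separator, so a custom name may itself contain underscores without becoming ambiguous.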

speleo3 commented 6 years ago

speaking of custom keys: PyMOL 2.1 exports MMTF files with two custom keys: pymolRepsList (encoded with strategy type 7) and pymolColorList (plain msgpack array).

danpf commented 6 years ago

speaking of custom keys: PyMOL 2.1 exports MMTF files with two custom keys: pymolRepsList (encoded with strategy type 7) and pymolColorList (plain msgpack array).

Perfect, now I know someone else would use this :p

A convention for non-standard keys would also be useful, this could prevent name clashes with future standard keys. E.g. if standard keys never use underscores, then an <appname>_ or <organization>_ prefix for custom keys could never lead to a naming conflict.

I guess the only thing to watch out for is that we might end up with pymolColorList and chimeraColorList and nglColorList... But I think pymol::ColorList or pymol::color_list would be best if we were to standardize it. PyMOL people love their underscores, and I'd feel bad taking them away hah

danpf commented 6 years ago

@arose @pwrose

This is sort of a more formal proposal for a comments field:

It seems that other developers and I are eager to add application-specific information to our MMTF files, so having this become part of the standard would be very helpful and would save a lot of rewriting if it eventually does become part of the standard.

Does anyone have any objections to this sort of implementation? The alternative, as @speleo3 mentioned above, is to pack any extraData directly into the base dictionary of the packed MMTF file.

An example implementation for c++ is available at https://github.com/rcsb/mmtf-cpp/pull/15


extraData

This is a field to store any extra MMTF-associated data. It is packed as a msgpack object and can therefore contain anything; it is up to you (the developer) how you store, pack, and read the data. It is roughly the equivalent of the PDB REMARK lines.

However, we recommend that you use the format MAP< string, msgpack object >. This allows standardized reading between applications and is easily understandable and extensible across languages.

We do request that, when using the MAP format described above, you adhere to the following standardized key/value pairs:

key            | value description                          | encoding
groupColorList | list[hex code strings (len of numGroups)]  | None
atomColorList  | list[hex code strings (len of numAtoms)]   | None
etc            | etc                                        | etc

more to be decided?

pwrose commented 6 years ago

Regarding the key, did you imply a convention regarding the prefix, e.g.,

structureKey (len of 1)
modelKey (len of numModels)
chainKey (len of numChains)
groupKey (len of numGroups)
atomKey (len of numAtoms)
bondKey (len of numBonds)

danpf commented 6 years ago

I wasn't really meaning to, but we could if other people like that! definitely makes sense to me!

pwrose commented 6 years ago

How about an explicit convention by specifying data (or properties?) for structure, model, chain, group, atom, and bond-level information that must have a matching number of records.

Data (properties) that don't fit into the categories above, would go into extraProperties.

-Peter


gtauriello commented 6 years ago

A "best practice" naming convention sounds reasonable.

@pwrose do you mean that each of those "...Properties" fields would itself contain a msgpack-map with key, value pairs? Doesn't sound too bad actually. Would make it very easy to have generic parsers of it for visualizations or so (could even work in strongly-typed languages like C++). In that case though I would propose to get rid of "extraData" and have those "...Properties" as optional fields at the top-level of the MMTF hierarchy. Otherwise we introduce an extra level of complexity (also there is currently no case of optional fields outside of the top-level of the MMTF hierarchy).

speleo3 commented 6 years ago

@pwrose and @gtauriello - if I followed you correctly, example data could look like this:

data = {
  "mmtfVersion": "1.1",
  "numAtoms": 999,
  "numModels": 2,
  "numChains": 4,
  ...
  "xCoordList": [1.2, 3.4, ...],
  "yCoordList": [5.6, 7.8, ...],
  "zCoordList": [9.0, 1.2, ...],
  ...
  "structureProperties": {
    "foo_id": "ABC",
  },
  "modelProperties": {
    # lists have len numModels=2
    "foo_rmsdList": [0.5, 0.8],
    "foo_scoreList": [1.2, 3.4],
  },
  "chainProperties": {
    # lists have len numChains=4
    "foo_uniprotIdList": ["HBB_HUMAN", "HBA_HUMAN", "HBB_HUMAN", "HBA_HUMAN"],
    "foo_chainColorList": [0xFF0000, 0x00FF00, 0xFF0000, 0x00FF00],
  },
  "groupProperties": {
    # lists have len numGroups
    "stride_secStructList": [7, 7, 7, ...],
    "sst_secStructList": [7, 7, 7, ...],
  },
  "atomProperties": {
    # lists have len numAtoms=999
    "pymol_colorList": [1, 2, 3, ...],
    "pymol_repsList": [1, 1, 1, ...],
    "apbs_chargeList": [0.1, -0.4, 0.7, ...],
    "apbs_radiusList": [1.2, 1.8, 1.5, ...],
  },
  "bondProperties": {
    # lists have len numBonds
    "pymol_bondTypeList": [1, 1, 1, 4, 4, 4, 4, 4, 4, 1, ...],
  },
  "extraProperties": {
    "pymol_bondTypes": {0: "metal", 1: "single", 2: "double", 3: "triple", 4: "aromatic"}
  },
}
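One benefit of this layout is that a generic reader can check the "matching number of records" rule without knowing any individual key. The following is a minimal sketch of such a check, assuming the lists have already been decoded into plain Python lists. The function name `find_length_mismatches` and the `LEVEL_COUNTS` table are hypothetical; the count fields are the existing MMTF header fields.

```python
# Maps each proposed "...Properties" field to the count its lists must match.
# structureProperties and extraProperties carry no length constraint.
LEVEL_COUNTS = {
    "modelProperties": "numModels",
    "chainProperties": "numChains",
    "groupProperties": "numGroups",
    "atomProperties": "numAtoms",
    "bondProperties": "numBonds",
}


def find_length_mismatches(data):
    """Return (field, key, actual_len, expected_len) for every bad list."""
    problems = []
    for field, count_key in LEVEL_COUNTS.items():
        expected = data.get(count_key)
        for key, values in data.get(field, {}).items():
            # A missing count field is also reported as a mismatch.
            if expected is None or len(values) != expected:
                problems.append((field, key, len(values), expected))
    return problems


# Small made-up example: the chain-level list is one entry short.
example = {
    "numModels": 2,
    "numChains": 4,
    "modelProperties": {"foo_rmsdList": [0.5, 0.8]},
    "chainProperties": {"foo_chainColorList": [0xFF0000, 0x00FF00, 0xFF0000]},
}
mismatches = find_length_mismatches(example)
```

Here `mismatches` contains a single entry for `foo_chainColorList`, since it has 3 values while `numChains` is 4.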

pwrose commented 6 years ago

Yes, that's a good example of what I had in mind.


danpf commented 6 years ago

I like it! Regarding extraProperties: this is more of a concern for statically typed languages (like C++). I wrote extraData so that it didn't have to be a map<string, msgpack::object>, but could be anything (a simple list, a number, a custom serialized object, etc.). Do you think that's useless, and that extraProperties should just always be a map<string, msgpack::object>?

gtauriello commented 6 years ago

@danpf The entries contained in the map can still be generic msgpack objects. So it doesn't really simplify parsing in statically typed languages apart from being able to get the keys (which is good I guess). Either way a bit of structure might be good and it's not a big restriction to prescribe that we expect key (string) / value (any object) pairs for extra properties.
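As a small illustration of that restriction, a reader could require string keys while leaving every value opaque. This is only a sketch on already-decoded data, and the function name `check_extra_properties` is hypothetical.

```python
def check_extra_properties(extra):
    """Require a map with string keys; values may be any decoded object.

    Returns the sorted key names so callers can list what is available
    without having to interpret the values.
    """
    if not isinstance(extra, dict):
        raise TypeError("extraProperties should be a map")
    bad_keys = [k for k in extra if not isinstance(k, str)]
    if bad_keys:
        raise TypeError(f"non-string keys: {bad_keys!r}")
    return sorted(extra)
```

The nested values (for example the integer-keyed bond type table in the example above) are deliberately not inspected, which matches the point that only the top-level keys are prescribed.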

danpf commented 6 years ago

resolved by #36