gruenewald-lab / CGsmiles

Coarse-Grained Smiles (CGsmiles) for representing abitrarily complex molecules using a compact line notation
5 stars 2 forks source link

Annotations #25

Closed fgrunewald closed 1 day ago

fgrunewald commented 1 month ago

Martini CG mappings are a curious thing; More complex molecules frequently have atoms that are ignored in the mapping (i.e. zero weight) or reweighted. This PR implements a syntax for this. Optionally, nodes can be given weight using[node;weight], where weight may be a float. Hydrogens need not be given as we figure them out later, but canonically speaking, it would be better. Annotation of the weight may occur before or after the chirality operator. A default weight of 1 is given to all nodes. Hydrogen atoms unless explicit get the same weight as the node they are attached to.

For example, hexadecane but we map only the central two carbons: {[#C1]|4}.{#C1=[C;0]CC[C;0]}

pckroon commented 1 month ago

Don't hate it. We could also use the 'class' attribute for this (which already exists in smiles). The downsides of that are that it's only integers, and the default is 0; the good news here is that pysmiles doesn't set a class for atoms that don't specify one. I'm fine with using ;<float> instead as well, but I strongly suggest they come all the way at the end of an atom specification (and only there). As for implicit hydrogens, I see 2 options: either a) implicit hydrogens get the same weight as their parent; or b) the weight of the parent is split/distributed over the hydrogens. a) gets you more CoG-like behaviour, but CH3 suddenly has more weight (=4) than CH2 (=3), even though you didn't say that. b) means that CH3 (=0.25 per atom) and CH2 (=0.33 per atom) both have the same weight, but it's less CoG-like

I think a) is the better option, because CoG is a sane default. Needs to be documented though.

fgrunewald commented 3 weeks ago

@pckroon I prefer to strip them from the smiles string, because in that fashion CGSmiles and SMILES fragments are treated the same. The question is if one would want to allow general annotation for CGSmiles as well. Something like [#TC5;label]. I might add that in a seperate PR.

pckroon commented 2 weeks ago

[#TC5;w=0.5;name=a;attr=42] you mean? Kind of neat idea.

fgrunewald commented 2 weeks ago

@pckroon yes that's the idea pretty much. In Martini there are so many meta infos, which one might want to add. But again I think a seperate PR would be better

pckroon commented 2 weeks ago

Well, yes and no. You/we need to decide on the syntax here and now to see how the weights should get implemented: [C;w=0.4] or [C;0.4]. If we decide on the latter, there's the question how the 2 functions are going to interact

fgrunewald commented 2 weeks ago

ahh yes that is true of course

fgrunewald commented 2 weeks ago

let me sit on this until tomorrow

fgrunewald commented 2 weeks ago

@pckroon here my thoughts:

What I like:

What I don't like:

@ricalessandri what are your thoughts?

pckroon commented 2 weeks ago

Agnosticism is a big plus when programming, especially when it comes to syntax design (IMNSHO). It makes it more likely you get it right the first time.

another series of equal signs

You need some sort of separator between labels and values, so if you want bead annotations this is unavoidable

the annotations become more verbose

This is indeed a downside. You could argue that, since weights are going to be the most used, they can go without label. Second thought: treat them like args and kwargs: you can first put positional annotations, then keyword annotations. The application determines the semantics. Use https://docs.python.org/3/library/inspect.html#introspecting-callables-with-the-signature-object

it is not clear how attributes propagate to hydrogen atoms; although the sane default is copying the attribute I guess

Sounds reasonable. If you want to annotate hydrogens, make them explicit.

ricalessandri commented 2 weeks ago

My 2 cents: This seems neat and useful. About the verbosity concern - since weight=1 will likely be the implicit default, verbosity only matters when you need different weight values. And for weights (probably the most common case), [C;w=0.4] isn't much longer than [C;0.4]. The more explicit syntax actually makes the cgsmiles more human-readable, so I'd lean towards [C;w=0.4]. Though I won't oppose the idea of having the weights working (also) without the "w" label: so, if one omits the w=, 0.4 is still read as a weight.

fgrunewald commented 2 weeks ago

As decided by popular vote here it is:

We now allow general annotations of the type [keyword=value,keyword=value]. In addition to making the notation more concise, we have positional arguments defined by a dialect. These positional arguments also set default values for example charge: 0 and weight: 1. They also overtake the charge argument. Thus charged beads have to be written as [#PO4;1.0] or [#PO4;c=1.0].

For the CG part, we adopt the CGSmiles base dialect with charges and weights as positional arguments. Therefore, one can write [#PO4;1.0;0.5] corresponding to charge1 and weight of 0.5. For a neutral residue [#SP4;0;0.5] or [#SP4;w=0.5].

The notation also allows for example to write:

[#ION;1.0;type=Na;cg=1;resname=Na] if one wanted that.

We also support the annotations for the all-atom part, but we should not advertise it. At the moment one should only specify the weight. Charges should still be specified as [Na+] for example. In the long run, when pysmiles gets a minimal parser one can change that and align the annotations.

So bottom line:

New dialects can be created using the create_dialect function and making a partial(_parse_node, dialect_signature=YOUR_DIALECT). However, at the moment there is no sleek way to put this into the parsing pipline. @pckroon any thoughts where to inject it.