matteoferla / Fragmenstein

Merging, linking and placing compounds by stitching bound compounds together like a reanimated corpse
MIT License
181 stars 13 forks source link

custom MCS #23

Closed matteoferla closed 2 years ago

matteoferla commented 2 years ago

It has been suggested that it would be constructive to provide a precomputed mapping between the fragments and the follow-up for Victor.place and I agree.

Adding a partial mapping override would be straightforward with little effort. But problem as I see it is a standards one.

Partial mapping override

I can see two ways of doing this, the shoddy way by assigning an isotope label and comparing isotopes and subclassing Chem.rdFMCS.MCSAtomCompare. The former would need no change to monster._get_atom_maps as this accepts unpacked kwargs which are passed to Chem.rdFMCS.FindMCS, but it would not match atoms of different elements. The second would mean that I could clean up the messiness of monster._get_atom_maps, which is inelegant in its prevention of dummy atom to dummy atom and hydrogen atom mapping and technically bad on some corner cases. The issue would be switching from the rdFMCS enum system to the rdFMCS parameter system which AFIK aren’t switchable —but it would take like 20 minutes to hard code a case-switch.

I suppose it would be a new optional argument in Monster.combine (and Victor.combine and its annoying error catching gunk) which gets parsed depending on its schema.

Choice

Mol and sdf do not have atom names —rdkit_to_params.Params gives atom names to the followup— so I think atom index is the best option. Plus, the indices used for a Chem.Mol from a SMILES follow the linear left-to-right order, so the followup will be constant in its numbering.

With Walton which has a method for superposition by map I noticed that it was annoying determining how to map the atoms:

demo = Walton.from_smiles(resorcinol='c1ccc(O)cc1O', eugenol='Oc1ccc(cc1OC)CC=C') # create instance demo.align_by_map({(0,1):{4:0, 3:1, 2:2}})  # align molecules by atom indices 
demo()  # merge (Fragmenstein's Monster) 
demo.show3d()  # show 2d, names and 3d 

Basically I used Monster.draw_nicely method to match up the atoms as it shows the indices (example): image

So it is not impossible, just awkward.

An extreme option would be displaying the SVG of a molecule drawn with atom indices labelled is modified to have a class on the atom index labels and a JS function added that on clicking them passes the molecule and index to python. The communication between JS to Python is different between Jupyter notebook classic and Colab but is majorly absent in Jupyter Lab —making a widget is way way overkill. image

Format

The Monster.show_comparison() gets the “origin” from the positioned_mol. The origins being a list (index=atom index) of lists of strings with inspiration mol name dot inspiration mol atom index, Which is different from the horrible mapping of the followup to combined mols (as fragments) from Unmerger.

# followup atom to inspiration atom
mapping = [[], ['hit1.1']]

I think its pretty reasonable, but for a partial mapping, a dictionary may be better:

# followup atom to inspiration atom
mapping = {1: 'hit1.1'}

Is it still too complicated? @rsanchezgarc what do you reckon?

rsanchezgarc commented 2 years ago

I think that there is a third cheaper option for implementing the Partial mapping override. Wouldn't it be enough to enumerate all default MCS solution and filter out those that are not compatible with the mapping provided by the user? The isotope trick sounds too hacky I would prefer to avoid it. Improving _get_atom_maps should be the best option, but I don't know how difficult would it be.

With respect to how to perform the Choice, I guess that the user only needs to display in the same figure the follow-up and the inspirational fragment with numbered atoms. I think that providing an interactive GUI to select the atoms is overkilling. One think that may be needed while displaying the fragment is to provide some protein context (e.g., closest residue).

Finally, for the mapping format provided by the user (within monster.combine and victor.combine), I would prefer to have a nested dictionary rather that your options:

mapping = { 'hit1': {1:1,1:5} 'hit2': {3:3,4:4,4:6}}

But providing it is well documented, any option will work

matteoferla commented 2 years ago

I'll get back to you on the first part, but that is a good point.

I like your proposal of the nested dictionary where the hit name gets the top level as opposed to the atom id.

About showing the protein neighbourhood, I have figure out how to show neighbours in nglview (the notebook widget) image So that is possible. Although, the problem there is the atoms are labelled by atomname and not index. And interesting turn however is that to show the neighbourhood now I know one can send JS code for execution to the NGL widget frontend which means that that selection snippet I gave to Tyler could be used and made to return a response to the Python kernel.

matteoferla commented 2 years ago

I just did a pull requests (26th) and in writing up the summary and showing an impossible corner case I spotted a bug! But the changes that will be merged in a future pull are:

Schema mods

I have added the functionality discussed along with two corner cases using negative numbers to forbid certain matches:

{ "hit1": {1:1,1:5, -1: 4} "hit2": {3:-1,4:4,4:6}}

One potential issue is that one cannot specify optional matches. This would complicate the input system a lot. For example using complex numbers could work on paper, but would make it very byzantine.

{ "hit1": {1:1,1:5+1j} "hit2": {4:4,4:6}}

The other alternative is using enums. But it seems overly complicated for a minor feature.

Expand

One major issue I encountered was that the unmerge step was wasting time trying map combinations that where to be rejected anyway. As a result I added yet another placement tactic 'expansion'. This is the same as 'no blend' tactic but calculates the potential maps as normal but then gets the one that fits the most atoms and for each of those 'primary' maps re-calculates the maps for the other hits but constrained to the primary map. The MCS works by calculating an atom to atom matrix and a bond to bond matrix and finding the best connection: by virtue of having a more punitive atom to atom, the map recalculation is faster making it quicker than unmerge's conbination testing. It still is not fast though.

Death case

As mentioned I just run a test that is impossible and a side bug arose.

Namely, the placement of a compound with a different length linker —addressable by BRICS decomposition of the hits only:

benzylbenzene = Chem.MolFromSmiles('c1ccccc1Cc1ccccc1')
benzylbenzene.SetProp('_Name', 'benzylbenzene')
assert AllChem.EmbedMolecule(dibenzomethane) == 0
# extra carbon in bridge
Monster([benzylbenzene,]).place_smiles('c1ccccc1CCc1ccccc1')

In fact similar problems like this is the reason why the prior-merging of hits (the 'partial' tactic) was so unfruitful.

matteoferla commented 2 years ago

PS. I should have said https://github.com/matteoferla/Fragmenstein/blob/dev-issue23/documentation/monster/custom_mapping.md has one of the test cases used