mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License
189 stars 27 forks source link

How can I add a bond between two fragments, given their fragment indices? #85

Closed Boxylmer closed 1 year ago

Boxylmer commented 1 year ago

Sorry for the surplus of issues lately!

Is it possible to stitch two fragments together (or even separate GraphMol structs, I don't mind!) given the atom indices of each and the bond type? If so, how could this be done?

mojaie commented 1 year ago

There is MolecularGraph.Graph.disjointunion for the purpose (see src/graph/disjointunion.jl, and src/graph/cycle.jl for practical usage). However, it is a bit messy and may have performance problems, so another way would be provided in the next version (maybe by using union method in Graphs.jl).

Boxylmer commented 1 year ago

I see!

If I know the indices of the atoms I'd like to bond, and the two smiles I have are provided, I found I can also...

  1. Concat the two smiles via sm1 * "." * sm2
  2. Track the indices of the atoms in each fragment (working on a general way to do this now)
  3. Add a bond manually (SmilesBond) between the two fragments.

Your Jupyter examples mention this may be computationally expensive to do, though in our case we only have to do it once to prepare some data which we can end up saving later before doing the real bulk work. I fear editing these structs directly as I don't fully understand them yet. Is there any gotcha I might run into by adding this bond manually, or is it just inefficient (and not risking unpredictable results) to do so?

mojaie commented 1 year ago

You can safely add bonds by addedge!(mol, v1, v2, bond). Removing nodes/edges may cause unpredictable results, but in your case, concatenating two disconnected molecules by addedge! may work well. This restriction is due to vector-based structure of molecular properties for faster descriptor calculation. In the next version, editable molecule type (like RWMol in RDKit) based on Dict would be implemented.

Boxylmer commented 1 year ago

Excellent! Also vectors in some cases can actually be faster for lookups due to the cost of hashing data for sets and dicts, it looks like this cost breakover point is about ~60 items since hashing methods are O(1) and lookups from vectors are usually O(n).

Thank you again!

mojaie commented 1 year ago

You can find many descriptor functions in src/properties.jl uses array broadcasting. I cannot imagine the case performance of lookup is important (Do you have any idea?). There are many set operations (for example, in subgraph isomorphism matching) so Set of node/edge indices are generated before these calculations.