mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License

Will has_exact_match care about trivial hydrogens or chiral centers? #102

Closed Boxylmer closed 7 months ago

Boxylmer commented 10 months ago

I'm working on a possible way to automatically find "important" functional groups within a set of SMILES strings. This involves:

  1. Scanning the dataset for all atoms, their hybridization, and aromaticity, which are lumped into tokens (most sets usually have 23-30 unique tokens).
  2. Generating all possible fragments of these tokens of size N (typically 3).
  3. Searching for the number of all possible occurrences (i.e., including overlapping ones) within the dataset.

I've got the generation working, but I'm not able to search quickly, and I'd like to find out which method is most appropriate for this. Given that I can generate the fragments manually to avoid iterating through SMILES or SMARTS queries, which search function should I use?
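To make step 2 concrete, here is a minimal sketch in plain Julia (no MolecularGraph.jl calls). It assumes a token can be represented as a tuple of (atomic number, aromatic flag, hybridization code) and that a linear fragment read forwards or backwards counts as the same fragment. The names `canonical` and `linear_fragments` are hypothetical and are not part of any library.

```julia
# A linear fragment reads the same in both directions, so keep the
# lexicographically smaller of the two readings as its canonical form.
canonical(frag::NTuple{N,T}) where {N,T} = min(frag, reverse(frag))

# Enumerate every canonical linear fragment of length n over the token alphabet.
function linear_fragments(tokens::Vector{T}, n::Int) where {T}
    frags = Set{NTuple{n,T}}()
    for combo in Iterators.product(ntuple(_ -> tokens, n)...)
        push!(frags, canonical(combo))
    end
    return frags
end

# Assumed token encoding: (atomic number, aromatic?, hybridization code).
tokens = [(0x06, false, 0x03), (0x06, true, 0x02), (0x08, false, 0x03)]
frags = linear_fragments(tokens, 3)
```

With k tokens and N = 3 this gives (k^3 + k^2)/2 distinct fragments, so an alphabet of 23-30 tokens stays in the low tens of thousands of candidates.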

mojaie commented 7 months ago

I apologize for the very late reply. Sorry if I've misunderstood your question, but would this be similar to the "Functional group analysis" task in the following tutorial?

https://mojaie.github.io/MolecularGraph.jl_notebook/substructure_and_query.jl.html

I think the only way to do this is to iterate through the whole dataset, as you mentioned, at least in MolecularGraph.jl. I'm also interested in this field, and there may be some room to improve the performance of the substructure search algorithms.

Boxylmer commented 7 months ago

No issues on the delay! We're all busy and I really appreciate this project and the work you've put into it.

Background to this mini project: I want to see if arbitrarily fragmenting molecules can allow me to do data augmentation through building "functional group graphs" instead of graphs of atoms. This way, I can have multiple "functional group graphs" generated from the same molecule that has a property associated with it.

Atom tokens need to be fast to compare, since they serve as the building blocks for the functional groups they make up.

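# Outer constructor: build the token for atom `idx` of `mol`.
# The AtomToken struct and hybridization_symbol_to_int are defined elsewhere (not shown here).
function AtomToken(mol::SMILESMolGraph, idx::Integer)
    aromaticity = is_aromatic(mol)[idx]  # aromatic flag for this atom
    atomic_number = UInt8(atomnumber(atomsymbol(mol)[idx]))  # element symbol -> atomic number
    hybrid = UInt8(hybridization_symbol_to_int(hybridization(mol)[idx]))  # hybridization as a small integer code
    return AtomToken(atomic_number, aromaticity, hybrid)
end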
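For readers following along, a token like this could be backed by a small immutable struct. The sketch below is only an assumption about the layout, since the struct definition does not appear in this thread. The integer coding of the hybridization and the helpers `pack_token` and `count_tokens` are hypothetical as well.

```julia
# Assumed layout: three small fields, so the struct is a plain-bits type.
struct AtomToken
    atomic_number::UInt8
    aromatic::Bool
    hybridization::UInt8   # assumed coding, e.g. 0 = none, 1 = sp, 2 = sp2, 3 = sp3
end

# Pack a token into a single integer (hypothetical helper), handy as a sort key.
pack_token(t::AtomToken) =
    (UInt32(t.atomic_number) << 16) | (UInt32(t.aromatic) << 8) | UInt32(t.hybridization)

# Tally tokens with a Dict. The default == and hash of an immutable struct
# with plain-bits fields compare field contents, so equal tokens share one entry.
function count_tokens(tokens)
    counts = Dict{AtomToken,Int}()
    for t in tokens
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

aromatic_c = AtomToken(6, true, 2)   # aromatic sp2 carbon
sp3_c = AtomToken(6, false, 3)       # aliphatic sp3 carbon
counts = count_tokens([aromatic_c, sp3_c, AtomToken(6, true, 2)])
```

Keeping every field at one byte means comparisons and hashing touch only a few bytes per token, which is the property the quick-comparison requirement seems to call for.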

I wasn't able to generate arbitrary SMARTS with this, so I constructed my own graph of these tokens and wrote a graph search that finds every instance of a linear group. Restricting the search to linear groups is a concession that simplifies the subgraph search significantly, but it also means that only groups of size n=3 make sense, since branched subgroups become possible at n=4.
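A rough sketch of what such a linear n=3 search could look like, in plain Julia: every three-atom path has a unique central atom, so visiting each atom and pairing up its neighbors finds each path exactly once, overlapping ones included. This is not the code from my project. The adjacency-list input and the name `count_linear_triples` are assumptions for illustration, and tokens are plain strings here to keep the example short.

```julia
# Count every linear 3-atom fragment in one molecule.
# tokens[i] is the token of atom i; adj[i] lists the neighbors of atom i.
function count_linear_triples(tokens::Vector{T}, adj::Vector{Vector{Int}}) where {T}
    counts = Dict{NTuple{3,T},Int}()
    for center in eachindex(adj)
        nbrs = adj[center]
        # Each unordered pair of neighbors gives one path through `center`.
        for a in 1:length(nbrs)-1, b in a+1:length(nbrs)
            frag = (tokens[nbrs[a]], tokens[center], tokens[nbrs[b]])
            key = min(frag, reverse(frag))   # same fragment in either direction
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

# Heavy atoms of propan-2-ol: C1-C2(-O3)-C4
tokens = ["C", "C", "O", "C"]
adj = [[2], [1, 3, 4], [2], [2]]
counts = count_linear_triples(tokens, adj)
```

Here the central carbon contributes two C-C-O paths and one C-C-C path. Summing these per-molecule dictionaries over the dataset gives the overall occurrence counts.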