reymond-group / map4

The MinHashed Atom Pair fingerprint of radius 2
MIT License

Folder description:

MAP fingerprint - Design and Documentation

The canonical, non-isomeric, rooted SMILES of the circular substructures CS, from radius one up to a user-given radius n (default n=2, MAP4), are generated for each atom. All atom pairs are extracted, and their minimum topological distance TP is calculated. For each atom pair jk and each considered radius r, a Shingle is encoded as CSrj|TPjk|CSrk, where the two CS are placed in alphabetical order, resulting in n Shingles for each atom pair.

MAP4 atom pair encoding scheme
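The encoding step can be illustrated with a minimal sketch. It assumes the circular substructure SMILES and the topological distance matrix have already been computed (for example with a cheminformatics toolkit). The function name `atom_pair_shingles` and the toy input strings are hypothetical and are not taken from the map4 code.

```python
from itertools import combinations


def atom_pair_shingles(envs, dist):
    """Build CSrj|TPjk|CSrk shingles for every atom pair and radius.

    envs[i][r] is the substructure SMILES rooted at atom i for radius r + 1.
    dist[j][k] is the minimum topological distance between atoms j and k.
    """
    shingles = []
    for j, k in combinations(range(len(envs)), 2):
        # One shingle per radius, pairing the two environments of equal radius.
        for cs_j, cs_k in zip(envs[j], envs[k]):
            # The two substructures are written in alphabetical order.
            first, second = sorted((cs_j, cs_k))
            shingles.append(f"{first}|{dist[j][k]}|{second}")
    return shingles


# Toy input for a three-atom chain; the strings are hand-written
# placeholders, not real rooted SMILES produced by a toolkit.
envs = [["CC", "CCO"], ["CCO", "CCO"], ["CO", "CCO"]]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
shingles = atom_pair_shingles(envs, dist)
```

With three atoms and n=2 this yields two Shingles for each of the three atom pairs. Duplicate Shingles are kept in the returned list here; whether to collapse or count them is left to the caller.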

The resulting list of Shingles is hashed to a set of integers Si using the unique mapping SHA-1, and the corresponding transposed vector sTi is then MinHashed.

MinHash
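The two hashing steps can be sketched in plain Python. This is only an illustration under stated assumptions: taking the first four bytes of the SHA-1 digest, the linear hash family `(a * x + b) % PRIME`, the modulus, the seed, and the names `sha1_to_int` and `minhash` are all choices made for this example and may differ from what map4 does internally.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as modulus (an assumption)


def sha1_to_int(shingle):
    """Map a shingle to an integer using the first 4 bytes of its SHA-1 digest."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash(shingles, dimensions=1024, seed=42):
    """Return a MinHash vector of length `dimensions` for a non-empty shingle list."""
    if not shingles:
        raise ValueError("at least one shingle is required")
    rng = random.Random(seed)
    # One (a, b) pair per output dimension defines a hash function.
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(dimensions)]
    hashed = {sha1_to_int(s) for s in shingles}
    # For each hash function keep the smallest value over all hashed shingles.
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]


fp = minhash(["CC|1|CCO", "CC|2|CO", "CCO|1|CO"], dimensions=16)
```

Because each position stores a minimum over the whole set, the result does not depend on the order of the input Shingles, and two fingerprints are only comparable when they were built with the same seed and number of dimensions.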

To use the MAP4 fingerprint:

To install map4 through Conda:

To install map4 through pip:

Run the fingerprint from the terminal

Or import the MAP4Calculator class in your Python file (see test.py)

Please note that the similarity or dissimilarity between two MinHashed fingerprints cannot be assessed with "standard" Jaccard, Manhattan, or Cosine functions. Because of MinHashing, the order of the features matters and the distance cannot be calculated "feature-wise". A well-written blog post explains why: https://aksakalli.github.io/2016/03/01/jaccard-similarity-with-minhash.html. A custom kernel or loss function therefore needs to be implemented for machine learning applications of MAP4 (e.g. using the distance function found in the test.py script).
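As an illustration of such a custom distance, the sketch below implements the standard MinHash estimator: the Jaccard distance is approximated by the fraction of positions at which two fingerprints differ. The function name `minhash_distance` is hypothetical, and this is not necessarily identical to the distance function in test.py.

```python
def minhash_distance(fp_a, fp_b):
    """Estimate the Jaccard distance between two MinHashed fingerprints.

    Both fingerprints must have the same length and must have been
    produced with the same hash functions.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if len(fp_a) == 0:
        raise ValueError("fingerprints must not be empty")
    # Count the positions where the two minimum hash values agree.
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return 1.0 - matches / len(fp_a)


d_same = minhash_distance([7, 3, 9, 4], [7, 3, 9, 4])
d_close = minhash_distance([7, 3, 9, 4], [7, 3, 1, 4])
```

Values are compared for equality at each position rather than subtracted, which is why numeric metrics such as Manhattan or Cosine distance are not meaningful on these vectors.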

MAP4 - Similarity Search of ChEMBL, Human Metabolome, and SwissProt

Draw a structure, paste its SMILES, or write the linear sequence of a natural peptide. Search for its analogs in the MAP4 or MHFP6 space of ChEMBL, the Human Metabolome Database (HMDB), or the 'below 50 residues' subset of SwissProt.

The MAP4 search can be found at: http://map-search.gdb.tools/.

The code of the MAP4 similarity search can be found in this repository, in the folder MAP4-Similarity-Search.

To run the app locally:

Extended Benchmark

Compounds and training list used to extend the Riniker et al. fingerprint benchmark (S. Riniker, G. Landrum, J. Cheminf., 5, 26 (2013), DOI: 10.1186/1758-2946-5-26, URL: http://www.jcheminf.com/content/5/1/26, GitHub page: https://github.com/rdkit/benchmarking_platform) to peptides.