[Similarity module] Add more similarity measurements

theochem / Selector

Python library of algorithms for selecting diverse subsets of data for machine-learning.

https://selector.qcdevs.org

GNU General Public License v3.0


[Similarity module] Add more similarity measurements #124

Open FanwangM opened 1 year ago

FanwangM commented 1 year ago

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html in the similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or SciPy.

One question I have: shall we separate the similarity and distance measurements? I get confused by some measurements, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but they treat it as a similarity (https://vlachosgroup.github.io/AIMSim/implemented_metrics.html). If we decide to distinguish them, we may need to split them into separate similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

PaulWAyers commented 1 year ago

As explained on Wikipedia there is a Tanimoto similarity and a Tanimoto distance. So both exist.

The easiest test is to compare an object to itself. Its similarity is greater than zero (often one) and the distance is zero.
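The self-comparison test can be sketched as follows for binary fingerprints. The function names are hypothetical (they are not taken from Selector or AIMSim), and the Tanimoto distance is assumed here to be `1 - similarity`, which is only one of the definitions in use.

```python
import numpy as np


def tanimoto_similarity(x, y):
    """Tanimoto (Jaccard) similarity of two binary fingerprints."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.count_nonzero(x | y)
    if union == 0:
        # Assumption: two all-zero fingerprints are treated as identical.
        return 1.0
    return np.count_nonzero(x & y) / union


def tanimoto_distance(x, y):
    """Tanimoto distance, taken here as 1 - similarity (an assumption)."""
    return 1.0 - tanimoto_similarity(x, y)


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

# Comparing an object to itself: similarity is one, distance is zero.
self_sim = tanimoto_similarity(x, x)
self_dist = tanimoto_distance(x, x)
print(self_sim, self_dist)  # 1.0 0.0

# Two different fingerprints share 2 of the 4 bits that are on in either.
print(tanimoto_similarity(x, y), tanimoto_distance(x, y))  # 0.5 0.5
```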

I feel like it is better to add AIMSim as a dependency. Implementing 30+ methods is a lot of work.

We may wish to have a few basic methods implemented; the most common distance metrics and similarity measures are already available in scikit-learn: distances in sklearn.metrics.DistanceMetric, and similarities and divergences in sklearn.metrics.pairwise.

I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then consider interfacing to AIMSim as a follow-up task.
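As a rough illustration of what scikit-learn already provides, the sketch below computes a Jaccard distance matrix and a cosine similarity matrix for a few made-up fingerprints. It assumes scikit-learn is installed; the fingerprint values are invented for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up binary fingerprints, one per row.
fps = np.array(
    [
        [1, 0, 1, 0, 1],
        [1, 1, 1, 0, 0],
        [0, 0, 0, 1, 1],
    ],
    dtype=bool,
)

# A distance: zero on the diagonal, larger means less alike.
jaccard_dist = pairwise_distances(fps, metric="jaccard")

# A similarity: one on the diagonal, larger means more alike.
cos_sim = cosine_similarity(fps.astype(float))

print(jaccard_dist)
print(cos_sim)
```

The diagonals make the distinction visible: the distance matrix has zeros there, while the similarity matrix has ones.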

I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them.
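A "converter" could look something like the sketch below. Both function names and both conversion rules are assumptions chosen for illustration, not the API of any existing module; the two rules are also not inverses of each other, so a real converter would need to document which rule it applies.

```python
import numpy as np


def similarity_to_distance(sim):
    """Complement rule, assuming similarities lie in [0, 1]."""
    sim = np.asarray(sim, dtype=float)
    return 1.0 - sim


def distance_to_similarity(dist):
    """Reciprocal rule, assuming non-negative (possibly unbounded) distances."""
    dist = np.asarray(dist, dtype=float)
    return 1.0 / (1.0 + dist)


sims = np.array([1.0, 0.5, 0.0])
dists = similarity_to_distance(sims)
print(dists)  # identical objects map to distance 0

# Zero distance maps back to similarity 1; larger distances shrink toward 0.
back = distance_to_similarity(dists)
print(back)
```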

FanwangM commented 1 year ago

Yes, we should make them distinguishable and as obvious as we can to avoid any ambiguity.

FarnazH commented 1 year ago

Update: We decided not to include any wrappers for functionality in other packages (reason: additional overhead and unnecessary dependencies). Instead, we showcase how our package works with other libraries in notebooks/tutorials.

PaulWAyers commented 1 year ago

@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them.

ramirandaq commented 1 year ago

Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints.

sim_indices.txt

FanwangM commented 1 year ago

Thanks for sharing. I am copying @ramirandaq's code for readability.

import numpy as np

# Pairwise similarity indices calculated over binary fingerprints

def indicators(x, y):
    """Calculating base descriptors
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p

# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russel-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

a, d, dis, p = indicators(x, y)

bub = ((a * d)**0.5 + a)/((a * d)**0.5 + a + dis)

fai = (a + 0.5 * d)/p

ja = (3 * a)/(3 * a + dis)

jt = a/(a + dis)

rt = (a + d)/(p + dis)

rr = a/p

sm = (a + d)/p

ss1 = a/(a + 2 * dis)

ss2 = (2 * (a + d))/(p + (a + d))
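The same base descriptors can also be computed for every pair of fingerprints at once. The following is a minimal sketch, assuming the fingerprints are stacked as rows of a 0/1 matrix; `pairwise_indicators` is a hypothetical helper name, and only two of the indices are shown.

```python
import numpy as np


def pairwise_indicators(X):
    """Return matrices a, d, dis (one entry per pair of rows) and length p."""
    X = np.asarray(X, dtype=int)
    p = X.shape[1]
    a = X @ X.T              # common on bits for each pair
    d = (1 - X) @ (1 - X).T  # common off bits for each pair
    dis = p - a - d          # 1-0 mismatches for each pair
    return a, d, dis, p


X = np.array(
    [
        [1, 0, 1, 0, 1],
        [1, 1, 1, 0, 0],
    ]
)

a, d, dis, p = pairwise_indicators(X)

# Caveat: a + dis is zero for a pair of all-zero fingerprints,
# so jt would need special handling in that case.
jt = a / (a + dis)  # Jaccard-Tanimoto
sm = (a + d) / p    # Sokal-Michener

print(jt)  # off-diagonal entries are 0.5
print(sm)  # off-diagonal entries are 0.6
```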

PaulWAyers commented 1 year ago

Just to clarify, all of these are "bitwise". Reading "1" or "on" as "an element/feature is present", we have:

a = logical "and" between bitstrings; the intersection between the sets.

d = logical "nor" (neither bit is on) between bitstrings; {universe} - {union} between the sets. So these are "features that are not present in either set".

dis = logical "exclusive or" between bitstrings; {union} - {intersection} between the sets. So these are "features that are present in one item, but not present in the other".

As Ramon notes, most of these are just one-line formulas. For data that aren't "logical", there are obviously more complicated forms of similarity, though most will be (some sort of) Mahalanobis-distance-related function.
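The logical reading of the three counts can be checked directly with NumPy boolean operators. This is a small sketch using the example fingerprints from earlier in the thread.

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1], dtype=bool)
y = np.array([1, 1, 1, 0, 0], dtype=bool)

a = np.count_nonzero(x & y)     # on in both: intersection
d = np.count_nonzero(~(x | y))  # off in both: universe minus union
dis = np.count_nonzero(x ^ y)   # on in exactly one: union minus intersection
p = x.size

# Every bit position falls into exactly one of the three groups.
assert a + d + dis == p
print(a, d, dis, p)  # 2 1 2 5
```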
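For real-valued features, a Mahalanobis distance might be sketched as below. The function name and the random data are assumptions made for illustration; with an identity covariance the result reduces to the ordinary Euclidean distance.

```python
import numpy as np


def mahalanobis_distance(u, v, cov):
    """Mahalanobis distance between u and v for covariance matrix cov."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Made-up data: 50 samples with 3 real-valued features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
cov = np.cov(features, rowvar=False)

d_self = mahalanobis_distance(features[0], features[0], cov)
d_pair = mahalanobis_distance(features[0], features[1], cov)
print(d_self, d_pair)

# With an identity covariance this is the Euclidean distance.
d_euclid = mahalanobis_distance([0.0, 0.0], [3.0, 4.0], np.eye(2))
print(d_euclid)  # 5.0
```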

FarnazH commented 1 month ago

@marco-2023, please:

[x] Rename https://github.com/theochem/Selector/blob/main/selector/similarity.py to measures/similarity.py
[x] Move diversity.py and convertor.py to the measures module.
[ ] Implement and test any similarity measures your heart desires (thanks!)