mqcomplab / MultipleComparisons

GNU General Public License v3.0

Generate X unique similarity values for X fingerprints with respect to all molecules in the dataset. #1

Open AdamCoxson opened 1 year ago

AdamCoxson commented 1 year ago

Hi,

Say I have a set of 10,000 atoms, each one with a fingerprint of 1000 continuous (normalised) scalar values to describe it. Can I use this software to generate 10,000 scalar values, one for each atom, that represent the similarity of the respective fingerprint against all other fingerprints (or some arbitrary reference) simultaneously?

I've been playing with the code, but from my understanding it only generates a single scalar value to show the similarity of the dataset as a whole? I've gotten a bit lost!

Basically, I have used N-body Iteratively Contracted Equivariants to build up representations of the local atomic environments for all of the atoms in a set of 4000 organic molecules. A representation for a single atom can consist of many continuous scalar values (let's say 10,000 atoms with 1000 elements in each atomic 'fingerprint' for the sake of argument). I can treat these like fingerprints, but I don't want a pairwise comparison. I want to apply some similarity metric that compares these representations and returns an array of 'similarity' scores, one for each fingerprint. Then I can plot a heatmap like the one below, where the phi metric on the colourbar scale has been replaced by the 'similarity of atomic environment'.

Obviously, I could just take the sum of all 1000 elements per atom and use that, but surely there is some sort of similarity metric that does a better job.

[image: example heatmap with the phi metric on the colourbar]

ramirandaq commented 1 year ago

Hi, thanks for your interest in our method! You are correct: if you just use the code in its original form, you'll only get a single number for the whole set (e.g., the extended similarity of the set). I'd recommend looking at a related idea that we proposed recently: the complementary similarity (first described here: https://pubs.rsc.org/en/content/articlehtml/2021/cp/d1cp04019g). Basically, this will give you a single number for each fingerprint, which gives you an idea of how each object relates to the rest of the original set. We have shown that this can be used to distinguish between "medoid"-like structures and "outlier"-like structures, so it could provide information about different subsets in the data. Let us know if you need help with this, or if you want to talk a bit more we can set up a Zoom meeting.
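Conceptually it's just a leave-one-out extended similarity; a minimal sketch (with extended_similarity standing in for whatever routine returns the extended similarity of a set, not the exact function in this repo) would look like:

import numpy as np

def complementary_similarities(fingerprints, extended_similarity):
    # Complementary similarity of fingerprint i = extended similarity
    # of the whole set with fingerprint i left out.
    n = len(fingerprints)
    return np.array([extended_similarity(np.delete(fingerprints, i, axis=0))
                     for i in range(n)])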

AdamCoxson commented 1 year ago

Thanks for the quick reply!

Okay from that paper, I see three ways of obtaining element-wise similarities:

1) Group similarity. Obtain Gi by summing the pairwise similarities Sij over all j != i.

2) Complementary similarity. To get each Si, calculate the extended similarity of the entire set with element i removed.

3) Similarity to medoid. Find the medoid of the entire set, call its index k, and calculate all pairwise Sik.

From what I recall, [1] and [2] give the same degree of information, but [2] has better scaling. For [3], I initially thought about finding an 'average reference' fingerprint that has the most similarity to all other fingerprints. This seems to be the same sort of idea, but I can use your medoid finder function.

I think method [1] will suffice for now, and I can switch to method [2] if I run into computational cost issues. What I am curious about is whether you have any intuition about method [2] compared to method [3]. Could you say whether one is better than the other for a clustering-based exercise? I have a hunch that [2] will be better as it is the more general method, i.e. if the diversity of the fingerprints is very high, it may be hard for [3] to find a good reference to compare all other fingerprints against.
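For [3], what I was picturing is roughly the sketch below (just an illustration using a generic continuous Tanimoto between each fingerprint and a reference row; ref_idx stands in for whatever index your medoid finder returns, it's not your actual API):

import numpy as np

def similarities_to_reference(fingerprints, ref_idx):
    # Continuous Tanimoto of every fingerprint against the reference row:
    # T(x, y) = x.y / (|x|^2 + |y|^2 - x.y)
    ref = fingerprints[ref_idx]
    dot = fingerprints @ ref
    return dot / (np.sum(fingerprints**2, axis=1) + np.sum(ref**2) - dot)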

Thanks!

AdamCoxson commented 1 year ago

I do need a pointer on the code.

import numpy as np

data = load_pkl_data("E:/external_models/NICE/nice_4026mols")
fingerprints = normalize_2d(data[0])

n_fingerprints = len(fingerprints)
condensed = np.sum(fingerprints, axis=0)                       # Column sums
data_sets = np.array([np.append(condensed, n_fingerprints)])   # Append the set size to form the condensed data set
counters = calculate_counters(data_sets)                       # Calculate counters

jt_nw = counters['a'] / (counters['a'] + counters['total_dis'])  # Non-weighted Jaccard-Tanimoto

fingerprints is a 2D array of 4026 molecules, each with a continuous scalar fingerprint of 2500 elements, which I then normalise. This uses the calculate_counters function from condensed_version/multiComp. I then apply the unweighted Jaccard-Tanimoto metric.

With this, I am trying to get the continuous extended similarity metric for all fingerprints in my data set.

Does this work as intended for continuous data? I.e. does it automatically use the continuous JT metric, or do I need to code my own function for the cJT explicitly? I'm a bit confused by the notation in Table 1.

ramirandaq commented 1 year ago


No problem, it's a pleasure to help! You are right, [1] and [2] are equivalent, with [2] having a much better scaling. I'd recommend going with [2]: while the improved performance might not be noticeable for sets of ~10^4 molecules, ideally you'd want to have everything set up in case you need to handle much bigger sets. As for [3], we've never tested it, but my intuition is that it'd probably give relatively similar results to [1] and [2]. What we noticed with [2] in the paper I mentioned before is that it can cleanly separate your set into different regions depending on how "well-organized" they are (see Fig. 5 in the paper, the pre-clustering step). This can be extremely helpful, and it's an idea we've been playing with a lot in new chemical space visualization techniques right now.
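With the condensed counters, [2] is cheap because removing fingerprint i just means subtracting its row from the column sums, so you never rebuild the set. Roughly (a sketch reusing calculate_counters and the counter keys from your other comment, not tested):

import numpy as np

def complementary_similarities(fingerprints):
    n = len(fingerprints)
    total = np.sum(fingerprints, axis=0)          # column sums over the full set
    sims = np.empty(n)
    for i in range(n):
        condensed_i = total - fingerprints[i]     # column sums with fingerprint i removed
        data_set = np.array([np.append(condensed_i, n - 1)])
        counters = calculate_counters(data_set)
        # non-weighted Jaccard-Tanimoto of the remaining n - 1 fingerprints
        sims[i] = counters['a'] / (counters['a'] + counters['total_dis'])
    return sims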

ramirandaq commented 1 year ago


If you just normalize and then calculate the extended similarity on top of that, it'll be equivalent to Variant 3 in this paper (https://link.springer.com/article/10.1007/s10822-022-00444-7). However, I'd probably recommend doing something like Variant 2 instead. This is basically just applying 1 - np.abs(normalized_data - np.mean(normalized_data, axis=0)) to the normalized_data array, and then doing all the similarity calculations with this new matrix of fingerprints.
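In terms of your snippet, that would look roughly like this (just a sketch reusing your variable names; double-check the non-weighted index expression against Table 1):

import numpy as np

normalized_data = normalize_2d(data[0])              # your normalization step
centroid = np.mean(normalized_data, axis=0)          # column-wise mean fingerprint
variant2 = 1 - np.abs(normalized_data - centroid)    # Variant 2 transform

n_fingerprints = len(variant2)
condensed = np.sum(variant2, axis=0)                              # column sums of the transformed data
data_sets = np.array([np.append(condensed, n_fingerprints)])
counters = calculate_counters(data_sets)
jt_nw = counters['a'] / (counters['a'] + counters['total_dis'])   # non-weighted Jaccard-Tanimoto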