mqcomplab / MultipleComparisons

GNU General Public License v3.0
calculation of eSALI #3

Open albertma1986 opened 1 year ago

albertma1986 commented 1 year ago

Hi,

I read the paper - "Exploring activity landscapes with extended similarity: is Tanimoto enough?"[https://onlinelibrary.wiley.com/doi/epdf/10.1002/minf.202300056]

I am trying to relate the code in this repo to the equations in the paper, specifically the one in this image: [image]

Is the calculate_counters() function in condensed_version/MultComp.py responsible for computing the Se(M) value?

Sorry if I missed anything in the paper or in the docstrings, but I cannot find a formula for how Se(M) is calculated, or which part of the code is responsible for it.

My task is simple, I am just trying to calculate the eSALI for my dataset, I have the numerical descriptors and the properties of the compounds.

Albert

ramirandaq commented 1 year ago

Hi Albert, thanks a lot for your interest in our work and for reaching out to us. The calculate_counters function gives the main ingredients needed to calculate the extended similarity (Se(M)), which can then be used in the eSALI formula. In this file https://github.com/ramirandaq/MultipleComparisons/blob/master/condensed_version/MultComp.py we have an updated version of the formula. Please notice two things:

1- Below line 117 there's a sample calculation showing how to get the Se(M) value. Given a set of fingerprints arranged in a matrix (line 121), the first step is to calculate the sum of every column (line 130; this is the most time-demanding step of the whole process). Then one needs to generate a data_sets instance (line 133), where one appends the number of fingerprints (n) to the vector with the sum of the columns. This is the main input needed to calculate the counters.

2- Once the counters are calculated, the code starting in line 144 shows how to calculate several extended similarity indices. First, I strongly recommend only calculating the non-weighted version of the index (starting in line 179). Second, if you want to calculate the extended Tanimoto index, please see line 200 (although in several studies we've seen that the Russell-Rao index can give comparable, if not better, results; see line 204).

More importantly, please let us know if you have any other doubts/comments and if we can help with anything. If you want, we could send you a script with a more concise way to perform these calculations (this one is, purposely, very general, since we used it as a template for all the applications we are exploring in our group). If your dataset is too big, we also have more efficient ways to perform these calculations (although this one, as reported in the paper, already scales as O(N)).

All the best, Ramon
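
For readers following along, the steps Ramon describes (sum the columns, classify each column into a counter, then form an index from the counters) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in MultComp.py: the function names are hypothetical, the coincidence threshold default of `n % 2` is an assumption, and the indices here use plain, fully unweighted column counts, which may differ from how the repo defines its "non-weighted" variants.

```python
from typing import Dict, List, Optional, Sequence


def column_sums(fingerprints: Sequence[Sequence[int]]) -> List[int]:
    """Sum every column (bit position) over all fingerprints: O(N) in the number of rows."""
    return [sum(col) for col in zip(*fingerprints)]


def count_columns(col_sums: Sequence[int], n: int,
                  c_threshold: Optional[int] = None) -> Dict[str, int]:
    """Classify each column as 1-similarity, 0-similarity or dissimilarity.

    A column counts as 'similar' when the excess of 1s over 0s (or of 0s over 1s)
    is larger than the coincidence threshold. The default threshold (n % 2) is an
    assumption made for this sketch.
    """
    if c_threshold is None:
        c_threshold = n % 2
    counts = {"one_sim": 0, "zero_sim": 0, "dis": 0}
    for s in col_sums:
        if 2 * s - n > c_threshold:      # mostly 1s in this column
            counts["one_sim"] += 1
        elif n - 2 * s > c_threshold:    # mostly 0s in this column
            counts["zero_sim"] += 1
        else:                            # mixed column
            counts["dis"] += 1
    return counts


def extended_russell_rao(counts: Dict[str, int]) -> float:
    """Unweighted sketch: 1-similarity columns over all columns."""
    total = counts["one_sim"] + counts["zero_sim"] + counts["dis"]
    return counts["one_sim"] / total


def extended_tanimoto(counts: Dict[str, int]) -> float:
    """Unweighted sketch: 1-similarity columns over (1-similarity + dissimilarity) columns."""
    denom = counts["one_sim"] + counts["dis"]
    if denom == 0:
        raise ValueError("index undefined: every column is a 0-similarity column")
    return counts["one_sim"] / denom


if __name__ == "__main__":
    fps = [[1, 1, 0, 0],
           [1, 0, 0, 1],
           [1, 1, 0, 0]]
    sums = column_sums(fps)
    counts = count_columns(sums, n=len(fps))
    print(sums, counts, extended_russell_rao(counts), extended_tanimoto(counts))
```

Only the column sums depend on the number of fingerprints, which is why the whole calculation scales linearly with the size of the set.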

albertma1986 commented 1 year ago

Hi Ramon, thanks so much for the explanation. As far as I understand the extended similarity framework, it could be extended to other similarity (distance) metrics, for instance Euclidean distance if I have a set of compounds each represented by a latent (non-binary) vector.

I am not a math expert, but I believe it would not make sense to pass such a matrix of latent vectors to the calculate_counters() function (please correct me if I am wrong). Is there an example of calculating an "extended Euclidean distance" (i.e. the denominator, 1 - Se(M), but in the sense of a Euclidean distance) for the formula? [image]

Sorry if I am talking nonsense; I am not sure if it is even doable. Thanks, Albert

ramirandaq commented 1 year ago

Hi, no problem! We don't have the extended Euclidean in this module. It'll be tricky to do this with Euclidean, but relatively easy to do with the square of the Euclidean distance. Basically, instead of using the "RMSD", use the "MSD", i.e. without the square root.
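
As an illustration of why the squared distance is the easy case, the mean squared Euclidean distance over all pairs of vectors can be obtained from the column sums and the total sum of squares alone, without looping over pairs. The sketch below assumes "MSD" means the average of the squared distance over all distinct pairs; the exact normalisation the authors have in mind is not stated in this thread, and the function names are hypothetical. It relies on the identity Σ_{i<j} ||x_i − x_j||² = N Σ_i ||x_i||² − ||Σ_i x_i||².

```python
from itertools import combinations
from typing import Sequence


def msd_linear(vectors: Sequence[Sequence[float]]) -> float:
    """Mean squared Euclidean distance over all distinct pairs, in O(N * D)."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    col_sum = [sum(col) for col in zip(*vectors)]      # sum of each coordinate
    sq_sum = sum(x * x for v in vectors for x in v)    # sum of all squared entries
    # Sum of squared distances over all distinct pairs, without visiting any pair
    pair_total = n * sq_sum - sum(s * s for s in col_sum)
    return 2.0 * pair_total / (n * (n - 1))


def msd_pairwise(vectors: Sequence[Sequence[float]]) -> float:
    """Brute-force O(N^2 * D) reference, used only to check msd_linear."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(u, v))
                for u, v in combinations(vectors, 2)]
    return sum(sq_dists) / len(sq_dists)


if __name__ == "__main__":
    latent = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
    print(msd_linear(latent), msd_pairwise(latent))
```

The plain Euclidean distance (or RMSD-style averages of it) has no such shortcut, because the square root prevents the pairwise sum from collapsing into column-wise totals.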