[Converter module] Conversion between similarity and dissimilarity/distance

FanwangM commented 1 year ago

Shall we add a new module named converter to convert the similarity and dissimilarity/distance back and forth?

[x] Translate the functions from https://rdrr.io/cran/smacof/man/sim2diss.html, but we are not going for the ranking method for now. More discussions are listed in #73.

PaulWAyers commented 1 year ago

I think making a converter is a good idea.

PaulWAyers commented 1 year ago

The basic structure of the converter would be to read in a similarity (or dissimilarity) and then convert it to the other, together with an option that specifies the specific mapping to be used. A basic test is to ensure that the input is recovered after a forward-and-back conversion. (However, not every mapping is invertible.)
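As a rough illustration of that structure, here is a minimal sketch of a converter with a mapping option and a forward-and-back test. The function name `convert`, the `mapping` and `inverse` arguments, and the three mappings are hypothetical choices made for this example; they are not taken from Table 2 or from the Selector code base.

```python
import numpy as np

# Hypothetical mapping table: name -> (similarity-to-distance, distance-to-similarity).
# All three mappings are invertible on strictly positive similarities.
_MAPPINGS = {
    "complement": (lambda s: 1.0 - s, lambda d: 1.0 - d),
    "reciprocal": (lambda s: 1.0 / s, lambda d: 1.0 / d),
    "neg_log": (lambda s: -np.log(s), lambda d: np.exp(-d)),
}


def convert(values, mapping="complement", inverse=False):
    """Apply the chosen mapping element-wise (illustrative sketch only)."""
    if mapping not in _MAPPINGS:
        raise ValueError(f"Unknown mapping: {mapping!r}")
    forward, backward = _MAPPINGS[mapping]
    func = backward if inverse else forward
    return func(np.asarray(values, dtype=float))


# Basic test: the input is recovered after a forward-and-back conversion.
sim = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.5],
                [0.2, 0.5, 1.0]])
for name in _MAPPINGS:
    dist = convert(sim, mapping=name)
    assert np.allclose(convert(dist, mapping=name, inverse=True), sim)
```

A non-invertible mapping (for example, one based on ranks) would need to be excluded from such a round-trip test.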

AWBroscius commented 1 year ago

The sim2diss paper splits similarity measures into 4 types: similarities, correlations, frequencies, and proportions/probabilities. (See Table 2 in the document.)

Do we need conversions for all 4 types of measurement? If so, should we split them into different functions (e.g., sim_to_dist(), corr_to_dist(), freq_to_dist())? There are 12 different conversion methods listed in the paper, and that seems like a lot for a single function to handle.

PaulWAyers commented 1 year ago

My reading of this is that correlations, frequencies, proportions/probabilities, and "similarities" are all types of similarity, in the sense that they are "big for similar things" and "small for dissimilar things." So it is still a similarity <-> dissimilarity issue.

It could be good to offer a simple "scale" parameter to decide whether you want similarities (before conversion or after conversion) to be scaled. Correlations are the "special" similarities where the self-similarity is 1. (Think of the relationship between the statistical correlation matrix (R^2) and the covariance matrix.)

Note that the covariance distance that was mentioned in the old issue (with a typo leaving out the square root) is missing from Table 2:

$$ d(i,j) = \sqrt{s(i,i) + s(j,j) - 2 s(i,j)} $$
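A minimal NumPy sketch of this formula follows. The function name `covariance_distance` and the tolerance `tol` are assumptions made for illustration, not part of any existing API. Small negative values under the square root are treated as round-off and clipped to zero; larger ones raise an error.

```python
import numpy as np


def covariance_distance(s, tol=1e-10):
    """Compute d(i,j) = sqrt(s(i,i) + s(j,j) - 2 s(i,j)) for a square matrix s."""
    s = np.asarray(s, dtype=float)
    if s.ndim != 2 or s.shape[0] != s.shape[1]:
        raise ValueError("Expected a square similarity matrix.")
    diag = np.diag(s)
    # Broadcasting builds s(i,i) + s(j,j) for every pair (i, j).
    squared = diag[:, None] + diag[None, :] - 2.0 * s
    if np.any(squared < -tol):
        raise ValueError("Negative squared distance; the input is probably not PSD.")
    # Remove tiny negative values caused by floating-point round-off.
    return np.sqrt(np.clip(squared, 0.0, None))


# Example: a Gram matrix of the points (0, 0) and (3, 4) gives back
# their Euclidean distance.
points = np.array([[0.0, 0.0], [3.0, 4.0]])
gram = points @ points.T
print(covariance_distance(gram))
```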

Note also that some of these formulas require that similarities are no greater than one. Others merely require that $s(i,j)$ is a positive semidefinite matrix (so that the quantity under the square root is non-negative and the distance is real). We may need to implement a "checker" that sees whether matrices are positive semidefinite, but in general we do not have all the distances/similarities. Note that a hard-zero eigenvalue basically indicates that some data is a linear combination of the other data (or, somewhat commonly, that the same data was entered redundantly).
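One possible form of such a checker is sketched below, assuming the full symmetric matrix is available. The names `is_positive_semidefinite` and `count_zero_eigenvalues` and the tolerance value are hypothetical; `np.linalg.eigvalsh` is the standard NumPy routine for eigenvalues of a symmetric matrix.

```python
import numpy as np


def is_positive_semidefinite(matrix, tol=1e-10):
    """Return True if the matrix is symmetric with no eigenvalue below -tol (scaled)."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    if not np.allclose(m, m.T):
        return False
    eigvals = np.linalg.eigvalsh(m)
    # Scale the tolerance by the largest eigenvalue magnitude.
    scale = max(1.0, float(np.abs(eigvals).max()))
    return bool(eigvals.min() >= -tol * scale)


def count_zero_eigenvalues(matrix, tol=1e-10):
    """Count (near-)zero eigenvalues, which hint at redundant or linearly dependent data."""
    eigvals = np.linalg.eigvalsh(np.asarray(matrix, dtype=float))
    scale = max(1.0, float(np.abs(eigvals).max()))
    return int(np.sum(np.abs(eigvals) <= tol * scale))


# The first two data points are identical, so the Gram matrix has one zero eigenvalue.
data = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
gram = data @ data.T
print(is_positive_semidefinite(gram), count_zero_eigenvalues(gram))
```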

If I were to go ambitious (and I shouldn't), I'd argue for a generic converter from distance to covariance (the reverse is the first equation in this file, as I recall) for the whole Matern class.

AWBroscius commented 1 year ago

I am almost done implementing the rest of the functions in Table 2, as well as the covariance distance. The code I have added covers only the similarity --> distance direction. I have added error checks in individual functions for out-of-bounds values where they seemed appropriate: correlation, membership, confusion, transition, and probability. Please feel free to let me know if there are any others I have missed. I did these at the individual level because the functions used by the converter have a variety of different bounding ranges.
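For illustration, a per-function bounds check could look like the sketch below. The helper `_check_bounds` and the function `correlation_to_distance` are hypothetical names, and the mapping $\sqrt{1 - r}$ is assumed here as the correlation conversion; the actual implementation may differ.

```python
import numpy as np


def _check_bounds(values, low, high, name):
    """Raise ValueError if any entry lies outside the closed interval [low, high]."""
    if np.any(values < low) or np.any(values > high):
        raise ValueError(f"{name} values must lie in [{low}, {high}].")


def correlation_to_distance(corr):
    """Map correlations r in [-1, 1] to distances sqrt(1 - r) (assumed mapping)."""
    corr = np.asarray(corr, dtype=float)
    _check_bounds(corr, -1.0, 1.0, "Correlation")
    return np.sqrt(1.0 - corr)


corr = np.array([[1.0, 0.0], [0.0, 1.0]])
print(correlation_to_distance(corr))
```

Keeping the check inside each function lets every mapping declare its own valid range, at the cost of some repetition.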

Additionally, the code currently assumes that the given input will be a symmetric similarity matrix. Is this what is expected? If not, what other forms of input should the converter be expected to take in?

FarnazH commented 1 year ago

@AWBroscius, if the data is given as a 2D matrix, it should be symmetric. However, it should also support a 1D array of similarity values (e.g., the upper-triangular elements of the 2D similarity matrix) or (if not too complicated) just a single similarity value.
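A minimal sketch of input handling along these lines is shown below, assuming the conversion itself is element-wise so that scalars, 1D arrays, and 2D matrices can share one code path. The function name `prepare_similarity` is hypothetical, and `1 - s` is used only as a placeholder mapping.

```python
import numpy as np


def prepare_similarity(x):
    """Validate the input and return it as a float array (0-D, 1-D, or symmetric 2-D)."""
    arr = np.asarray(x, dtype=float)
    if arr.ndim > 2:
        raise ValueError("Expected a scalar, a 1D array, or a 2D matrix.")
    if arr.ndim == 2:
        # Only 2D input carries a symmetry requirement.
        if arr.shape[0] != arr.shape[1] or not np.allclose(arr, arr.T):
            raise ValueError("A 2D similarity matrix must be square and symmetric.")
    return arr


# The same element-wise mapping then works for all three input forms.
for data in (0.3, [0.9, 0.4, 0.7], [[1.0, 0.9], [0.9, 1.0]]):
    sim = prepare_similarity(data)
    dist = 1.0 - sim  # placeholder mapping
    print(dist.shape)
```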

FanwangM commented 1 year ago

This issue has been addressed and I am closing it now.

theochem / Selector

[Converter module] Conversion between similarity and dissimilarity/distance #123