rformassspectrometry / MsCoreUtils

Core Utils for Mass Spectrometry Data
https://rformassspectrometry.github.io/MsCoreUtils/
16 stars 11 forks source link

Implement additional distance metrices #30

Open sgibb opened 4 years ago

sgibb commented 4 years ago

See also #29

See: https://doi.org/10.1016/1044-0305(94)87009-8 for Euclidean/Absolute value distance

@tnaake , @tobiasko any other needed?

tobiasko commented 4 years ago

Normalized spectral angle

Maybe we should ask somebody from ProteomeTools what is used in practice?

michaelwitting commented 4 years ago

What about having something similar to the functions used by GNPS? https://ccms-ucsd.github.io/GNPSDocumentation/massspecbackground/networkingtheory/ Could be nice for the metabolomics community.

jorainer commented 4 years ago

Nice idea @michaelwitting. Is there a publication, source code, reference implementation available?

michaelwitting commented 4 years ago

I'm not aware of any implementation in R yet.

jmbadia commented 3 years ago

We use cosine similarity as a measure to compare spectra in LC-MS/MS. Usually I apply the cosine() function from lsa package but now I have decided to use compareSpectra() with the following function

.weightxy <- function(x, y, m = 0, n = 0.5) {
    x ^ m * y ^ n
    }
cosSim <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
 wx <- .weightxy(x[, 1L], x[, 2L], m, n)
 wy <- .weightxy(y[, 1L], y[, 2L], m, n)
sum(wx^2L*wy^2L, na.rm = na.rm)/(sqrt(sum(wx^4L, na.rm = na.rm))*sqrt(sum(wy^4L, na.rm = na.rm)))
}

compareSpectra(data[1], data[2], FUN=cosSim)

I read in your pull that cosine similarity measure corresponds to the standard normalized dotproduct and that you are applying the other definition (literature definition). So I have modified the code you are using to implement the cosine similarity measure. It seems that works (although I want to check it better),

jorainer commented 3 years ago

Great @jmbadia ! - feel free to make a pull request adding your function to MsCoreUtils ("distance.R" file) - maybe calling it cosine (or ncosine if it's using the weighting function @sgibb ?). If so, please include also unit tests comparing against results you would get from a reference implementation.

jmbadia commented 3 years ago

Ok @jorainer. Happy to help.

jmbadia commented 3 years ago

@jorainer @sgibb, let me share what I have (related also with this pull),

A) Stein and Scott dot product is actually the conventional dot product (cosine) squared. Stein and Scott 1994 based their dot product measure on a technical manual of a commercial software (The Finnigan Library Search Program, 1978). Finnigan manual used a variable called purity= 1000 x dotproduct² and I guess Stein and Scott took the formula developed without being aware of the square a

B) Best spectra comparison algorithms in Stein and Scott 1994 are dotproduct algorithm (squared or not) with m>1 and n~0.5 as optimal exponents. The point is that intensities must be normalized with the mass range somehow (otherwise same n values give different relative weights) and I assume Stein et al did it as Finnegan's manual tells ("[...] since the library unknown always has a base peak intensity of 1000", i.e. base peak of every spectra = 1000). This normalization is not related with the so called optional global/local normalization, described in the manual as a useful tool to "corrects for the differences that may exist between two spectra of the same compound acquired in different ways" .

So, what now?. I guess ...? A) there is no point in adding a new ncos() function (it is √(ndotproduct()) B) ndotproduct() should be modified C) intensities mus be normalized (base peak = 1000) in order to fit the default m and n values

Sorry if I am too direct here but my wife is waiting me for dinner...

jorainer commented 3 years ago

Thanks for the comprehensive description @jmbadia !

A) there is no point in adding a new ncos() function (it is √(ndotproduct())

I agree. But then we should mention this in the documentation.

Regarding B) and C), whether or not intensity values of a spectrum are normalized to values relative to the base peak intensity (whether that is set to 100 or 1000) seems to have no influence on the ndotproduct result:

a <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.99, 177.99, 195.02, 241.03),
    intensity = c(15800, 110400, 58100, 117900, 11100, 7700, 15300, 64400))
a_100 <- a
a_100[, 2] <- a[, 2] / max(a[, 2]) * 100
a_1000 <- a
a_1000[, 2] <- a[, 2] / max(a[, 2]) * 1000

b_100 <- cbind(
    mz = c(74.03, 120.01, 122.02, 151.98, 153.88, 177.99, 195.02, 241.03),
    intensity = c(3.47, 9.63, 8.36, 51.12, 7.75, 3.24, 10.15, 100))
b_1000 <- b_100
b_1000[, 2] <- b_100[, 2] / max(b_100[, 2]) * 1000

library(MsCoreUtils)

## One normalized against one not normalized.
ndotproduct(x = a, y = b_100)
[1] 0.7838264

## Both normalized to 100
ndotproduct(x = a_100, y = b_100)
[1] 0.7838264

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000)
[1] 0.7838264

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000)
[1] 0.7838264

## Repeat by changing m and n
ndotproduct(x = a, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 100
ndotproduct(x = a_100, y = b_100, n = 0.9, m = 0.7)
[1] 0.5769457

## Both normalized to 1000
ndotproduct(x = a_1000, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

## One normalized to 100, one to 1000
ndotproduct(x = a_100, y = b_1000, n = 0.9, m = 0.7)
[1] 0.5769457

Seems that from a mathematical standpoint the similarity calculation is independent on the scale of the intensity values, so I think there is no need to modify ndotproduct (please correct me if I did something completely wrong in my comparison above).

jmbadia commented 3 years ago

Hi @jorainer .

jorainer commented 3 years ago

Clearly...

I have to thank you for rising this potential problem. This made me investigate and now I'm confident that it is working. Things like that increase the trust in the package's functionality!

On B)

We used the definition from Stein and cite their paper - thus I would like to keep the dotproduct function as described there. Yes, please add a comment in the documentation. Thanks!