rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
34 stars 24 forks source link

GNPS-similarity measure #171

Closed jorainer closed 3 years ago

jorainer commented 3 years ago

GNPS considers also the precursor m/z of the spectra in the peak matching. Thus, to allow that, we would need to pass the precursor m/z of the spectra to the join function in compareSpectra.

jorainer commented 3 years ago

Passing additional parameters xPrecursorMz and yPrecursorMz by the .compare_spectra_chunk_prec function to the .peak_compare function has no impact on performance (even not for MsBackendMassbankSql that retrieves data from the database).

jorainer commented 3 years ago

GNPS uses a different way to match peaks between spectra. In addition to the direct matching peaks the algorithm considers peaks matching if their m/z difference is the same as the difference in the spectras' precursor m/z (please correct if I got it wrong @michaelwitting).

The first implementation is from @michaelwitting based on the R code from this publication

join_gnps <- function(x, y, xPrecursorMz, yPrecursorMz, tolerance = 0, ppm = 0) {

  # calculate differences of precursor
  precursorDiff <- yPrecursorMz - xPrecursorMz

  # create alignment matrix
  alignment <- data.frame(matrix(ncol = 4))
  colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")

  h <- 1

  for (m in 1:nrow(x)){

    mz.diff1 <- abs(x[m,1] - y[,1])
    mz.diff2 <- abs(x[m,1] - y[,1] + precursorDiff)

    if(min(mz.diff1) <= tolerance){

      alignment[h,1:2] <- x[m,1:2]
      alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
      alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]

      h <- h + 1

    }

    if(precursorDiff != 0) {

      if(min(mz.diff2) <= tolerance){

        alignment[h,1:2] <- x[m,1:2]
        alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
        alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]

        h <- h + 1

      }
    }
  }

  alignment <- alignment[complete.cases(alignment),]

  alignment_index <- matrix(c(match(alignment$mz.x, x), match(alignment$mz.y, y)), ncol=2)
  colnames(alignment_index) <- c("x", "y")

  alignment_index

}

A modified version of this that uses the MsCoreUtils::join function is:

joinPeaksGnps <- function(x, y, xPrecursorMz = NA_real_,
                          yPrecursorMz = NA_real_, tolerance = 0,
                          ppm = 0, type = "left") {
    pdiff <- yPrecursorMz - xPrecursorMz
    map <- join(x[, 1L], y[, 1L], tolerance = tolerance, ppm = ppm,
                type = type, .check = FALSE)
    if (is.finite(pdiff) && pdiff != 0) {
        pmap <- join(x[, 1L] + pdiff, y[, 1L], tolerance = tolerance,
                     ppm = ppm, type = type, .check = FALSE)
        ## Keep only matches here
        nona <- !(is.na(pmap[[1L]]) | is.na(pmap[[2L]]))
        if (any(nona)) {
            map[[1L]] <- c(map[[1L]], pmap[[1L]][nona])
            map[[2L]] <- c(map[[2L]], pmap[[2L]][nona])
            idx <- order(map[[1L]])
            map[[1L]] <- map[[1L]][idx]
            map[[2L]] <- map[[2L]][idx]
        }
    }
    map
}

Both give the same results:

x <- matrix(c(10, 36, 63, 91, 93,
              14, 15, 999, 650, 1),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

xPrecursorMz <- 91

y <- matrix(c(10, 12, 50, 63, 105,
              35, 5, 16, 999, 450),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))

yPrecursorMz <- 105

library(MsCoreUtils)
library(testthat)
library(microbenchmark)
## get first indices of matching m/z
res_a <- join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- do.call(cbind, res_b)
res_b <- res_b[!(is.na(res_b[, 1]) | is.na(res_b[, 2])), ]
expect_equal(res_a, res_b)
[1] TRUE

Note that for the above example join would only consider 2 peaks matching between the two spectra, if the difference in precursor m/z is considered 4 peaks are matching.

The performance of the join-based implementation is better:

> microbenchmark(join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm),
+ joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm))
Unit: microseconds
                                                            expr     min
     join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 547.633
 joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 113.317
       lq     mean   median      uq     max neval cld
 575.3675 603.8170 596.9870 610.123 820.809   100   b
 125.5170 145.9055 143.4905 148.039 402.905   100  a 
michaelwitting commented 3 years ago

Great. Did you already test with multiple matches? E.g. one fragment in x have a direct match and a mass-shifted match in y?

jorainer commented 3 years ago

Multi-matches:

a <- cbind(mz = c(10, 26, 36, 63, 91, 93),
           intensity = c(14, 10, 15, 999, 650, 1))
a_pmz <- 91

b <- cbind(mz = c(10, 12, 24, 50, 63, 105),
           intensity = c(35, 10, 5, 16, 999, 450))
b_pmz <- 105

res <- joinPeaksGnps(a, b, xPrecursorMz = a_pmz, yPrecursorMz = b_pmz)

res
$x
      mz intensity
 [1,] 10        14
 [2,] 10        14
 [3,] 26        10
 [4,] 36        15
 [5,] 36        15
 [6,] 63       999
 [7,] 91       650
 [8,] 91       650
 [9,] 93         1

$y
       mz intensity
 [1,]  10        35
 [2,]  24         5
 [3,]  NA        NA
 [4,]  NA        NA
 [5,]  50        16
 [6,]  63       999
 [7,]  NA        NA
 [8,] 105       450
 [9,]  NA        NA

so, the first peak in a (m/z 10) matches peak 1 in b directly and peak 2 based on the precursor difference. Should be correct like that, right?

jorainer commented 3 years ago

The GNPS score implementation in R from this publication is:

GNPS.score <- function(x, x.pre, y, y.pre, mz.tol){
  dm <- y.pre - x.pre
  #square root transforms
  x[,2] <- sqrt(x[,2])
  y[,2] <- sqrt(y[,2])
  #scale intensity values
  x[,2] <- x[,2]/sqrt(sum(x[,2]^2))
  y[,2] <- y[,2]/sqrt(sum(y[,2]^2))
  # create alignment matrix
  alignment <- data.frame(matrix(ncol = 4))
  colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
  h <- 1
  for (m in 1:nrow(x)){
    mz.diff1 <- abs(x[m,1] - y[,1])
    mz.diff2 <- abs(x[m,1] - y[,1] + dm)
    if(min(mz.diff1) <= mz.tol){
      alignment[h,1:2] <- x[m,1:2]
      alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
      alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
      h <- h + 1
    }
    if(dm != 0){
      if(min(mz.diff2) <= mz.tol){
        alignment[h,1:2] <- x[m,1:2]
        alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
        alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
        h <- h + 1
      }
    }
  }
  alignment <- alignment[complete.cases(alignment),]

  #create a matrix for selecting the highest scoring subset of matching peaks using 'clue' library
  #each peak is matched to at most one peak in another spectrum
  if(nrow(alignment)==0){score <- 0 }
  if(nrow(alignment)>0){
    mzfrag.x <- unique(alignment$mz.x)
    mzfrag.y <- unique(alignment$mz.y)
    max.length <- max(length(mzfrag.x),length(mzfrag.y))
    matrix <- matrix(0, nrow = max.length, ncol = max.length) #matrix nrow=ncol

    ## Why the hell this code below???
    #name the matrix rows and cols with mz
    if(length(mzfrag.x)<=length(mzfrag.y)){
      rownames(matrix) <- c(mzfrag.x, rep(0,(max.length-length(mzfrag.x))))
      colnames(matrix) <- mzfrag.y
    }
    if(length(mzfrag.x)>length(mzfrag.y)){
      rownames(matrix) <- mzfrag.x
      colnames(matrix) <- c(mzfrag.y,rep(0,(max.length-length(mzfrag.y))))
    }

    #fill the matrix with intensity product
    for(h in 1:nrow(alignment)){
      matrix[mzfrag.x==alignment[h,1],mzfrag.y==alignment[h,4]] <- alignment[h,2] * alignment[h,3]
    }
    if(length(mzfrag.x)<length(mzfrag.y)){matrix[(length(mzfrag.x)+1):max.length,]<-0}
    if(length(mzfrag.x)>length(mzfrag.y)){matrix[,(length(mzfrag.y)+1):max.length]<-0}
    #LSAP problem in 'clue' library
    optimal <- solve_LSAP(matrix, maximum = TRUE)
    #calculate GNPS score
    AB <- 0
    for (n in 1:max.length){AB <- AB + matrix[n,optimal[n]] }
    score <- as.numeric(AB)
  }
  return(score)
}

We've split this up into a mapping function joinPeaksGnps and two versions of the score function, one that uses the clue package and one that uses matrixStats:

gnps2 <- function(x, y) {
    if (nrow(x) != nrow(y))
        stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
             "having the same number of rows).")
    ## Scale intensities; !duplicated because we can have duplicated matches.
    x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
    y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
    ## is 0 if only NAs in input - avoids division through 0
    if (x_sum == 0 || y_sum == 0)
        return(0)
    ## Keep only matches.
    keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
    l <- length(keep)
    if (!l)
        return(0)
    x <- x[keep, , drop = FALSE]
    y <- y[keep, , drop = FALSE]
    scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)

    x_idx <- as.integer(factor(x[, 1]))
    y_idx <- as.integer(factor(y[, 1]))
    score_mat <- matrix(0, nrow = l, ncol = l)
    seq_l <- seq_len(l)
    for (i in seq_l) {
        score_mat[x_idx[i], y_idx[i]] <- scores[i]
    }
    best <- clue::solve_LSAP(score_mat, maximum = TRUE)
    sum(score_mat[cbind(seq_l, best)], na.rm = TRUE)
}

gnps3 <- function(x, y) {
    if (nrow(x) != nrow(y))
        stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
             "having the same number of rows).")
    ## Scale intensities; !duplicated because we can have duplicated matches.
    x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
    y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
    ## is 0 if only NAs in input - avoids division through 0
    if (x_sum == 0 || y_sum == 0)
        return(0)
    ## Keep only matches.
    keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
    l <- length(keep)
    if (!l)
        return(0)
    x <- x[keep, , drop = FALSE]
    y <- y[keep, , drop = FALSE]
    scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)

    fx <- factor(x[, 1])
    fy <- factor(y[, 1])
    if (length(levels(fy)) > length(levels(fx))) {
        row_idx <- as.integer(fx)
        col_idx <- as.integer(fy)
    } else {
        row_idx <- as.integer(fy)
        col_idx <- as.integer(fx)
    }
    score_mat <- matrix(0, nrow = l, ncol = l)
    seq_l <- seq_len(l)
    for (i in seq_l) {
        score_mat[row_idx[i], col_idx[i]] <- scores[i]
    }
    sum(matrixStats::rowMaxs(score_mat, na.rm = TRUE), na.rm = TRUE)
}

And the tests on these implementations: first with two spectra with some multi matching peaks:

x <- matrix(c(10, 26, 36, 63, 91, 93,
              14, 10, 15, 999, 650, 1),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91

y <- matrix(c(10, 24, 12, 50, 63, 105,
              35, 10, 5, 16, 999, 450),
            ncol = 2,
            dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105

map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
      mz intensity
 [1,] 10        14
 [2,] 10        14
 [3,] 26        10
 [4,] 36        15
 [5,] 36        15
 [6,] 63       999
 [7,] 91       650
 [8,] 91       650
 [9,] 93         1
[10,] NA        NA
[11,] NA        NA
[12,] NA        NA
[13,] NA        NA

$y
       mz intensity
 [1,]  10        35
 [2,]  24        10
 [3,]  NA        NA
 [4,]  NA        NA
 [5,]  50        16
 [6,]  63       999
 [7,]  NA        NA
 [8,] 105       450
 [9,]  NA        NA
[10,]  24        10
[11,]  12         5
[12,]  50        16
[13,] 105       450

GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
gnps2(map$x, map$y)
gnps3(map$x, map$y)
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373

Results are the same, also when we reverse the order y->x:

> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373

And finally with a more complicated case in which peaks from x match multiple in y and vice versa:

x <- cbind(mz = c(10, 12, 14, 16, 19),
           intensity = 1:5)
xPrecursorMz <- 4
y <- cbind(mz = c(10, 14, 16, 17, 19, 22),
           intensity = 1:6)
yPrecursorMz <- 8
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
      mz intensity
 [1,] 10         1
 [2,] 10         1
 [3,] 12         2
 [4,] 12         2
 [5,] 14         3
 [6,] 16         4
 [7,] 19         5
 [8,] NA        NA
 [9,] NA        NA

$y
      mz intensity
 [1,] 10         1
 [2,] 14         2
 [3,] NA        NA
 [4,] 16         3
 [5,] 14         2
 [6,] 16         3
 [7,] 19         5
 [8,] 17         4
 [9,] 22         6

> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548

## and reverse:
> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548

So, seems that the implementations are correct (I would prefer the matrixStats version which is also slightly faster). But it would be good to test this also on other test data maybe also independent of the publication from which we have the R implementation. @michaelwitting do you have some maybe?

michaelwitting commented 3 years ago

I will check with my PhD student on some test data from GNPS. If nothing is available, we will generate some reference data.

jorainer commented 3 years ago

Note that I stumbled also over a maybe stupid inherent bug in my code: the similarity score will always be > 0 if both spectra contain also a peak representing the precursor m/z. I have to cross check if this makes sense but somehow I find it more intuitive that the match of the precursors between the two spectra would not be considered matching.

michaelwitting commented 3 years ago

Indeed, that is an issue. I need to check the original GNPS paper on this.

jorainer commented 3 years ago

For the record: I changed to the implementation with clue::solve_LSAP because the results from the other implementation did not match with the results from the reference GNPS.

jorainer commented 3 years ago

This is now added.