Closed jorainer closed 3 years ago
Passing additional parameters xPrecursorMz
and yPrecursorMz
by the .compare_spectra_chunk_prec
function to the .peak_compare
function has no impact on performance (even not for MsBackendMassbankSql
that retrieves data from the database).
GNPS uses a different way to match peaks between spectra. In addition to the direct matching peaks the algorithm considers peaks matching if their m/z difference is the same as the difference in the spectras' precursor m/z (please correct if I got it wrong @michaelwitting).
The first implementation is from @michaelwitting based on the R code from this publication
join_gnps <- function(x, y, xPrecursorMz, yPrecursorMz, tolerance = 0, ppm = 0) {
# calculate differences of precursor
precursorDiff <- yPrecursorMz - xPrecursorMz
# create alignment matrix
alignment <- data.frame(matrix(ncol = 4))
colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
h <- 1
for (m in 1:nrow(x)){
mz.diff1 <- abs(x[m,1] - y[,1])
mz.diff2 <- abs(x[m,1] - y[,1] + precursorDiff)
if(min(mz.diff1) <= tolerance){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
h <- h + 1
}
if(precursorDiff != 0) {
if(min(mz.diff2) <= tolerance){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
h <- h + 1
}
}
}
alignment <- alignment[complete.cases(alignment),]
alignment_index <- matrix(c(match(alignment$mz.x, x), match(alignment$mz.y, y)), ncol=2)
colnames(alignment_index) <- c("x", "y")
alignment_index
}
A modified version of this that uses the MsCoreUtils::join
function is:
joinPeaksGnps <- function(x, y, xPrecursorMz = NA_real_,
yPrecursorMz = NA_real_, tolerance = 0,
ppm = 0, type = "left") {
pdiff <- yPrecursorMz - xPrecursorMz
map <- join(x[, 1L], y[, 1L], tolerance = tolerance, ppm = ppm,
type = type, .check = FALSE)
if (is.finite(pdiff) && pdiff != 0) {
pmap <- join(x[, 1L] + pdiff, y[, 1L], tolerance = tolerance,
ppm = ppm, type = type, .check = FALSE)
## Keep only matches here
nona <- !(is.na(pmap[[1L]]) | is.na(pmap[[2L]]))
if (any(nona)) {
map[[1L]] <- c(map[[1L]], pmap[[1L]][nona])
map[[2L]] <- c(map[[2L]], pmap[[2L]][nona])
idx <- order(map[[1L]])
map[[1L]] <- map[[1L]][idx]
map[[2L]] <- map[[2L]][idx]
}
}
map
}
Both give the same results:
x <- matrix(c(10, 36, 63, 91, 93,
14, 15, 999, 650, 1),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91
y <- matrix(c(10, 12, 50, 63, 105,
35, 5, 16, 999, 450),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105
library(MsCoreUtils)
library(testthat)
library(microbenchmark)
## get first indices of matching m/z
res_a <- join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm)
res_b <- do.call(cbind, res_b)
res_b <- res_b[!(is.na(res_b[, 1]) | is.na(res_b[, 2])), ]
expect_equal(res_a, res_b)
[1] TRUE
Note that for the above example join
would only consider 2 peaks matching between the two spectra, if the difference in precursor m/z is considered 4 peaks are matching.
The performance of the join
-based implementation is better:
> microbenchmark(join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm),
+ joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm))
Unit: microseconds
expr min
join_gnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 547.633
joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, tolerance, ppm) 113.317
lq mean median uq max neval cld
575.3675 603.8170 596.9870 610.123 820.809 100 b
125.5170 145.9055 143.4905 148.039 402.905 100 a
Great. Did you already test with multiple matches? E.g. one fragment in x have a direct match and a mass-shifted match in y?
Multi-matches:
a <- cbind(mz = c(10, 26, 36, 63, 91, 93),
intensity = c(14, 10, 15, 999, 650, 1))
a_pmz <- 91
b <- cbind(mz = c(10, 12, 24, 50, 63, 105),
intensity = c(35, 10, 5, 16, 999, 450))
b_pmz <- 105
res <- joinPeaksGnps(a, b, xPrecursorMz = a_pmz, yPrecursorMz = b_pmz)
res
$x
mz intensity
[1,] 10 14
[2,] 10 14
[3,] 26 10
[4,] 36 15
[5,] 36 15
[6,] 63 999
[7,] 91 650
[8,] 91 650
[9,] 93 1
$y
mz intensity
[1,] 10 35
[2,] 24 5
[3,] NA NA
[4,] NA NA
[5,] 50 16
[6,] 63 999
[7,] NA NA
[8,] 105 450
[9,] NA NA
so, the first peak in a
(m/z 10) matches peak 1 in b
directly and peak 2 based on the precursor difference. Should be correct like that, right?
The GNPS score implementation in R from this publication is:
GNPS.score <- function(x, x.pre, y, y.pre, mz.tol){
dm <- y.pre - x.pre
#square root transforms
x[,2] <- sqrt(x[,2])
y[,2] <- sqrt(y[,2])
#scale intensity values
x[,2] <- x[,2]/sqrt(sum(x[,2]^2))
y[,2] <- y[,2]/sqrt(sum(y[,2]^2))
# create alignment matrix
alignment <- data.frame(matrix(ncol = 4))
colnames(alignment) <- c("mz.x", "int.x", "int.y","mz.y")
h <- 1
for (m in 1:nrow(x)){
mz.diff1 <- abs(x[m,1] - y[,1])
mz.diff2 <- abs(x[m,1] - y[,1] + dm)
if(min(mz.diff1) <= mz.tol){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff1 == min(mz.diff1),2][1]
alignment[h,4] <- y[mz.diff1 == min(mz.diff1),1][1]
h <- h + 1
}
if(dm != 0){
if(min(mz.diff2) <= mz.tol){
alignment[h,1:2] <- x[m,1:2]
alignment[h,3] <- y[mz.diff2 == min(mz.diff2),2][1]
alignment[h,4] <- y[mz.diff2 == min(mz.diff2),1][1]
h <- h + 1
}
}
}
alignment <- alignment[complete.cases(alignment),]
#create a matrix for selecting the highest scoring subset of matching peaks using 'clue' library
#each peak is matched to at most one peak in another spectrum
if(nrow(alignment)==0){score <- 0 }
if(nrow(alignment)>0){
mzfrag.x <- unique(alignment$mz.x)
mzfrag.y <- unique(alignment$mz.y)
max.length <- max(length(mzfrag.x),length(mzfrag.y))
matrix <- matrix(0, nrow = max.length, ncol = max.length) #matrix nrow=ncol
## Why the hell this code below???
#name the matrix rows and cols with mz
if(length(mzfrag.x)<=length(mzfrag.y)){
rownames(matrix) <- c(mzfrag.x, rep(0,(max.length-length(mzfrag.x))))
colnames(matrix) <- mzfrag.y
}
if(length(mzfrag.x)>length(mzfrag.y)){
rownames(matrix) <- mzfrag.x
colnames(matrix) <- c(mzfrag.y,rep(0,(max.length-length(mzfrag.y))))
}
#fill the matrix with intensity product
for(h in 1:nrow(alignment)){
matrix[mzfrag.x==alignment[h,1],mzfrag.y==alignment[h,4]] <- alignment[h,2] * alignment[h,3]
}
if(length(mzfrag.x)<length(mzfrag.y)){matrix[(length(mzfrag.x)+1):max.length,]<-0}
if(length(mzfrag.x)>length(mzfrag.y)){matrix[,(length(mzfrag.y)+1):max.length]<-0}
#LSAP problem in 'clue' library
optimal <- solve_LSAP(matrix, maximum = TRUE)
#calculate GNPS score
AB <- 0
for (n in 1:max.length){AB <- AB + matrix[n,optimal[n]] }
score <- as.numeric(AB)
}
return(score)
}
We've split this up into a mapping function joinPeaksGnps
and two versions of the score function, one that uses the clue
package and one that uses matrixStats
:
gnps2 <- function(x, y) {
if (nrow(x) != nrow(y))
stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
"having the same number of rows).")
## Scale intensities; !duplicated because we can have duplicated matches.
x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
## is 0 if only NAs in input - avoids division through 0
if (x_sum == 0 || y_sum == 0)
return(0)
## Keep only matches.
keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
l <- length(keep)
if (!l)
return(0)
x <- x[keep, , drop = FALSE]
y <- y[keep, , drop = FALSE]
scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)
x_idx <- as.integer(factor(x[, 1]))
y_idx <- as.integer(factor(y[, 1]))
score_mat <- matrix(0, nrow = l, ncol = l)
seq_l <- seq_len(l)
for (i in seq_l) {
score_mat[x_idx[i], y_idx[i]] <- scores[i]
}
best <- clue::solve_LSAP(score_mat, maximum = TRUE)
sum(score_mat[cbind(seq_l, best)], na.rm = TRUE)
}
gnps3 <- function(x, y) {
if (nrow(x) != nrow(y))
stop("'x' and 'y' are expected to be aligned peak matrices (i.e. ",
"having the same number of rows).")
## Scale intensities; !duplicated because we can have duplicated matches.
x_sum <- sum(x[!duplicated(x[, 1]), 2], na.rm = TRUE)
y_sum <- sum(y[!duplicated(y[, 1]), 2], na.rm = TRUE)
## is 0 if only NAs in input - avoids division through 0
if (x_sum == 0 || y_sum == 0)
return(0)
## Keep only matches.
keep <- which(complete.cases(cbind(x[, 1], y[, 1])))
l <- length(keep)
if (!l)
return(0)
x <- x[keep, , drop = FALSE]
y <- y[keep, , drop = FALSE]
scores <- sqrt(x[, 2]) / sqrt(x_sum) * sqrt(y[, 2]) / sqrt(y_sum)
fx <- factor(x[, 1])
fy <- factor(y[, 1])
if (length(levels(fy)) > length(levels(fx))) {
row_idx <- as.integer(fx)
col_idx <- as.integer(fy)
} else {
row_idx <- as.integer(fy)
col_idx <- as.integer(fx)
}
score_mat <- matrix(0, nrow = l, ncol = l)
seq_l <- seq_len(l)
for (i in seq_l) {
score_mat[row_idx[i], col_idx[i]] <- scores[i]
}
sum(matrixStats::rowMaxs(score_mat, na.rm = TRUE), na.rm = TRUE)
}
And the tests on these implementations: first with two spectra with some multi matching peaks:
x <- matrix(c(10, 26, 36, 63, 91, 93,
14, 10, 15, 999, 650, 1),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
xPrecursorMz <- 91
y <- matrix(c(10, 24, 12, 50, 63, 105,
35, 10, 5, 16, 999, 450),
ncol = 2,
dimnames = list(c(), c("mz", "intensity")))
yPrecursorMz <- 105
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
mz intensity
[1,] 10 14
[2,] 10 14
[3,] 26 10
[4,] 36 15
[5,] 36 15
[6,] 63 999
[7,] 91 650
[8,] 91 650
[9,] 93 1
[10,] NA NA
[11,] NA NA
[12,] NA NA
[13,] NA NA
$y
mz intensity
[1,] 10 35
[2,] 24 10
[3,] NA NA
[4,] NA NA
[5,] 50 16
[6,] 63 999
[7,] NA NA
[8,] 105 450
[9,] NA NA
[10,] 24 10
[11,] 12 5
[12,] 50 16
[13,] 105 450
GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
gnps2(map$x, map$y)
gnps3(map$x, map$y)
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373
Results are the same, also when we reverse the order y->x:
> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.9861373
> gnps2(map$x, map$y)
[1] 0.9861373
> gnps3(map$x, map$y)
[1] 0.9861373
And finally with a more complicated case in which peaks from x
match multiple in y
and vice versa:
x <- cbind(mz = c(10, 12, 14, 16, 19),
intensity = 1:5)
xPrecursorMz <- 4
y <- cbind(mz = c(10, 14, 16, 17, 19, 22),
intensity = 1:6)
yPrecursorMz <- 8
map <- joinPeaksGnps(x, y, xPrecursorMz, yPrecursorMz, type = "outer")
map
$x
mz intensity
[1,] 10 1
[2,] 10 1
[3,] 12 2
[4,] 12 2
[5,] 14 3
[6,] 16 4
[7,] 19 5
[8,] NA NA
[9,] NA NA
$y
mz intensity
[1,] 10 1
[2,] 14 2
[3,] NA NA
[4,] 16 3
[5,] 14 2
[6,] 16 3
[7,] 19 5
[8,] 17 4
[9,] 22 6
> GNPS.score(x, xPrecursorMz, y, yPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548
## and reverse:
> map <- joinPeaksGnps(y, x, yPrecursorMz, xPrecursorMz, type = "outer")
> GNPS.score(y, yPrecursorMz, x, xPrecursorMz, mz.tol = 0.1)
[1] 0.6712548
> gnps2(map$x, map$y)
[1] 0.6712548
> gnps3(map$x, map$y)
[1] 0.6712548
So, seems that the implementations are correct (I would prefer the matrixStats
version which is also slightly faster). But it would be good to test this also on other test data maybe also independent of the publication from which we have the R implementation. @michaelwitting do you have some maybe?
I will check with my PhD student on some test data from GNPS. If nothing is available, we will generate some reference data.
Note that I stumbled also over a maybe stupid inherent bug in my code: the similarity score will always be > 0 if both spectra contain also a peak representing the precursor m/z. I have to cross check if this makes sense but somehow I find it more intuitive that the match of the precursors between the two spectra would not be considered matching.
Indeed, that is an issue. I need to check the original GNPS paper on this.
For the record: I changed to the implementation with clue::solve_LSAP
because the results from the other implementation did not match with the results from the reference GNPS.
This is now added.
GNPS considers also the precursor m/z of the spectra in the peak matching. Thus, to allow that, we would need to pass the precursor m/z of the spectra to the join function in
compareSpectra
.