rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
34 stars 24 forks source link

Bug in containsMz() #258

Closed lgatto closed 1 year ago

lgatto commented 1 year ago

Test data

> sciex_file <- dir(system.file("sciex", package = "msdata"),
+ full.names = TRUE)
> sciex <- Spectra(sciex_file, backend = MsBackendMzR())

BUG: mz must be ordered

> containsMz(sciex, mz = c(123.7, 109.07), which = "any")
Error: BiocParallel errors
  2 remote errors, element index: 1, 2
  0 unevaluated and other errors
  first remote error:
Error in closest(x, table, tolerance = tolerance, ppm = ppm, duplicates = duplicates, : 'x' and 'table' have to be sorted non-decreasingly and must not contain NA.
> containsMz(sciex, mz = c(109.07, 123.7), which = "any")
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [ reached getOption("max.print") -- omitted 1762 entries ]

Suggested fix:

Add mz <- sort(mz) at the beginning of the function, possibly with a message indicating that the mz values get sorted.

@jorainer @sgibb - I'll send a PR. Any comments/suggestion?

lgatto commented 1 year ago

Here's the suggested addition:

    if (!identical(order(mz), seq_along(mz))) {
        message("Ordering mz values.")
        mz <- sort(mz)
    }

and how it works

> x1 <- containsMz(sciex, mz = c(123.7, 109.07), which = "any")
Ordering mz values.
> x2 <- containsMz(sciex, mz = c(109.07, 123.7), which = "any")
> identical(x1, x2)
[1] TRUE
lgatto commented 1 year ago

There currently is still a bug in #259

> containsMz(sciex, mz = c(NA, 109.07), which = "any")
Error in if (is.unsorted(mz)) mz <- sort(mz) (from Spectra.R#1568) : 
  missing value where TRUE/FALSE needed
> is.unsorted(c(NA, 109/07))
[1] NA

and event using is.unsorted(na.rm = TRUE) won't fix it, as it will fail :

> containsMz(sciex, mz = c(NA, 109.07), which = "any")
Error: BiocParallel errors
  2 remote errors, element index: 1, 2
  0 unevaluated and other errors
  first remote error:
Error in closest(x, table, tolerance = tolerance, ppm = ppm, duplicates = duplicates, : 'x' and 'table' have to be sorted non-decreasingly and must not contain NA.

Suggested implementation

Shouldn't we just get rid of the NA at the beginning? What about the following implementation, that

  1. returns a vector of NAs of all mz are missing.
  2. Otherwise, by default, sorts and drops NAs. In don't think this is an overhead, given that all the time is spent in .haz_mz() anyway.
  3. Avoids unnecessary calls to .has_mz() if mz values are duplicated.
setMethod("containsMz", "Spectra", function(object, mz = numeric(),
                                            tolerance = 0,
                                            ppm = 20, which = c("any", "all"),
                                            BPPARAM = bpparam()) {
    cond_fun <- match.fun(match.arg(which))
    if (all(is.na(mz)))
        return(rep(NA, length(object)))
    mz <- unique(sort(mz))
    if (is(BPPARAM, "SerialParam"))
        .has_mz(object, mz, tolerance = tolerance, ppm = ppm,
                condFun = cond_fun, parallel = BPPARAM)
    else {
        sp <- SerialParam()
        f <- as.factor(dataStorage(object))
        res <- .lapply(object, FUN = .has_mz, mz = mz, tolerance = tolerance,
                       condFun = cond_fun, parallel = sp, f = f,
                       BPPARAM = BPPARAM)
        unsplit(res, f = f)
    }
})
sgibb commented 1 year ago

I completely agree with your suggestion to get rid of all NA at the beginning. In general the mz argument should be a relatively small vector and so sorting should not have an mearsurable impact on the runtime.

jorainer commented 1 year ago

Two things:

lgatto commented 1 year ago
lgatto commented 1 year ago

Following on from your comment in the PR:

AFAIK we require the m/z values to be sorted

they won't necessarily be if the user provides them - see also the discussion in #260

jorainer commented 1 year ago

Sorry, I did not understand that you wanted to sort the input parameter mz. All OK with that. Also for removing the NA values first. I thought you refer to the m/z values within the Spectra object.