rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
37 stars 25 forks source link

feat: filterMzValues supports removing peaks (issue #209) #210

Closed jorainer closed 3 years ago

jorainer commented 3 years ago
jorainer commented 3 years ago

I believe this small additional functionality to be highly valuable, the only thing I'm not completely sure about is the name of the additional parameter. I had remove but changed that to keep (to avoid having parameters with a negative meaning) - still there could also be other options/names. @sgibb @lgatto any thoughts?

lgatto commented 3 years ago

Great, indeed, and I'm happy with the keep = TRUE argument.

jorainer commented 3 years ago

Thanks @lgatto for the feedback. Then I'll merge and push to BioC

cbroeckl commented 3 years ago

Stumbled on this thread while looking for what appears to be this functionality. I am using the filterMzValues function, and if i submit a vector of mz values, the resultant list of vectors in the @listData have different lengths. would keep = TRUE result in a list of vectors of exactly the same length? I have 27 retention times in this filtered Spectra object.

>                 dms2b <- filterMzValues(dms2, mzs, ppm = 20)
>                 rt <- rtime(dms2b)
>                 int <- Spectra::intensity(dms2b)
> int
NumericList of length 27
[[1]] 0 0 28 0 0 0 0 0
[[2]] 0 12 0
[[3]] 23 28 0 0 0 0 12 0 52 0
[[4]] 61 0 0 0 0 39 0 0 0 164 0 0
[[5]] 0 30 25 0 0 33 80 0 10 21 0 396 0 0 0
[[6]] 48 0 0 29 0 0 27 0 0 22 16 19 0 10 92 0 0 0 251 0 0 0
[[7]] 0 12 138 0 0 0 49 30 54 16 0 0 0 0 14 0 0 0 413 0 0 16 0 0
[[8]] 0 58 0 24 0 0 0 31 0 0 0 10
[[9]] 53 24 56 0 0 0 14
[[10]] 0 0 0 0 0 0 0 0 0
...
<17 more elements>
jorainer commented 3 years ago

The filterMzValues will always reduce the number of peaks in each spectrum. With keep = TRUE only the peaks matching the provided m/z values will be retained (all others will be dropped) and with keep = FALSE the opposite is done: all peaks matching the provided m/z values will be removed. So, we're not simply setting the intensity of the to-be-removed peaks to 0 but are actually removing them (thus you get different lengths of the intensity numeric vectors within the list that is returned by intensity, the length of the list does not change.

lgatto commented 3 years ago

Would you have expected these elements to be set to 0, @cbroeckl ? (This was indeed was was implemented in MSnbase, with an additional function to remove 0 intensity peaks.) Would there be strong arguments in favour of one or the other solutions that would fit both profile and centroided data?

cbroeckl commented 3 years ago

@jorainer and @lgatto Thanks for the reply. I realize i wasn't very clear in my intent and data input. I am trying to retried relative raw data from SONAR MS2 spectra - scanningSwath is essentially the same - basically the quad is scanning across a given mass range and each TOF push represents a mass center/isolation width. I am playing with some methods to try to process these data to generate clean MS2 spectra. One method is to isolate all the MS2 spectra which pass the precursor ion of interest at the max MS1 signal scan or scans - in this case it is about 9 MS2 spectra for each of three MS1 scan. Then, combine those spectra and with/out some filtering select the masses in that spectraand retreive EICs which can be used for additional filtering based on EIC properties.

So i was trying to use filterMzValues to retrieve EICs for (in this test) 100 mass ranges - mzs, in my example is vector of mass values, with ppm specified in the function. It is clear that i am retrieving data for all spectra, but the truncated length of each element of the 27 spectra list means I am not retrieving data for all the mass ranges. From what i can tell, there is no index returned which would be able to tell me which mass range each value corresponds to, which means i would need to perform this one mass range at a time to ensure i can index everything appropriately.

so basically what i am after is a rectangular matrix of dimensions nSpectra x nMassRanges. NA values are fine, or zero values, when there is no signal detected in that mass range. It is certainly possible that i am misusing this function - there may be another in this or one of the other packages which can retrieve a complete rectangular matrix of EIC values. This would be applicable for either profile or centroid data. Thanks to you both for all the effort.

{{EDIT}}: what would be even better is a filter method that can return a rectangular array of multiple dimensions. If i filter on rt, mz, and IMS drift, a 3d array is returned instead of a matrix. Each cell has a value (i would prefer 0 to NA, but either could work) representing the total signal within those ranges. I think this would be a valuable base-level feature within the Spectra package. I am still figuring out how Spectra interacts with XCMS and MSnbase - maybe one of those tools accepts a Spectra object (XCMS::chromatogram does not appear to).

jorainer commented 3 years ago

I'm afraid I did not completely understand it, but if this data is somewhat similar to (standard) SWATH, then there should be an additional spectra variable that you could use for filtering (I think it is called isolationWindowLowerMz and isolationWindowUpperMz or similar). That would allow you to select all spectra that are generated within a certain isolation window. So, in xcms I'm first getting all MS2 spectra for a certain isolation window and then I'm running chromatographic peak detection on that set of spectra (treating them like MS1 data) - don't know if your approach would be somewhat similar?

Note also that maybe the filterMzRange might be a more suitable function? that one keeps all peaks that are within the provided m/z range (while filterMzValues keeps only peaks matching the provided m/z value(s)).

The thing with having rectangular MS data or have 0 or NA if something is not present: I find that always very tricky in MS data, because some instruments (like Sciex) simply don't report anything (no m/z) if they don't measure a signal there. So, a rectangular set of Spectra would be very difficult to get (what I mean is that all spectra have exactly the same number of peaks and simply a 0 if no signal was measured - also because the m/z will never be identical across Spectra).

Maybe an actual meeting would be nice for me to better understand what exactly you're looking for - eventually we could then implement something if it's not already possible with current Spectra?

cbroeckl commented 3 years ago

Thanks - unfortunately i am pretty terrible at trying to explain computational steps in words.... Sorry for that.

https://rdrr.io/bioc/xcms/man/rawEIC-methods.html

Basically, i am looking for a way to do this on a Spectra object - I am using the combineSpectra tools in Spectra, but the rawEIC (or getEIC) functions only work on on-disk files through mzR. rawEIC doesn't take a Spectra object as input. It doesn't seem efficient to read the raw data once for XCMS and a second time for Spectra, but i guess that could be a workaround.

I could perform peak detection on the subset rtRange, as you and Michael Witting have applied it for Swath data: https://bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms-lcms-ms.html. However, processing time is already long using this approach. I would like to be able to use this for every detected peak (ideally), so speeding things up will be necessary. Further, scanningSwath, or SONAR, is like Swath, except there are multiple overlapping DIA windows - in the case of Waters SONAR, there are 200 isolation windows, and depending on the isolation window width, this may result in as many as 10-12 isolation windows. I have mocked this up in excel, here is screenshot:

image

in the upper block i.w.center is the center of the mass window, lmz and umz are lower and upper mz bounds. In practice there are 200 of these (under the waters implementation.

In the lower block, the left column lists a possible precursor mass, and the TRUE/FALSE matrix displays whether that mass would be transmitted in each cell. the pink/red cells would be. I am currently combining all the pink cells for a given precursor, at the apex of the chromatographic precursor peak. i.e. the three scans at the center of the MS1 peak are selected. in this example there are 8 DIA MS2 (Sonar) scans that pass the precursor, so there are a total of 24 MS2 scans. The combineSpectra function enables me to combine all of these which should be a comprehensive list of all the plausible product ions, plus alot of product ions from other precursors. getEIC will enable me to look just at the ions of interest in a targeted manner - no need to do full non-targeted peak detection. I am assuming this would be quite a bit faster than using centWave on the full non-targeted mass list.

I do not think that i can use filterMzRange - looking at the source code the mz vector (or matrix) is converted internally to range(mz) to return a length 2 vector.

As it stands, i think i see a few paths forward for what i am trying to do:

  1. ask you to modify the code return exactly one numeric value (or NA) value for each mass and each spectra retention time (and/or isolation window). I am still trying to navigate the functions - i would be happy to look at the code that actually is finding the values to see if i can contribute - i just am not sure where to find it! is it in this page somwhere? https://github.com/rformassspectrometry/Spectra/blob/master/R/Spectra.R
  2. ask for a function to rapidly convert a spectra object to an xcmsRaw or xcmsSet object
  3. read the data multiple times to enable separate processing in xcms and Spectra for the various steps. maybe this can be done instead by using the extraction tools within XCMS/MSnbase, and converting individual spectra to Spectra class objects for processing...

I hope this clarifies - maybe it doesn't. Basically, i think that all the functionality i would need is probably already built into some combination of XCMS/MSnbase/Spectra, but the object class structures are not fully compatible.

cbroeckl commented 3 years ago

I have spent quite a bit of time navigating the filterMzValues function. I must say that w0w - the organizational effort here is pretty amazing. The amount of cross referencing across packages and functions is impressive. That said, i am still somewhat at a loss for exactly what i need to do to help get this functionality built - the cross-referencing and functions calling functions has me pretty turned around. My understanding to this point:

Spectra::filterMzValues function takes as input a numeric mz vector and ppm tolerance.

  1. Sorts mz internally
  2. Uses addProcessing(.peaks_filter_mz_value). .peaks_filter_mz_value takes as input sorted mz values and ppm tolerance. Keep is set to ‘keep’, which his TRUE by default. If ‘keep’ is TRUE, keep_or_remove is set to ‘select’.
  3. .peaks_filter_mz_values does the work. x is a mas spectrum. Mz is the sorted mz values from above. Keep is set to TRUE by default.
  4. The ‘closest’ function finds the index of the closest peak/peaks. mz is the sorted mz value from above, which I believe is ‘x in the closest function input. ‘table’ is the search spectrum’s mz values. Duplicates is set to ‘closest’ by default.
  5. What is returned by the closest function is dictated by the ‘duplicates’ option, which has to be one of ‘keep’, ‘closest’, or ‘remove’. These behaviors are defined by a call to one of three C functions form the MsCoreUtils package: C_closest_dup_keep, C_closest_dup_closest, C_closest_dup_remove: https://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c
  6. ‘intensity’ function is used to retrieve intensities. MSnbase::intensity - setMethod("intensity", "Spectrum", function(object) object@intensity)

I think, from what I can tell, that there are at least two functions that would require editing:

A. 'closest' function: https://github.com/rformassspectrometry/MsCoreUtils/blob/master/R/matching.R This function utilizes C code for the meaningful steps. Ultimately, it is selecting the closest mass value to the target mass, but only one value. Ideally, this would select all values within the tolerance value. I cannot write C code - while i could try to read the existing code and modify, it is likely to fail miserably. near the end of this function is the following section, with an added and edited line of R code:

  switch(duplicates[1L],
         "keep" = .Call(
           "C_closest_dup_keep",
           as.double(x), as.double(table),
           as.double(tolerance),
           as.integer(nomatch)
         ),
         "closest" = .Call(
           "C_closest_dup_closest",
           as.double(x), as.double(table),
           as.double(tolerance),
           as.integer(nomatch)
         ),
         "remove" = .Call(
           "C_closest_dup_remove",
           as.double(x), as.double(table),
           as.double(tolerance),
           as.integer(nomatch)
         ),
         "keepAll" = which(abs(x - table)/x*1000000 < ppm),                                       ######  addition
         stop("'duplicates' has to be one of \"keep\", \"closest\" ", \"keepAll\",         ######  edit
              "or \"remove\".")
  )

I did try this locally and a vector of integers is returned. This has been done in R code, of course - presumably C was used for speed?

B. .peaks_filter_mz_value function: https://github.com/rformassspectrometry/Spectra/blob/master/R/peaks-functions.R

    idx <- closest(mz, x[, "mz"], tolerance = tolerance, ppm = ppm,
                   duplicates = "keepAll", .check = FALSE)              #####  the 'duplicates' would need to be set (or settable) to 'keepAll'

This returns a vector of indices. if Mspectrum is our mass spectrum after filtering for only the mass range of interest, sum(Mspectrum[idx,"intensity"]) is what we want.

However, I am having a bit of trouble determining how the 'intensity' function works. from what i can tell:

  1. ‘intensity’ function is used to retrieve intensities. Spectra::intensity - NumericList(.peaksapply(object, FUN = function(z, ...) z[, 2], ...), compress = FALSE). so this is using the peaksapply function to apply the intensity function over all spectra.
  2. Spectra::.peaksapply is a function with arguments: .peaksapply <- function(object, FUN = NULL, ..., spectraVariables = .processingQueueVariables(object), f = dataStorage(object), BPPARAM = bpparam()). I assume the FUN here is going to be 'intensity'.
  3. .processingQueueVariables <- function(x) { if (.hasSlot(x, "processingQueueVariables")) x@processingQueueVariables else c("msLevel", "centroided")}. And now i am lost - as the spectrum object does not have a slot called 'intensity'.

so 'intensity' is calling apply functions to return data - i just can't figure out from where this data comes. I have traced this far and do not see anything actually referencing individual spectra. Any guidance is appreciated.

jorainer commented 3 years ago

@cbroeckl , first of all sorry for my very late reply - was on holidays and then this issue somehow got buried under a lot of other issues. I try to catch up.

Regarding the previous post: note that there is now the chromatogram function to extract (raw) EICs from an OnDiskMSnExp or XCMSnExp - if possible I would use that one instead of the rawEIC function. At present we did not implement any functionality into Spectra that allows to extract chromatographic data from it.

Regarding the last post: Ad A) I'm not sure if filterMzValues is what you want to use. That one is expected to keep/remove peaks that are close to one or more specified m/z values. Maybe you want to use the filterMzRange instead? that one allows to specify a lower and upper m/z value and keeps all peaks with an m/z within that range. I see however your point with keeping/removing not only the closest peak but all peaks matching the provided m/z value(s) in filterMzValues. Can you please open a new issue in Spectra requesting that? That way I don't forget it and can work on it once I find a little more time.

Ad B) yes, I know it's rather complicated ;) please use the intensity or the peaksData functions instead of digging into the internals. For these functions we guarantee that any potentially additional data processing is applied before the data is returned. The intensity function returns a list where each element is the numeric vector with the intensity values of one spectrum. peaksData on the other hand returns you a list of 2-column matrix (mz and intensity values). Calling sum on the list of intensities should give you what you want... I hope.

cbroeckl commented 3 years ago

@jorainer - thanks for the tips, and no problem on the delay, i was pulled away from this for a time too. i will try your recommended approach!