Closed jorainer closed 3 years ago
I believe this small additional functionality to be highly valuable, the only thing I'm not completely sure about is the name of the additional parameter. I had remove
but changed that to keep
(to avoid having parameters with a negative meaning) - still there could also be other options/names. @sgibb @lgatto any thoughts?
Great, indeed, and I'm happy with the keep = TRUE
argument.
Thanks @lgatto for the feedback. Then I'll merge and push to BioC
Stumbled on this thread while looking for what appears to be this functionality. I am using the filterMzValues function, and if i submit a vector of mz values, the resultant list of vectors in the @listData have different lengths. would keep = TRUE result in a list of vectors of exactly the same length? I have 27 retention times in this filtered Spectra object.
> dms2b <- filterMzValues(dms2, mzs, ppm = 20)
> rt <- rtime(dms2b)
> int <- Spectra::intensity(dms2b)
> int
NumericList of length 27
[[1]] 0 0 28 0 0 0 0 0
[[2]] 0 12 0
[[3]] 23 28 0 0 0 0 12 0 52 0
[[4]] 61 0 0 0 0 39 0 0 0 164 0 0
[[5]] 0 30 25 0 0 33 80 0 10 21 0 396 0 0 0
[[6]] 48 0 0 29 0 0 27 0 0 22 16 19 0 10 92 0 0 0 251 0 0 0
[[7]] 0 12 138 0 0 0 49 30 54 16 0 0 0 0 14 0 0 0 413 0 0 16 0 0
[[8]] 0 58 0 24 0 0 0 31 0 0 0 10
[[9]] 53 24 56 0 0 0 14
[[10]] 0 0 0 0 0 0 0 0 0
...
<17 more elements>
The filterMzValues
will always reduce the number of peaks in each spectrum. With keep = TRUE
only the peaks matching the provided m/z values will be retained (all others will be dropped) and with keep = FALSE
the opposite is done: all peaks matching the provided m/z values will be removed. So, we're not simply setting the intensity of the to-be-removed peaks to 0 but are actually removing them (thus you get different lengths of the intensity
numeric vectors within the list
that is returned by intensity
, the length of the list
does not change.
Would you have expected these elements to be set to 0, @cbroeckl ? (This was indeed was was implemented in MSnbase
, with an additional function to remove 0 intensity peaks.) Would there be strong arguments in favour of one or the other solutions that would fit both profile and centroided data?
@jorainer and @lgatto Thanks for the reply. I realize i wasn't very clear in my intent and data input. I am trying to retried relative raw data from SONAR MS2 spectra - scanningSwath is essentially the same - basically the quad is scanning across a given mass range and each TOF push represents a mass center/isolation width. I am playing with some methods to try to process these data to generate clean MS2 spectra. One method is to isolate all the MS2 spectra which pass the precursor ion of interest at the max MS1 signal scan or scans - in this case it is about 9 MS2 spectra for each of three MS1 scan. Then, combine those spectra and with/out some filtering select the masses in that spectraand retreive EICs which can be used for additional filtering based on EIC properties.
So i was trying to use filterMzValues to retrieve EICs for (in this test) 100 mass ranges - mzs, in my example is vector of mass values, with ppm specified in the function. It is clear that i am retrieving data for all spectra, but the truncated length of each element of the 27 spectra list means I am not retrieving data for all the mass ranges. From what i can tell, there is no index returned which would be able to tell me which mass range each value corresponds to, which means i would need to perform this one mass range at a time to ensure i can index everything appropriately.
so basically what i am after is a rectangular matrix of dimensions nSpectra x nMassRanges. NA values are fine, or zero values, when there is no signal detected in that mass range. It is certainly possible that i am misusing this function - there may be another in this or one of the other packages which can retrieve a complete rectangular matrix of EIC values. This would be applicable for either profile or centroid data. Thanks to you both for all the effort.
{{EDIT}}: what would be even better is a filter method that can return a rectangular array of multiple dimensions. If i filter on rt, mz, and IMS drift, a 3d array is returned instead of a matrix. Each cell has a value (i would prefer 0 to NA, but either could work) representing the total signal within those ranges. I think this would be a valuable base-level feature within the Spectra package. I am still figuring out how Spectra interacts with XCMS and MSnbase - maybe one of those tools accepts a Spectra object (XCMS::chromatogram does not appear to).
I'm afraid I did not completely understand it, but if this data is somewhat similar to (standard) SWATH, then there should be an additional spectra variable that you could use for filtering (I think it is called isolationWindowLowerMz and isolationWindowUpperMz or similar). That would allow you to select all spectra that are generated within a certain isolation window. So, in xcms
I'm first getting all MS2 spectra for a certain isolation window and then I'm running chromatographic peak detection on that set of spectra (treating them like MS1 data) - don't know if your approach would be somewhat similar?
Note also that maybe the filterMzRange
might be a more suitable function? that one keeps all peaks that are within the provided m/z range (while filterMzValues
keeps only peaks matching the provided m/z value(s)).
The thing with having rectangular MS data or have 0 or NA if something is not present: I find that always very tricky in MS data, because some instruments (like Sciex) simply don't report anything (no m/z) if they don't measure a signal there. So, a rectangular set of Spectra would be very difficult to get (what I mean is that all spectra have exactly the same number of peaks and simply a 0 if no signal was measured - also because the m/z will never be identical across Spectra).
Maybe an actual meeting would be nice for me to better understand what exactly you're looking for - eventually we could then implement something if it's not already possible with current Spectra
?
Thanks - unfortunately i am pretty terrible at trying to explain computational steps in words.... Sorry for that.
https://rdrr.io/bioc/xcms/man/rawEIC-methods.html
Basically, i am looking for a way to do this on a Spectra object - I am using the combineSpectra tools in Spectra, but the rawEIC (or getEIC) functions only work on on-disk files through mzR. rawEIC doesn't take a Spectra object as input. It doesn't seem efficient to read the raw data once for XCMS and a second time for Spectra, but i guess that could be a workaround.
I could perform peak detection on the subset rtRange, as you and Michael Witting have applied it for Swath data: https://bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms-lcms-ms.html. However, processing time is already long using this approach. I would like to be able to use this for every detected peak (ideally), so speeding things up will be necessary. Further, scanningSwath, or SONAR, is like Swath, except there are multiple overlapping DIA windows - in the case of Waters SONAR, there are 200 isolation windows, and depending on the isolation window width, this may result in as many as 10-12 isolation windows. I have mocked this up in excel, here is screenshot:
in the upper block i.w.center is the center of the mass window, lmz and umz are lower and upper mz bounds. In practice there are 200 of these (under the waters implementation.
In the lower block, the left column lists a possible precursor mass, and the TRUE/FALSE matrix displays whether that mass would be transmitted in each cell. the pink/red cells would be. I am currently combining all the pink cells for a given precursor, at the apex of the chromatographic precursor peak. i.e. the three scans at the center of the MS1 peak are selected. in this example there are 8 DIA MS2 (Sonar) scans that pass the precursor, so there are a total of 24 MS2 scans. The combineSpectra function enables me to combine all of these which should be a comprehensive list of all the plausible product ions, plus alot of product ions from other precursors. getEIC will enable me to look just at the ions of interest in a targeted manner - no need to do full non-targeted peak detection. I am assuming this would be quite a bit faster than using centWave on the full non-targeted mass list.
I do not think that i can use filterMzRange - looking at the source code the mz vector (or matrix) is converted internally to range(mz) to return a length 2 vector.
As it stands, i think i see a few paths forward for what i am trying to do:
I hope this clarifies - maybe it doesn't. Basically, i think that all the functionality i would need is probably already built into some combination of XCMS/MSnbase/Spectra, but the object class structures are not fully compatible.
I have spent quite a bit of time navigating the filterMzValues function. I must say that w0w - the organizational effort here is pretty amazing. The amount of cross referencing across packages and functions is impressive. That said, i am still somewhat at a loss for exactly what i need to do to help get this functionality built - the cross-referencing and functions calling functions has me pretty turned around. My understanding to this point:
Spectra::filterMzValues function takes as input a numeric mz vector and ppm tolerance.
I think, from what I can tell, that there are at least two functions that would require editing:
A. 'closest' function: https://github.com/rformassspectrometry/MsCoreUtils/blob/master/R/matching.R This function utilizes C code for the meaningful steps. Ultimately, it is selecting the closest mass value to the target mass, but only one value. Ideally, this would select all values within the tolerance value. I cannot write C code - while i could try to read the existing code and modify, it is likely to fail miserably. near the end of this function is the following section, with an added and edited line of R code:
switch(duplicates[1L],
"keep" = .Call(
"C_closest_dup_keep",
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
"closest" = .Call(
"C_closest_dup_closest",
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
"remove" = .Call(
"C_closest_dup_remove",
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
"keepAll" = which(abs(x - table)/x*1000000 < ppm), ###### addition
stop("'duplicates' has to be one of \"keep\", \"closest\" ", \"keepAll\", ###### edit
"or \"remove\".")
)
I did try this locally and a vector of integers is returned. This has been done in R code, of course - presumably C was used for speed?
B. .peaks_filter_mz_value function: https://github.com/rformassspectrometry/Spectra/blob/master/R/peaks-functions.R
idx <- closest(mz, x[, "mz"], tolerance = tolerance, ppm = ppm,
duplicates = "keepAll", .check = FALSE) ##### the 'duplicates' would need to be set (or settable) to 'keepAll'
This returns a vector of indices. if Mspectrum is our mass spectrum after filtering for only the mass range of interest, sum(Mspectrum[idx,"intensity"]) is what we want.
However, I am having a bit of trouble determining how the 'intensity' function works. from what i can tell:
so 'intensity' is calling apply functions to return data - i just can't figure out from where this data comes. I have traced this far and do not see anything actually referencing individual spectra. Any guidance is appreciated.
@cbroeckl , first of all sorry for my very late reply - was on holidays and then this issue somehow got buried under a lot of other issues. I try to catch up.
Regarding the previous post:
note that there is now the chromatogram
function to extract (raw) EICs from an OnDiskMSnExp
or XCMSnExp
- if possible I would use that one instead of the rawEIC
function. At present we did not implement any functionality into Spectra
that allows to extract chromatographic data from it.
Regarding the last post:
Ad A) I'm not sure if filterMzValues
is what you want to use. That one is expected to keep/remove peaks that are close to one or more specified m/z values. Maybe you want to use the filterMzRange
instead? that one allows to specify a lower and upper m/z value and keeps all peaks with an m/z within that range.
I see however your point with keeping/removing not only the closest peak but all peaks matching the provided m/z value(s) in filterMzValues
. Can you please open a new issue in Spectra
requesting that? That way I don't forget it and can work on it once I find a little more time.
Ad B) yes, I know it's rather complicated ;) please use the intensity
or the peaksData
functions instead of digging into the internals. For these functions we guarantee that any potentially additional data processing is applied before the data is returned. The intensity
function returns a list
where each element is the numeric vector with the intensity values of one spectrum. peaksData
on the other hand returns you a list
of 2-column matrix
(mz and intensity values). Calling sum
on the list of intensities should give you what you want... I hope.
@jorainer - thanks for the tips, and no problem on the delay, i was pulled away from this for a time too. i will try your recommended approach!
remove
tofilterMzValues
to allow removing instead of keeping matching peaks.