Closed jorainer closed 4 years ago
A very nice example for the use of binning in compareSpectra
from @sgibb:
x <- cbind(1:10, 1:10)
y <- cbind(9:20, 9:20)
Spectra:::.peaks_compare_intensities(x, y)
# [1] 1
xb <- bin(x[, 2], x[, 1], size=1, breaks=1:20)
yb <- bin(y[, 2], y[, 1], size=1, breaks=1:20)
xb <- cbind(xb$mids, xb$x)
yb <- cbind(yb$mids, yb$x)
Spectra:::.peaks_compare_intensities(xb, yb)
# [1] -0.552357
So, without binning the correlation is perfect because only the two overlapping peaks are compared. With binning the correlation is totally different. The binning assigns a value of 0
to each peak that is not present in one of the two. My gut feeling tells me that assuming an intensity of 0
for a peak that was not there (because absent, or because the m/z was simply outside the measurement range) is not OK, but I might be wrong here.
I think this is a very important issue that we need to solve. Would also be nice to get some feedback from e.g. @sneumann @michaelwitting @meowcat
An idea might be penalise the comparison results somehow, e.g. by the number of peaks used for it. In the example above it would be cor * common/all_peaks
= 1 * 2/20
.
That would be one option - only, the result would then no longer be e.g. a valid correlation coefficient. I wonder if there is no literature on that?
Hm; I am not completely sure what I am being asked :)
For small molecule HRMS we typically don't want to bin, but align the spectra using a ppm window for matching, as noted above. The typical match that is used in NIST MS (originally for GC, but it propagated to the LC world) and therefore everyone copies it, is the modified cosine.
This was originally written for unit mass resolution and therefore a priori "binned" spectra. As I understand, they work the same for HRMS spectra. NIST shows forward and reverse match.
For the forward match (i.e. does the spectrum match the library?), absent peaks in either spectrum get 0, i.e. the spectra are aligned first, and on both sides the missing values are 0.
For reverse match (i.e. does the library match the spectrum?), only peaks present in the library spectrum are considered, i.e. absent peaks in the query spectrum are 0, but absent peaks in the library spectrum are ignored. So it would be nice to have this option (or more generally, the option to do either left, right, inner or outer join.) As far as the motivations go, I don't think the mass range is the most important reason for this, more important to me are chimeric spectra, which cannot be detected equally easy...
GNPS uses the modified cosine too AFAIK, and also gives the option of a minimum # of matched peaks. I don't think there is reverse match as a general option.
Now, is this a good idea? This can be discussed, but it's one way to do it that is already in use... The ability of using other functions as in the MSnbase compareSpectra with fun=
is certainly desirable, to implement promising alternatives like the XRank...
This is very helpful! Thanks @meowcat !
I like also the notion with the joins, that's something I can understand (and which we can add as an argument). If I get it right then the:
x
and y
are considered, if a peak in x
is not present in y
it is added to y
with a 0
intensity and the same is done for peaks present in y
but not in x
.y
and, if one of them is missing in x
add them to x
with intensity 0
.x
and y
without adding any.We could then have an option join
to compareSpectra
that takes care of adding (or not) peaks to either x
and/or y
and then the FUN
can be applied - whatever that is.
I agree with @meowcat . Binning is okay for low-resolution, but high resolution requires alignment within a certain ppm or Da window. It would be nice to have two functions like bin_spectra
, which bins x
and y
on the same bins and align_spectra
which aligns x
against y
within a certain error range.
These functions can be then also re-used by people that want to implement their own matching functions. They can return the binned or aligned spectra that people can then use in their costum FUN
.
compareSpectra
could have an argument like align = TRUE
, then spectra are aligned, otherwise binned.
PS: Here is a function for alignment I'm using. Partially copied from OrgMassSpecR
.
https://github.com/michaelwitting/masstrixR/blob/master/R/Utils_spectrumFunctions.R
There are more open questions that might influence where what is done.
For example, the hybrid search is gaining traction (which is, to my understanding, de facto more or less augmenting the spectrum by its loss spectrum, i.e. for every mz_i, intensity_i adding an extra peak with mz = precursorMz-mz_i or 0-mz_i, intensity = intensity_i, before the match).
So if FUN
wants to do this, the join must be done in FUN
and FUN
needs to take it as a parameter, because doing the join before passing the spectra to FUN
will eliminate all potential NL matches. However Spectra
should probably provide a join operation so it isn't reimplemented in every possible function...
(Note also https://github.com/dutchjes/MSMSsim for some alignment and comparison scoring functions, extended from the OrgMassSpecR originals)
Agreed! bin_spectra
and align_spectra
should only do the alignment or binning, everything else should go to FUN
@michaelwitting I am wondering what to you mean with alignment here. I mostly worked with low-res MALDI spectra. There alignment was based on features (mostly peaks/shoulders etc) and aimed to move and stretch/shrink the whole spectrum to increase the similarity (how ever you measure that). binning was used to create identical mz values with maybe aggregated intensities.
The approach you have implemented aims for a one-to-one matching of mz values. While the closest is chosen it doesn't mean that the best match between two spectra is reached (e.g. if you have data points every 0.2 Da and your spectrum is shifted by 1 Da it won't be aligned properly after your alignment approach). That means you don't align the shape of the spectra or make them more equal but just ensure that you can compare them more easily. It is more or less a direct matching or inner join with a tolerance (or a binning with the restriction that just one data point per bin is allowed).
I don't have any experienced in aligning high-res spectra. Maybe that's not a real problem there (because I guess a shift of 1 Da should not happen, but was usual in low-res MALDI).
IMHO the definition/usage would be the following:
Or, do we talk about different things, here?
Thanks all for the valuable input!
As far as I understood @michaelwitting was talking about 3) mapping/matching - it's just hard to find the right word for that. I wouldn't call that matching or mapping either. The right word would be join two spectra or eventually collocate (not colocate)?
cor
is a really good choice...I guess a relevant question is, should compareSpectra
be more than a rudimentary shell?
Yes, matching could be the better word. However, we alway stick to alignment, somehow... In my code I'm comparing against library spectra, which ideally have a very low or no error in their m/z axis.
I would consider compareSpectra
as @meowcat stated it as shell (maybe with dot product as standard function), that can be extended.
A few functions can be delivered with the Spectra package a few others can be implemented in other packages.
I guess my question is rather just how shelly it should be, i.e. what tasks can be generalized rather than delegated to single implementations...
My idea was that compareSpectra
takes care of providing the infrastructure for the comparison-function (argument FUN
). That function can be any function that takes two numerical vectors, performs some sort of calculation on that and returns the result (a numeric(1)
). This means that compareSpectra
would have to ensure to provide the correct vectors of intensities (i.e. same length, values matched). This task could be controlled with a parameter join
that defines how values are matched ("outer"
, "left"
, "right"
, "inner"
). If Spectra
are binned beforehand it would work too.
Only thing to consider would be if there are more specific/complicated comparisons - but for these we could implement their own function. In the end we might have to implement some things anyway in C...
But I would do it step by step. Let's first agree how to match peaks. I would start to implement something, add examples, vignette etc and ask your feedback on that. Then we can add specific functions to compare spectra, and there we might anyway want to ask for your contributions @meowcat and @michaelwitting (if you want - you'll also get some of the new stickers 😉 ).
I understand, but how would you handle hybrid search in this case? (Of course, there is the option that the user generates loss spectra for this beforehand, and then uses regular search.) Another question, would graph comparison methods fall outside the scope of this function?
I think we will have to find a tradeoff between performance and flexibility. That's why I think we should start with the simple cases first, see how that works - and what is not possible. Then we can start improving from there. I find it always very difficult to plan everything into the last detail beforehand.
It would also not hurt to have different scenarios and functions. We could then compare performances, pitfalls etc. And I would be very interested to do also graph comparisons - whether with this or another function I don't know - but always happy to accept contributions :)
I will leave this here for reference, in case I (or someone else) ever get to work on it: https://alg.cs.umt.edu/media/patrick-kreitzberg-masters-thesis.pdf
Thanks @meowcat !
After some discussions with @tnaake I think we might want to have let the user decide (or define) both 1) the function to map peaks between the spectra and 2) the function to compare the intensities.
Possibilities for the mapping function are:
igraph
-based approach peaks being nodes, difference in m/z betwen peaks the edges.I opened issue #49 that explains the igraph
-based approach (suggestions and comments welcome)
Hi, as discussed with @jorainer , for what it's worth, here is a function I wrote a while back that does something along those lines and compares spectra based on m/z or ppm tolerances. It is not optimized (especially not for cases where one peak in one spectrum matches multiple peaks in the other, the handling for which I only just added). I use it to generate molecular networks in Metaboseek :), very interested to see what functions come out of this discussion!
I think we can close this issue as everything (?) was implemented. Feel free to reopen if needed.
I had some thoughts on the
compareSpectra
methdod (that also led to PR #45): I would suggest thecompareSpectra
to use peak matching with e.g.match.closest
by default before applying the method to compare the spectra (on only the matching peaks). This would work also if the data was binned before. But, at least for the application in metabolomics, binning is not an option. People rather use a certainppm
to match peaks and compare spectra.Any thoughts @lgatto and @sgibb ?