rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
35 stars 25 forks source link

compareSpectra,Spectra,Spectra #46

Closed jorainer closed 4 years ago

jorainer commented 5 years ago

I had some thoughts on the compareSpectra methdod (that also led to PR #45): I would suggest the compareSpectra to use peak matching with e.g. match.closest by default before applying the method to compare the spectra (on only the matching peaks). This would work also if the data was binned before. But, at least for the application in metabolomics, binning is not an option. People rather use a certain ppm to match peaks and compare spectra.

Any thoughts @lgatto and @sgibb ?

jorainer commented 5 years ago

A very nice example for the use of binning in compareSpectra from @sgibb:

x <- cbind(1:10, 1:10)
y <- cbind(9:20, 9:20)

Spectra:::.peaks_compare_intensities(x, y)
# [1] 1

xb <- bin(x[, 2], x[, 1], size=1, breaks=1:20)
yb <- bin(y[, 2], y[, 1], size=1, breaks=1:20)
xb <- cbind(xb$mids, xb$x)
yb <- cbind(yb$mids, yb$x)

Spectra:::.peaks_compare_intensities(xb, yb)
# [1] -0.552357

So, without binning the correlation is perfect because only the two overlapping peaks are compared. With binning the correlation is totally different. The binning assigns a value of 0 to each peak that is not present in one of the two. My gut feeling tells me that assuming an intensity of 0 for a peak that was not there (because absent, or because the m/z was simply outside the measurement range) is not OK, but I might be wrong here.

I think this is a very important issue that we need to solve. Would also be nice to get some feedback from e.g. @sneumann @michaelwitting @meowcat

sgibb commented 5 years ago

An idea might be penalise the comparison results somehow, e.g. by the number of peaks used for it. In the example above it would be cor * common/all_peaks = 1 * 2/20.

jorainer commented 5 years ago

That would be one option - only, the result would then no longer be e.g. a valid correlation coefficient. I wonder if there is no literature on that?

meowcat commented 5 years ago

Hm; I am not completely sure what I am being asked :)

For small molecule HRMS we typically don't want to bin, but align the spectra using a ppm window for matching, as noted above. The typical match that is used in NIST MS (originally for GC, but it propagated to the LC world) and therefore everyone copies it, is the modified cosine.

This was originally written for unit mass resolution and therefore a priori "binned" spectra. As I understand, they work the same for HRMS spectra. NIST shows forward and reverse match.

For the forward match (i.e. does the spectrum match the library?), absent peaks in either spectrum get 0, i.e. the spectra are aligned first, and on both sides the missing values are 0.

For reverse match (i.e. does the library match the spectrum?), only peaks present in the library spectrum are considered, i.e. absent peaks in the query spectrum are 0, but absent peaks in the library spectrum are ignored. So it would be nice to have this option (or more generally, the option to do either left, right, inner or outer join.) As far as the motivations go, I don't think the mass range is the most important reason for this, more important to me are chimeric spectra, which cannot be detected equally easy...

GNPS uses the modified cosine too AFAIK, and also gives the option of a minimum # of matched peaks. I don't think there is reverse match as a general option.

Now, is this a good idea? This can be discussed, but it's one way to do it that is already in use... The ability of using other functions as in the MSnbase compareSpectra with fun= is certainly desirable, to implement promising alternatives like the XRank...

jorainer commented 5 years ago

This is very helpful! Thanks @meowcat !

I like also the notion with the joins, that's something I can understand (and which we can add as an argument). If I get it right then the:

We could then have an option join to compareSpectra that takes care of adding (or not) peaks to either x and/or y and then the FUN can be applied - whatever that is.

michaelwitting commented 5 years ago

I agree with @meowcat . Binning is okay for low-resolution, but high resolution requires alignment within a certain ppm or Da window. It would be nice to have two functions like bin_spectra, which bins x and y on the same bins and align_spectra which aligns x against y within a certain error range. These functions can be then also re-used by people that want to implement their own matching functions. They can return the binned or aligned spectra that people can then use in their costum FUN. compareSpectra could have an argument like align = TRUE, then spectra are aligned, otherwise binned.

michaelwitting commented 5 years ago

PS: Here is a function for alignment I'm using. Partially copied from OrgMassSpecR. https://github.com/michaelwitting/masstrixR/blob/master/R/Utils_spectrumFunctions.R

meowcat commented 5 years ago

There are more open questions that might influence where what is done.

For example, the hybrid search is gaining traction (which is, to my understanding, de facto more or less augmenting the spectrum by its loss spectrum, i.e. for every mz_i, intensity_i adding an extra peak with mz = precursorMz-mz_i or 0-mz_i, intensity = intensity_i, before the match).

So if FUN wants to do this, the join must be done in FUN and FUN needs to take it as a parameter, because doing the join before passing the spectra to FUN will eliminate all potential NL matches. However Spectra should probably provide a join operation so it isn't reimplemented in every possible function...

(Note also https://github.com/dutchjes/MSMSsim for some alignment and comparison scoring functions, extended from the OrgMassSpecR originals)

michaelwitting commented 5 years ago

Agreed! bin_spectra and align_spectra should only do the alignment or binning, everything else should go to FUN

sgibb commented 5 years ago

@michaelwitting I am wondering what to you mean with alignment here. I mostly worked with low-res MALDI spectra. There alignment was based on features (mostly peaks/shoulders etc) and aimed to move and stretch/shrink the whole spectrum to increase the similarity (how ever you measure that). binning was used to create identical mz values with maybe aggregated intensities.

The approach you have implemented aims for a one-to-one matching of mz values. While the closest is chosen it doesn't mean that the best match between two spectra is reached (e.g. if you have data points every 0.2 Da and your spectrum is shifted by 1 Da it won't be aligned properly after your alignment approach). That means you don't align the shape of the spectra or make them more equal but just ensure that you can compare them more easily. It is more or less a direct matching or inner join with a tolerance (or a binning with the restriction that just one data point per bin is allowed).

I don't have any experienced in aligning high-res spectra. Maybe that's not a real problem there (because I guess a shift of 1 Da should not happen, but was usual in low-res MALDI).

IMHO the definition/usage would be the following:

  1. binning, create equally sized mz bins and aggregate the intensities, could be used for reduction of the resolution
  2. alignment, increase the similarity of spectra by rough moving/stretching/shrinking of the whole spectrum
  3. matching/mapping/joining, one-to-one assignment of mz values between spectra

Or, do we talk about different things, here?

jorainer commented 5 years ago

Thanks all for the valuable input!

As far as I understood @michaelwitting was talking about 3) mapping/matching - it's just hard to find the right word for that. I wouldn't call that matching or mapping either. The right word would be join two spectra or eventually collocate (not colocate)?

meowcat commented 5 years ago

I guess a relevant question is, should compareSpectra be more than a rudimentary shell?

michaelwitting commented 5 years ago

Yes, matching could be the better word. However, we alway stick to alignment, somehow... In my code I'm comparing against library spectra, which ideally have a very low or no error in their m/z axis.

I would consider compareSpectra as @meowcat stated it as shell (maybe with dot product as standard function), that can be extended.

A few functions can be delivered with the Spectra package a few others can be implemented in other packages.

meowcat commented 5 years ago

I guess my question is rather just how shelly it should be, i.e. what tasks can be generalized rather than delegated to single implementations...

jorainer commented 5 years ago

My idea was that compareSpectra takes care of providing the infrastructure for the comparison-function (argument FUN). That function can be any function that takes two numerical vectors, performs some sort of calculation on that and returns the result (a numeric(1)). This means that compareSpectra would have to ensure to provide the correct vectors of intensities (i.e. same length, values matched). This task could be controlled with a parameter join that defines how values are matched ("outer", "left", "right", "inner"). If Spectra are binned beforehand it would work too.

Only thing to consider would be if there are more specific/complicated comparisons - but for these we could implement their own function. In the end we might have to implement some things anyway in C...

But I would do it step by step. Let's first agree how to match peaks. I would start to implement something, add examples, vignette etc and ask your feedback on that. Then we can add specific functions to compare spectra, and there we might anyway want to ask for your contributions @meowcat and @michaelwitting (if you want - you'll also get some of the new stickers 😉 ).

meowcat commented 5 years ago

I understand, but how would you handle hybrid search in this case? (Of course, there is the option that the user generates loss spectra for this beforehand, and then uses regular search.) Another question, would graph comparison methods fall outside the scope of this function?

jorainer commented 5 years ago

I think we will have to find a tradeoff between performance and flexibility. That's why I think we should start with the simple cases first, see how that works - and what is not possible. Then we can start improving from there. I find it always very difficult to plan everything into the last detail beforehand.

It would also not hurt to have different scenarios and functions. We could then compare performances, pitfalls etc. And I would be very interested to do also graph comparisons - whether with this or another function I don't know - but always happy to accept contributions :)

meowcat commented 5 years ago

I will leave this here for reference, in case I (or someone else) ever get to work on it: https://alg.cs.umt.edu/media/patrick-kreitzberg-masters-thesis.pdf

jorainer commented 5 years ago

Thanks @meowcat !

After some discussions with @tnaake I think we might want to have let the user decide (or define) both 1) the function to map peaks between the spectra and 2) the function to compare the intensities.

Possibilities for the mapping function are:

tnaake commented 5 years ago

I opened issue #49 that explains the igraph-based approach (suggestions and comments welcome)

mjhelf commented 5 years ago

Hi, as discussed with @jorainer , for what it's worth, here is a function I wrote a while back that does something along those lines and compares spectra based on m/z or ppm tolerances. It is not optimized (especially not for cases where one peak in one spectrum matches multiple peaks in the other, the handling for which I only just added). I use it to generate molecular networks in Metaboseek :), very interested to see what functions come out of this discussion!

https://github.com/mjhelf/MassTools/blob/03e6102e0e487c13a4a9ca620adba1aa8ea20963/R/Functions_Spectra_comparison.R#L75

jorainer commented 4 years ago

I think we can close this issue as everything (?) was implemented. Feel free to reopen if needed.