lgatto / Spectrum

Spectrum Infrastructure for Mass Spectrometry Data
2 stars 1 forks source link

Peak annotations #1

Open lgatto opened 5 years ago

lgatto commented 5 years ago

We need to define how to best define peak annotations, keeping in mind that these need to work for profile and centroided spectra. The current test implementation is very loose, and only defines lists of peaks and their corresponding annotations.

I suggest start by we defining needs and use cases and move to implementation details thereafter.

Cc @jorainer @meowcat

meowcat commented 5 years ago

I wonder whether it is wise to have m/z values in peaks, since this means they need to be matched to mz by matching the floating point number. What do you think about using vector ids instead? ie.

sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1,
                                   c(2,3,4) ),
                      annotations = list("a5", "z3"))
pa 
# corresponding to 100 -> a5, (136.1, 136.3, 136.9) -> z3

It's less convenient for direct understanding, but more database-like...

Wrt use cases, I'm already wondering whether we could want more than one PeakAnnotation-type listitem per spectrum. (I will follow up with details will follow for better understanding. It can be implemented with a single item but might be more canonical and useful with multiple.)

lgatto commented 5 years ago

So rather than peak m/z, peak indices. Yes, I think that's better indeed.

meowcat commented 5 years ago

Yes and no, because it introduces another problem... it would mean we have to keep track of the peaks.

# Starting as before
sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1,
                                   c(2,3,4) ),
                      annotations = list("a5", "z3")
sp$peakAnnotation <- pa
# What happens if someone changes the peaklist?
# All pseudocode just for the concept, don't worry about datatypes etc
peaks <- sp[c("intensity", "mz")]
peaks <- peaks[peaks$mz > 120,]
sp[c("intensity", "mz")] <- peaks

Now the annotations point into nowhere, so what to do? Invalidate them?

On the other hand, if we run a "recalibration" operation on the peaks, we face the opposing problem, i.e. the peaks would remain mapped if using indices, but lose mapping if using m/z.

# using the m/z mapped annotations again
sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(100,
                                   c(136.1, 136.3, 136.9),
                      annotations = list("a5", "z3"))
sp$peakAnnotation <- pa
# Let's run a recalibration
mz <- sp$mz
mzRecal <- predict(recalibrationModel, mz)
# mzRecal is now:
# c(100.2, 136.3, 136.5, 137.1, 200.1)
sp$mz <- mzRecal

Here indices would have retained the mapping, where m/z mapping would lose it.

lgatto commented 5 years ago

The same issue holds weather we store indices or m/z - any operation that modifies the m/z values will run the risk of invalidating the peak annotations. We will have to take care of this possibility, either by updating the annotations accordingly, or, when not possible at all, by warning the user that they are potentially running a destructive operation.

Having said that, in many use cases, spectra won't be changed once they get annotations, of some (most?) annotations will become automated, so that the these issues will only plague a minority of users.

meowcat commented 5 years ago

Up to now we always talked about a single slot called peakAnnotation. Is there any reason why we want that instead of using the PeakAnnotation for possibly multiple slots?

sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
peptideAnnotation <- PeakAnnotation(peaks = list(100,
                                   c(136.1, 136.3, 136.9),
                      annotations = list("a5", "z3"))
sp$peptideAnnotation <- peptideAnnotation

centroidMassAnnotation <- PeakAnnotation(peaks = list(100, c(136.1, 136,3, 136.9), 200),
   annotations = list(100, 136.5, 200))
sp$centroidMassAnnotation <- centroidMassAnnotation
# Centroiding is not the best example, but I hope you get the idea

Further, automated annotations could be done like this:

annotatePeptides <- function(sp, threshold)
{
 # some magic
 mz <- filterThreshold(mz)
 peptides <- magicFindPeptides(mz)
 return(peptides)
}
sp$peptideAnnotations <- AutoPeakAnnotation(annotatePeptides, threshold=0.2)
lgatto commented 5 years ago

Because (1) when you want to get all annotations of all your spectra, you don't want to first look for them and (2) these annotations will live as a single spectraData PeakAnnotations variable.

meowcat commented 5 years ago

See my PR #6 for a use case as a basis for further discussion, where I provide two (IMO) useful peak annotators for small molecules.

How do we best use peak annotations? In my example, we have

pa1 <- annotateFormula(sp, ...) 
# pa1@annotation elements are list(mzCalculated, formula, dppm, dbe)
pa2 <- annotateCfmId(sp, ...) 
# pa2@annotation elements are list(mzCalculated, smiles)

What to do with multiple annotation types?

I will be happy either way.