Peak annotations - Githubissues

lgatto commented 5 years ago

We need to define how to best define peak annotations, keeping in mind that these need to work for profile and centroided spectra. The current test implementation is very loose, and only defines lists of peaks and their corresponding annotations.

I suggest start by we defining needs and use cases and move to implementation details thereafter.

Cc @jorainer @meowcat

meowcat commented 5 years ago

I wonder whether it is wise to have m/z values in peaks, since this means they need to be matched to mz by matching the floating point number. What do you think about using vector ids instead? ie.

sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1,
                                   c(2,3,4) ),
                      annotations = list("a5", "z3"))
pa 
# corresponding to 100 -> a5, (136.1, 136.3, 136.9) -> z3

It's less convenient for direct understanding, but more database-like...

Wrt use cases, I'm already wondering whether we could want more than one PeakAnnotation-type listitem per spectrum. (I will follow up with details will follow for better understanding. It can be implemented with a single item but might be more canonical and useful with multiple.)

lgatto commented 5 years ago

So rather than peak m/z, peak indices. Yes, I think that's better indeed.

meowcat commented 5 years ago

Yes and no, because it introduces another problem... it would mean we have to keep track of the peaks.

# Starting as before
sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1,
                                   c(2,3,4) ),
                      annotations = list("a5", "z3")
sp$peakAnnotation <- pa
# What happens if someone changes the peaklist?
# All pseudocode just for the concept, don't worry about datatypes etc
peaks <- sp[c("intensity", "mz")]
peaks <- peaks[peaks$mz > 120,]
sp[c("intensity", "mz")] <- peaks

Now the annotations point into nowhere, so what to do? Invalidate them?

On the other hand, if we run a "recalibration" operation on the peaks, we face the opposing problem, i.e. the peaks would remain mapped if using indices, but lose mapping if using m/z.

# using the m/z mapped annotations again
sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(100,
                                   c(136.1, 136.3, 136.9),
                      annotations = list("a5", "z3"))
sp$peakAnnotation <- pa
# Let's run a recalibration
mz <- sp$mz
mzRecal <- predict(recalibrationModel, mz)
# mzRecal is now:
# c(100.2, 136.3, 136.5, 137.1, 200.1)
sp$mz <- mzRecal

Here indices would have retained the mapping, where m/z mapping would lose it.

lgatto commented 5 years ago

The same issue holds weather we store indices or m/z - any operation that modifies the m/z values will run the risk of invalidating the peak annotations. We will have to take care of this possibility, either by updating the annotations accordingly, or, when not possible at all, by warning the user that they are potentially running a destructive operation.

Having said that, in many use cases, spectra won't be changed once they get annotations, of some (most?) annotations will become automated, so that the these issues will only plague a minority of users.

meowcat commented 5 years ago

Up to now we always talked about a single slot called peakAnnotation. Is there any reason why we want that instead of using the PeakAnnotation for possibly multiple slots?

sp <- Spectrum()
sp[c("intensity", "mz")] <- list(c(100, 136.1, 136,3, 136.9, 200), c(20, 5, 50, 5, 20))
peptideAnnotation <- PeakAnnotation(peaks = list(100,
                                   c(136.1, 136.3, 136.9),
                      annotations = list("a5", "z3"))
sp$peptideAnnotation <- peptideAnnotation

centroidMassAnnotation <- PeakAnnotation(peaks = list(100, c(136.1, 136,3, 136.9), 200),
   annotations = list(100, 136.5, 200))
sp$centroidMassAnnotation <- centroidMassAnnotation
# Centroiding is not the best example, but I hope you get the idea

Further, automated annotations could be done like this:

annotatePeptides <- function(sp, threshold)
{
 # some magic
 mz <- filterThreshold(mz)
 peptides <- magicFindPeptides(mz)
 return(peptides)
}
sp$peptideAnnotations <- AutoPeakAnnotation(annotatePeptides, threshold=0.2)

lgatto commented 5 years ago

Because (1) when you want to get all annotations of all your spectra, you don't want to first look for them and (2) these annotations will live as a single spectraData PeakAnnotations variable.

meowcat commented 5 years ago

See my PR #6 for a use case as a basis for further discussion, where I provide two (IMO) useful peak annotators for small molecules.

How do we best use peak annotations? In my example, we have

pa1 <- annotateFormula(sp, ...) 
# pa1@annotation elements are list(mzCalculated, formula, dppm, dbe)
pa2 <- annotateCfmId(sp, ...) 
# pa2@annotation elements are list(mzCalculated, smiles)

What to do with multiple annotation types?

Concatenate pa1 and pa2, keep one peakAnnotation per spectrum. Advantage: it is stored away in a single slot. Disadvantage: in a relational database backend, these fields are not directly searchable.
It would probably be useful to have at least an @annotationType slot in PeakAnnotation so the user can get peakAnnotation(sp, type="cfmId") (or whatever the interface).
I believe annotations of a single type are most useful to the users as a DataFrame (with @peaks as a list column as-is, i.e. the centroided annotation is kept as a single row). So we can provide as("DataFrame") or the user can write it himself.
Note that right now, nothing prevents the user from doing
```
sp$peakAnnotation <- pa1
sp$independentPeakAnnotation <- pa2
```
If we want to discourage this, we should offer another solution and/or forbid it.
The other option would be to get rid of the predefined peakAnnotation slot and leave it to the users to make one or multiple. If we opt for this, I would suggest making PeakAnnotation just a DataFrame with mandatory column peaks. The advantage here is that on a Spectra level all annotations of the same type can be kept in a single dataframe, or e.g. in a single table in a relational database backend. Also, dplyr-like operations on an entire set of annotations become feasible.
(Of course, PeakAnnotations of different types can also be kept in a single table, but it will be sparse. There is the further question if same-name fields from obviously different sources should be kept in the same column. A NoSQL-like document store / key-value store would be a more natural storage for mixed PeakAnnotations. Or the backend storage is segmented into tables based on the annotationType.)

I will be happy either way.

lgatto / Spectrum

Peak annotations #1