Open lgatto opened 5 years ago
I wonder whether it is wise to have m/z values in peaks, since this means they need to be matched to mz by matching the floating point number. What do you think about using vector ids instead?
i.e.
sp <- Spectrum()
# first vector holds the m/z values, second the intensities
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1, c(2, 3, 4)),
                     annotations = list("a5", "z3"))
pa
# corresponding to 100 -> a5, (136.1, 136.3, 136.9) -> z3
It's less convenient for direct understanding, but more database-like...
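To make the floating point concern concrete, here is a minimal base R sketch. `matchMz` and its `tol` argument are hypothetical names used only for illustration, not part of any existing API, and the `1e-9` perturbation is an assumed stand-in for rounding noise.

```r
# Hypothetical helper: map annotated m/z values to peak indices within a tolerance.
# Returns NA when no peak, or more than one peak, falls inside the tolerance.
matchMz <- function(query, mz, tol = 1e-6) {
  vapply(query, function(q) {
    hit <- which(abs(mz - q) <= tol)
    if (length(hit) == 1L) hit else NA_integer_
  }, integer(1))
}

mz <- c(100, 136.1, 136.3, 136.9, 200)

# Classic pitfall: exact equality on doubles is fragile
isTRUE(0.1 + 0.2 == 0.3)   # FALSE

# Annotation m/z values carrying a little rounding noise (assumed for illustration)
noisy <- mz[2:4] + 1e-9

exact <- match(noisy, mz)      # exact matching finds nothing: NA NA NA
fuzzy <- matchMz(noisy, mz)    # tolerance matching recovers indices 2 3 4
```

Storing indices avoids the tolerance question entirely, at the price of having to keep them in sync with the peak list.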
Wrt use cases, I'm already wondering whether we could want more than one PeakAnnotation-type list item per spectrum. (I will follow up with details for better understanding. It can be implemented with a single item but might be more canonical and useful with multiple.)
So rather than peak m/z, peak indices. Yes, I think that's better indeed.
Yes and no, because it introduces another problem... it would mean we have to keep track of the peaks.
# Starting as before
sp <- Spectrum()
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(1, c(2, 3, 4)),
                     annotations = list("a5", "z3"))
sp$peakAnnotation <- pa
# What happens if someone changes the peaklist?
# All pseudocode just for the concept, don't worry about datatypes etc
peaks <- sp[c("intensity", "mz")]
peaks <- peaks[peaks$mz > 120,]
sp[c("intensity", "mz")] <- peaks
Now the annotations point into nowhere, so what to do? Invalidate them?
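One option besides invalidating everything would be to remap the indices whenever peaks are removed. A rough base R sketch, using plain lists instead of the real classes; `remapPeakIndices` is a hypothetical helper, not an existing function:

```r
# Hypothetical helper: update index-based annotations after peaks are removed.
# `peaks` is a list of integer index vectors, `keep` a logical vector over the old peaks.
remapPeakIndices <- function(peaks, keep) {
  # new position of each old peak, NA if the peak was removed
  newIndex <- ifelse(keep, cumsum(keep), NA_integer_)
  lapply(peaks, function(idx) {
    mapped <- newIndex[idx]
    mapped[!is.na(mapped)]   # drop references to removed peaks
  })
}

mz <- c(100, 136.1, 136.3, 136.9, 200)
peaks <- list(1L, c(2L, 3L, 4L))
annotations <- list("a5", "z3")

# Same filter as in the pseudocode above
keep <- mz > 120
remapped <- remapPeakIndices(peaks, keep)

# Annotations left without any peak are dropped
valid <- lengths(remapped) > 0
peaks <- remapped[valid]
annotations <- annotations[valid]
# "a5" is gone, "z3" now points at peaks 1, 2 and 3 of the filtered list
```

Whether an annotation that loses only some of its peaks should survive is a design decision; this sketch simply keeps it.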
On the other hand, if we run a "recalibration" operation on the peaks, we face the opposing problem, i.e. the peaks would remain mapped if using indices, but lose mapping if using m/z.
# using the m/z mapped annotations again
sp <- Spectrum()
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))
pa <- PeakAnnotation(peaks = list(100, c(136.1, 136.3, 136.9)),
                     annotations = list("a5", "z3"))
sp$peakAnnotation <- pa
# Let's run a recalibration
mz <- sp$mz
mzRecal <- predict(recalibrationModel, mz)
# mzRecal is now:
# c(100.2, 136.3, 136.5, 137.1, 200.1)
sp$mz <- mzRecal
Here indices would have retained the mapping, whereas m/z mapping would lose it.
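If we stay with m/z-based annotations, one way to survive a recalibration would be to resolve them to indices before the change and rewrite them afterwards. A sketch in base R with plain lists; the recalibrated values are the ones from the example above, everything else is assumed for illustration:

```r
mz <- c(100, 136.1, 136.3, 136.9, 200)
peaksByMz <- list(100, c(136.1, 136.3, 136.9))

# Step 1: resolve annotated m/z values to peak indices before anything changes.
# match() is exact, which is fine here because the values were copied from mz.
peaksByIndex <- lapply(peaksByMz, match, table = mz)

# Step 2: recalibrate (a real model is assumed, values taken from the example)
mzRecal <- c(100.2, 136.3, 136.5, 137.1, 200.1)

# Step 3: rewrite the annotation m/z values from the stored indices
peaksByMzRecal <- lapply(peaksByIndex, function(idx) mzRecal[idx])

# Matching the old values against the new peak list would go wrong silently:
wrong <- match(136.3, mzRecal)   # 2, although 136.3 was originally peak 3
```

The last line shows why simply re-matching after the fact is risky: an old m/z value can coincide with a different recalibrated peak.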
The same issue holds whether we store indices or m/z: any operation that modifies the peak list (removing peaks or changing their m/z values) runs the risk of invalidating the peak annotations. We will have to take care of this possibility, either by updating the annotations accordingly or, when that is not possible at all, by warning the user that they are potentially running a destructive operation.
Having said that, in many use cases, spectra won't be changed once they get annotations, or some (most?) annotations will be automated, so that these issues will only plague a minority of users.
Up to now we always talked about a single slot called peakAnnotation. Is there any reason why we want that instead of using the PeakAnnotation for possibly multiple slots?
sp <- Spectrum()
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))
peptideAnnotation <- PeakAnnotation(peaks = list(100, c(136.1, 136.3, 136.9)),
                                    annotations = list("a5", "z3"))
sp$peptideAnnotation <- peptideAnnotation
centroidMassAnnotation <- PeakAnnotation(peaks = list(100, c(136.1, 136.3, 136.9), 200),
                                         annotations = list(100, 136.5, 200))
sp$centroidMassAnnotation <- centroidMassAnnotation
# Centroiding is not the best example, but I hope you get the idea
Further, automated annotations could be done like this:
annotatePeptides <- function(sp, threshold)
{
  # some magic (filterThreshold and magicFindPeptides are placeholders)
  mz <- filterThreshold(sp$mz, threshold)
  peptides <- magicFindPeptides(mz)
  return(peptides)
}
sp$peptideAnnotations <- AutoPeakAnnotation(annotatePeptides, threshold=0.2)
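A possible reading of AutoPeakAnnotation is "store the annotator and its arguments, evaluate on demand", so the result can never go stale. The following base R sketch is only that: the closure-based constructor and the toy annotator `annotateAboveThreshold` are assumptions for illustration, and the spectrum is a plain list.

```r
# Hypothetical constructor: store the annotator and its arguments,
# and evaluate them against the current spectrum on demand.
AutoPeakAnnotation <- function(fun, ...) {
  args <- list(...)
  function(sp) do.call(fun, c(list(sp), args))
}

# Toy annotator standing in for the "magic": indices of peaks whose
# relative intensity reaches the threshold
annotateAboveThreshold <- function(sp, threshold) {
  which(sp$intensity / max(sp$intensity) >= threshold)
}

sp <- list(mz = c(100, 136.1, 136.3, 136.9, 200),
           intensity = c(20, 5, 50, 5, 20))
auto <- AutoPeakAnnotation(annotateAboveThreshold, threshold = 0.2)

before <- auto(sp)        # peaks 1, 3 and 5
sp$intensity[2] <- 40     # the peak list changes ...
after <- auto(sp)         # ... and the annotation follows: peaks 1, 2, 3 and 5
```

The trade-off is cost: the annotator runs on every access unless results are cached and invalidated when the peaks change.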
Because (1) when you want to get all annotations of all your spectra, you don't want to first look for them and (2) these annotations will live as a single spectraData PeakAnnotations variable.
See my PR #6 for a use case as a basis for further discussion, where I provide two (IMO) useful peak annotators for small molecules.
How do we best use peak annotations? In my example, we have
pa1 <- annotateFormula(sp, ...)
# pa1@annotation elements are list(mzCalculated, formula, dppm, dbe)
pa2 <- annotateCfmId(sp, ...)
# pa2@annotation elements are list(mzCalculated, smiles)
What to do with multiple annotation types?
Combine pa1 and pa2, keep one peakAnnotation per spectrum. Advantage: it is stored away in a single slot. Disadvantage: in a relational database backend, these fields are not directly searchable.
Add an @annotationType slot in PeakAnnotation so the user can get peakAnnotation(sp, type="cfmId") (or whatever the interface).
[...] DataFrame (with @peaks as a list column as-is, i.e. the centroided annotation is kept as a single row). So we can provide as("DataFrame") or the user can write it themselves.
sp$peakAnnotation <- pa1
sp$independentPeakAnnotation <- pa2
If we want to discourage this, we should offer another solution and/or forbid it.
[...] peakAnnotation slot and leave it to the users to make one or multiple. If we opt for this, I would suggest making PeakAnnotation just a DataFrame with mandatory column peaks. The advantage here is that on a Spectra level all annotations of the same type can be kept in a single dataframe, or e.g. in a single table in a relational database backend. Also, dplyr-like operations on an entire set of annotations become feasible.
(PeakAnnotations of different types can also be kept in a single table, but it will be sparse. There is the further question of whether same-name fields from obviously different sources should be kept in the same column. A NoSQL-like document store / key-value store would be a more natural storage for mixed PeakAnnotations. Or the backend storage is segmented into tables based on the annotationType.)
I will be happy either way.
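To give the "just a DataFrame with a peaks column" idea something tangible, here is a sketch with a base R data.frame standing in for the DataFrame. The column names follow the annotation elements mentioned above; all values are made-up placeholders, not real formulas or SMILES strings.

```r
# One row per annotation; unused fields of a given type stay NA (the sparse case).
pa <- data.frame(
  annotationType = c("formula", "formula", "cfmId"),
  mzCalculated   = c(100.0, 136.5, 136.5),
  formula        = c("formulaA", "formulaB", NA),   # placeholder strings
  smiles         = c(NA, NA, "smilesC"),            # placeholder string
  stringsAsFactors = FALSE
)

# Mandatory list column: the peak indices each annotation refers to
pa$peaks <- list(1L, c(2L, 3L, 4L), c(2L, 3L, 4L))

# Rough analogue of peakAnnotation(sp, type = "cfmId")
cfm <- pa[pa$annotationType == "cfmId", ]

# Which annotation types refer to peak 3?
hitsPeak3 <- vapply(pa$peaks, function(idx) 3L %in% idx, logical(1))
typesForPeak3 <- pa[hitsPeak3, "annotationType"]   # "formula" "cfmId"
```

Segmenting by annotationType, as suggested above, would amount to `split(pa, pa$annotationType)` and dropping the all-NA columns per table.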
We need to decide how best to define peak annotations, keeping in mind that these need to work for profile and centroided spectra. The current test implementation is very loose, and only defines lists of peaks and their corresponding annotations.
I suggest we start by defining needs and use cases and move on to implementation details thereafter.
Cc @jorainer @meowcat