lgatto opened this issue 5 years ago
Yes, but it is difficult to predict when this will become computationally expensive, and when it would be easier/faster to pre-compute values (at the cost of keeping them up-to-date).
My suggestion would be to start with computing on the fly, and move to pre-computed values when we observe a performance hit.
What about allowing the user to add computed fields?
sp <- Spectrum()
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))
relativeIntensity <- function(sp, maxInt = 1000) (sp$intensity / max(sp$intensity)) * maxInt
sp$relativeIntensity <- AutoPeakVector(relativeIntensity, maxInt = 1000)
No objection to adding any user-defined fields - but a `relativeIntensity` method that calculates on the fly would be handy: imagine you want to use a different max value, etc. And, as @lgatto mentioned, the real problem is that it's tricky to know when to update the field. Some data manipulation methods like `smooth` will affect the intensities of a `Spectrum`, but not the number of peaks. So a pre-computed `peaksCount` will still be OK, but a pre-computed `tic` will be wrong, and a pre-computed `relativeIntensity` might also be invalid.
So it might be a little slower to calculate things on the fly, but the code and content are easier to maintain.
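A minimal sketch of such on-the-fly accessors, using a plain list as a stand-in for a real `Spectrum` object (the names and signatures below are illustrative, not the actual MSnbase API):

```r
# Stand-in for a Spectrum: a plain list with mz and intensity vectors.
sp <- list(mz = c(100, 136.1, 136.3, 136.9, 200),
           intensity = c(20, 5, 50, 5, 20))

# Accessors computed on demand -- nothing to invalidate after smoothing etc.
peaksCount <- function(sp) length(sp$mz)
tic <- function(sp) sum(sp$intensity)
relativeIntensity <- function(sp, maxInt = 100)
  sp$intensity / max(sp$intensity) * maxInt

peaksCount(sp)                     # 5
tic(sp)                            # 100
relativeIntensity(sp, maxInt = 1)  # a different max value needs no field update
```

Because nothing is stored, a subsequent `smooth`-style manipulation of the intensities cannot leave any of these values stale.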
I am (or I was) working on a pull request that would allow cached, auto-computed fields, i.e. fields that get computed and cached only on demand (on first access) and invalidated when the `Spectrum` changes. However, I am realizing (as I should have immediately) that this doesn't work unless we introduce an environment to hold the cache, because of R's pass-by-value semantics. It would only work with `sp <- refresh(sp)` before accessing the values, which you will probably agree is awkward.
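For what it's worth, the environment-based variant could look roughly like this (a sketch with made-up names, not the PR's actual code). Because environments are passed by reference, the cache survives R's copy-on-modify, so no `sp <- refresh(sp)` dance is needed:

```r
# Hypothetical constructor: the spectrum carries a private cache environment.
new_spectrum <- function(mz, intensity)
  list(mz = mz, intensity = intensity,
       .cache = new.env(parent = emptyenv()))

# Compute on first access, then serve the cached value.
tic_cached <- function(sp) {
  if (is.null(sp$.cache$tic))
    sp$.cache$tic <- sum(sp$intensity)
  sp$.cache$tic
}

# Every method that modifies the data must empty the cache.
invalidate <- function(sp) {
  rm(list = ls(sp$.cache), envir = sp$.cache)
  invisible(sp)
}
```

The catch is exactly the one discussed here: every data-manipulation method must remember to call the invalidation step.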
What would work is to recompute all of them on every spectrum change, but that feels like it could be too expensive if many users never need them. Also, would such a concept fit into the `Spectra` world?
For `Spectra`, or rather their backends, I create the content on the fly (e.g. https://github.com/lgatto/Spectra/blob/jomaster/R/MsBackendDataFrame.R#L73-L78) and I don't see any problem with that - getting the full spectra data (as a `DataFrame`) works in an analogous way (https://github.com/lgatto/Spectra/blob/jomaster/R/MsBackendDataFrame.R#L273-L291). Mandatory spectra variables (such as `msLevel` etc.) get created and initialized with the correct `NA` values if they are not present.
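The NA-initialization of missing mandatory variables could be sketched like this (an illustrative helper, not the actual `MsBackendDataFrame` code; the defaults shown are examples only):

```r
# Fill in mandatory spectra variables with type-correct NA values when
# they are absent from the input.
fillVariables <- function(x, defaults = list(msLevel = NA_integer_,
                                             rtime = NA_real_)) {
  for (v in names(defaults))
    if (is.null(x[[v]])) x[[v]] <- defaults[[v]]
  x
}

fillVariables(list(rtime = 12.3))  # gains msLevel = NA_integer_
```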
But there is a conceptual issue with computing (all) values on the fly. If a spectrum is loaded from a raw file, a database, or a library file format (e.g. MSP), both properties and annotations are stored there.
If we want to validate a spectrum file, for example, it would be very desirable that the values in the corresponding object correspond to the ground truth in the file. If I load a spectrum with 10 peaks but it claims `num_peaks: 9`, I can mark it as invalid; if `peaksCount` is recalculated on the fly, I will never catch this.
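For example, a validity check of this kind only works if the claimed value is kept as a stored field (a sketch; `num_peaks` here mimics an annotation read from an MSP-style file, and the function name is made up):

```r
# The file claims 9 peaks, but the data contain 3: only a *stored*
# num_peaks lets us detect the mismatch.
sp <- list(mz = c(100, 136.1, 136.3),
           intensity = c(20, 5, 50),
           num_peaks = 9)

validPeaksCount <- function(sp) sp$num_peaks == length(sp$mz)
validPeaksCount(sp)  # FALSE -> flag the spectrum as invalid
```

Recomputing `peaksCount` from `sp$mz` would, by construction, always agree with the data and silently hide the broken file.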
I guess the `Spectra` answer would be that we need a backend for the file type that we are reading, instead of checking it once we have loaded it into a `Spectra`/`Spectrum`?
Similarly, many annotations do not lose value when the spectrum is subsetted (e.g. partial structures, peptide annotations) and are non-trivial to recalculate. Some might come from R, but again, some may come from a database ground truth, and there is no obvious way to recompute them if they are lost.
So, regardless of the fact that recomputing many things on the fly is a good idea, we need to handle non-recalculated fields in some way.
Do we really need `relativeIntensity` as a field? I would rather define an accessor method for that and calculate the relative intensities on the fly. The same goes for peaks count or TIC; IMHO it is better to calculate these values on the fly because then there is no longer a need to update/change these fields whenever you do some sort of data manipulation.

Originally posted by @jorainer in https://github.com/lgatto/Spectrum/issues/3#issuecomment-485670661