lgatto / Spectrum

Spectrum Infrastructure for Mass Spectrometry Data

Reduce number of fields - calculate on-the-fly when possible #4

Open lgatto opened 5 years ago

lgatto commented 5 years ago

Do we really need relativeIntensity as a field? I would rather define an accessor method for that and calculate the relative intensities on the fly. The same goes for the peak count or tic: IMHO it is better to calculate these values on the fly, because then there is no longer any need to update/change these fields whenever you do some sort of data manipulation.

Originally posted by @jorainer in https://github.com/lgatto/Spectrum/issues/3#issuecomment-485670661
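
As a sketch, such an accessor could look like the following (hypothetical; intensity(sp) is assumed here to return the raw intensity vector, and is not necessarily the actual API):

relativeIntensity <- function(sp, maxInt = 100) {
    ## Nothing is stored on the object, so the result can never go stale.
    intensity(sp) / max(intensity(sp)) * maxInt
}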

lgatto commented 5 years ago

Yes, but it is difficult to predict when this will become computationally expensive, and when it will become easier/faster to pre-compute values (at the cost of keeping them up-to-date).

My suggestion would be to start with computing on the fly, and move to pre-computed values when we observe a performance hit.
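
For example, one could benchmark both variants on representative data before committing to pre-computation (a toy sketch using the microbenchmark package, with a plain numeric vector standing in for the intensities):

library(microbenchmark)

int <- runif(1e5)        ## toy intensity vector
tic_cached <- sum(int)   ## the pre-computed value

microbenchmark(
    on_the_fly   = sum(int),   ## recompute on every access
    pre_computed = tic_cached  ## just read the stored value
)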

meowcat commented 5 years ago

What about allowing the user to add computed fields?

sp <- Spectrum()
sp[c("mz", "intensity")] <- list(c(100, 136.1, 136.3, 136.9, 200), c(20, 5, 50, 5, 20))

relativeIntensity <- function(sp, maxInt = 1000) (sp$intensity / max(sp$intensity)) * maxInt

sp$relativeIntensity <- AutoPeakVector(relativeIntensity, maxInt = 1000)

jorainer commented 5 years ago

No objection to adding user-defined fields - but a relativeIntensity method that calculates on the fly would simply be handy: imagine you want to use a different max value, etc. And, as @lgatto mentioned, the real problem is that it's tricky to know when to update the field. Some data manipulation methods like smooth will affect the intensities of a Spectrum, but not the number of peaks. So a pre-computed peakCount will still be OK, but a pre-computed tic will be wrong, and a pre-computed relativeIntensity might also be invalid.

So it might be a little slower to calculate stuff on the fly, but the code and content are easier to maintain.
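
To illustrate the invalidation problem (with smooth(), mz() and intensity() as assumed accessors, not necessarily the actual API):

sp2 <- smooth(sp)                            ## changes intensities, not m/z
length(mz(sp2)) == length(mz(sp))            ## TRUE: a cached peakCount stays valid
sum(intensity(sp2)) == sum(intensity(sp))    ## typically FALSE: a cached tic is now stale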

meowcat commented 5 years ago

I am (or was) working on a pull request that would allow cached auto-computed fields, i.e. fields that get computed and cached only on demand (on first access), and invalidated when the Spectrum changes. However, I am realizing (as I should have immediately) that this doesn't work unless we introduce an environment to hold the cache, because of R's pass-by-value semantics. It would only work with sp <- refresh(sp) before accessing the values, which you will probably agree is awkward.
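
For reference, a minimal sketch of the environment idea (hypothetical names throughout; intensity() is an assumed accessor):

cache <- new.env(parent = emptyenv())

cachedTic <- function(sp) {
    ## Compute on first access only; the environment is modified in place,
    ## so no sp <- refresh(sp) round-trip is needed.
    if (!exists("tic", envir = cache))
        assign("tic", sum(intensity(sp)), envir = cache)
    get("tic", envir = cache)
}

## Every data manipulation method would have to call this to invalidate the cache.
invalidateCache <- function() rm(list = ls(envir = cache), envir = cache)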

What would work is to recompute all of them on every spectrum change, but that feels like it could be too expensive if many users never need them. Also, would such a concept fit into the Spectra world?

jorainer commented 5 years ago

For Spectra, or rather their backends, I create the content on the fly (e.g. https://github.com/lgatto/Spectra/blob/jomaster/R/MsBackendDataFrame.R#L73-L78) and I don't see any problem with that - getting the full spectra data (as a DataFrame) works in an analogous way (https://github.com/lgatto/Spectra/blob/jomaster/R/MsBackendDataFrame.R#L273-L291). Mandatory spectra variables (such as msLevel etc.) get created and initialized with the correct NA values if they are not present.
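
A simplified sketch of that pattern (not the actual MsBackendDataFrame code; it assumes a spectraData slot holding the backend's DataFrame):

msLevel <- function(x) {
    ## Return the stored column if present, otherwise NAs of the correct type.
    if ("msLevel" %in% colnames(x@spectraData))
        x@spectraData$msLevel
    else
        rep(NA_integer_, nrow(x@spectraData))
}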

meowcat commented 5 years ago

But there is a conceptual issue with computing (all) values on the fly. If a spectrum is loaded from a raw file, a database, or a library file format (e.g. MSP), both properties and annotations are stored there.

If we want to validate a spectrum file, for example, it would be very desirable that the values in the corresponding object correspond to the ground truth in the file. If I load a spectrum with 10 peaks but it claims num_peaks: 9, I can mark it as invalid; if peaksCount is recalculated on the fly, I will never catch this.
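
A hypothetical validation helper illustrates the point (num_peaks is the value as read from the file; none of these names are from the actual API):

validatePeakCount <- function(sp) {
    stored <- sp$num_peaks       ## value as stored in the file
    actual <- length(sp$mz)      ## ground truth from the peak data
    ok <- !is.na(stored) && stored == actual
    if (!ok)
        warning("num_peaks (", stored, ") does not match the peak data (", actual, ")")
    invisible(ok)
}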

I guess the Spectra answer would be that we need a backend for the file type that we are reading, instead of checking it once we have loaded it into a Spectra/um?

Similarly, many annotations do not lose value when the spectrum is subsetted (e.g. partial structures, peptide annotations) and are non-trivial to recalculate. Some might come from R, but again some may be from a database ground truth and there is no obvious way to recompute them if lost.

So even though it is a good idea to recompute many things on the fly, we still need to handle non-recalculable fields in some way.