Closed meowcat closed 1 year ago
That's an excellent request indeed!
I will work on a small vignette illustrating the development of a new MsBackend. The starting point (and related documentation) should be in MsBackend.R, but I guess that's not detailed enough.
👀 looking with much interest also
I started writing yesterday. I might then also ask for your feedback @meowcat and @Adafede once it's progressed a bit.
May I ask you for first feedback on the tutorial @meowcat and @Adafede ?
Please have a look at the PR #265 and add comments/notes (or eventually even clone my branch and suggest changes?). The PR contains some first descriptions and implementation notes. Would be good to know from you if you are OK with the format and structure.
Also please let me know what is unclear or where more description/details are needed!
Thanks!
I will continue adding content, but it might be good if you could already start checking the content.
Hi,
thanks for the effort already!
Some questions that pop up:
- May backendMerge also work with a different backend, or shall it only work with the same type?
- [...] backendMerge?
- [...] MsBackendDatabase() backed by an on-disk database, is backendMerge(b1, b2) supposed to append the spectra b2 into the database b1?

    b1 <- backendInitialize(MsBackendDatabase(), source = "database1.db")
    b2 <- backendInitialize(MsBackendDatabase(), source = "database2.db")
    b1sub <- b1[1:10]
    bjoin <- backendMerge(b1sub, b2)

- If backendMerge is not supposed to append into a database, then how is appending supposed to happen?

> May backendMerge also work with a different backend, or shall it only work with the same type?
It is intended to be used for backends of the same type - but in the end it's also up to the developer. You can implement your method to also support merging different backends - in the end the method should return a class extending MsBackend.
Subsetting: yes, so far it's purely filtering/subsetting (although you should also allow duplication, e.g. x[c(1, 1, 1)]). But that's not different from the base R [ method.
backendMerge should allow combining the data from different instances of a backend. It is somewhat similar to c (only that you should also allow merging of backends that provide different spectra variables).
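To illustrate the "similar to c, but with different spectra variables" point, here is a minimal base R sketch that uses plain data frames as stand-ins for backends. The helper name merge_spectra_vars and the example columns are made up for this illustration and are not part of Spectra; a real backendMerge method would work on MsBackend objects and might handle missing variables differently.

```r
# Hypothetical helper (not part of Spectra): row-bind two tables of spectra
# variables, padding variables that are missing in one table with NA.
merge_spectra_vars <- function(a, b) {
    all_vars <- union(names(a), names(b))
    for (v in setdiff(all_vars, names(a))) a[[v]] <- rep(NA, nrow(a))
    for (v in setdiff(all_vars, names(b))) b[[v]] <- rep(NA, nrow(b))
    rbind(a[all_vars], b[all_vars])
}

# Two toy "backends" providing different spectra variables.
b1 <- data.frame(msLevel = c(1L, 2L), rtime = c(1.2, 1.4))
b2 <- data.frame(msLevel = 2L, precursorMz = 304.1)

merged <- merge_spectra_vars(b1, b2)
merged
```

The result has one row per spectrum from both inputs and the union of their variables, with NA where a variable was not provided.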
backendMerge(b1, b2) with b1 and b2 being a MsBackendDatabase should ideally merge the data into a database, yes. I think this can also be up to the developer how to do that (whether it's going to be into the database from b1 or into a completely new one). An alternative could also be to change the backends from MsBackendDatabase to e.g. MsBackendMemory and then to join these...
I have to admit that I did not (yet) implement a backendMerge for the MsBackendSql or any of the other SQL-backed backends (MsBackendMassbankSql or the one used in CompoundDb).
Note: I've now finished the tutorial and merged all into the main branch - so, if you find typos or similar, please make a PR - also, please let me know if something is unclear or needs more details.
Thanks!
Question: Parallel, BiocParallel etc.
For example in #249,
> This function supports parallel processing which reduces the memory demand (only the peaks data of the currently processed files are loaded), but some backends (such as MsBackendSql) don't support parallel processing and hence the full data will be loaded and processed at once.
The vignette right now doesn't discuss parallel processing. What is expected of a backend for it to support parallel processing? Can it "signal" to Spectra that it does or doesn't support parallel processing? (I actually have some trouble with MsBackendMassbankSql that I "solve" by SerialParam().)
This is an excellent point - where I also struggle at present. How would you solve that? Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can overwrite)?
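A minimal sketch of what such an opt-out could look like with S4, purely to make the idea concrete. The toy classes below stand in for MsBackend and its subclasses, and supportsParallelProcessing is only the name proposed above; it is treated here as hypothetical, not as an existing Spectra function.

```r
library(methods)

# Toy classes standing in for MsBackend and two concrete backends.
setClass("ToyBackend", representation("VIRTUAL"))
setClass("ToyMemoryBackend", contains = "ToyBackend")
setClass("ToyDbBackend", contains = "ToyBackend")

# Hypothetical generic, named after the suggestion in this thread.
setGeneric("supportsParallelProcessing",
           function(object, ...) standardGeneric("supportsParallelProcessing"))

# Default for every backend: parallel processing is assumed to work.
setMethod("supportsParallelProcessing", "ToyBackend",
          function(object, ...) TRUE)

# A backend that holds a non-sharable connection opts out.
setMethod("supportsParallelProcessing", "ToyDbBackend",
          function(object, ...) FALSE)

mem_ok <- supportsParallelProcessing(new("ToyMemoryBackend"))
db_ok <- supportsParallelProcessing(new("ToyDbBackend"))
```

A frontend could then check this value and fall back to serial processing when it is FALSE, instead of the user having to pass SerialParam() by hand.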
> Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can overwrite)?
Something like this sounds sensible to me. But I somewhat have to return the question to you. I haven't studied the Spectra frontend (which does the parallel delegation etc) in enough detail.
- [...] MsBackend to not support parallel processing?
- [...] MsBackend dev need to know about the parallelism; i.e. what behaviour do I need to fulfill so that BiocParallel works?

It is my understanding that spectraData is required to return a superset of all the data accessible via accessor methods, except lengths, tic and spectraNames which are not spectraVariables (or is tic a spectraVariable?). Is this correct?
In that case, we expect that spectraData(object, "xxx")[, 1L] == xxx(object) for any object where xxx is any accessor? Is there any reason to implement an accessor method different from

    setMethod("xxx", "MsBackendTest", function(object) {
        spectraData(object, "xxx")[, 1L]
    })
Why is there tic and ionCount, and why is the latter optional?
If isCentroided is a heuristic approach (as opposed to centroided), why should the MsBackend implement it, rather than Spectra directly? If this is a well-defined heuristic approach, shouldn't it be the same for all backends?
Two expected behaviours that I deduce "by example", please confirm:
- Subset operator [: Based on existing examples (MsBackendMzR and MsBackendDataFrame), only integer indices within [1, length(x)] should work, and out-of-bounds should throw an error. ([...] c(2,3,4)[c(3,4)] == c(4, NA), but list(2,3)[c(2,3)] == list(3, NULL). [...] S4Vectors::DataFrame, DataFrame(blub=c(1,2,3), bla=c(2,3,4))[c(0,1),] returns row 1 with no error, but DataFrame(blub=c(1,2,3), bla=c(2,3,4))[c(1,5),] raises an error.)
- $ operator: should throw an error if the column is not in spectraVariables, for example:

    fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
    sps_sciex <- Spectra(files = fls, source = MsBackendMzR())
    sps_sciex@backend$rtimesdfgsdfg
Regarding parallel processing: parallel processing is by default performed by file, i.e. using the dataStorage spectra variable. Thus, any backend that does not need some special non-sharable connection (such as a database connection) should work out of the box. Even database connections would work, if the backend opens and closes the connection for each process.
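The "by file" splitting can be pictured with a small base R sketch. The file names and values below are made up, and lapply is used as a serial stand-in; with BiocParallel one could use e.g. bplapply over the same chunks. This is only an illustration of the idea, not the actual code used in Spectra.

```r
# dataStorage values for six spectra coming from two (made-up) files.
data_storage <- c("a.mzML", "a.mzML", "b.mzML", "a.mzML", "b.mzML", "b.mzML")
tic_values <- c(10, 20, 30, 40, 50, 60)

# One chunk of spectrum indices per file.
chunks <- split(seq_along(data_storage), data_storage)

# Each chunk is processed independently, so a worker only ever needs the data
# of one file. Serial stand-in for a parallel apply over the chunks.
per_file <- lapply(chunks, function(idx) sum(tic_values[idx]))
```

Because each chunk only touches its own indices, a backend that can be subset and read independently per file needs nothing special to take part in this scheme.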
I will expand the documentation/descriptions based on the questions you have. I'll also try to answer them here:
1) Regarding spectraData: yes, spectraData is expected to return the full data. How the individual accessor methods for spectra variables are implemented depends on the developer. Yes, you can do as you suggest and just have spectraData(object, "xxx")[, 1L], but for some variables it might be faster to directly access the data and not to create a DataFrame first and then subset that data frame again with [, 1L].
2) tic and ionCount - good question. This is something historical. So, yes, tic is a spectra variable - but it also has the additional parameter initial that allows calculating the values on the fly (which is the same as what ionCount does). ionCount is optional, because there is a default implementation for MsBackend.
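To make the difference concrete, here is a small base R sketch with toy data. The helper toy_tic and its arguments are made up for this illustration: with initial = TRUE it returns the stored value, otherwise it sums the intensities of each spectrum on the fly, which is what the answer above describes for ionCount.

```r
# Intensities of three toy spectra and the TIC values stored with them.
intensities <- list(c(5, 10, 15), c(2, 2), numeric())
stored_tic <- c(30, 4.5, 0)

# Hypothetical helper mimicking the two behaviours described above.
toy_tic <- function(intensities, stored, initial = TRUE) {
    if (initial) {
        stored
    } else {
        vapply(intensities, sum, numeric(1), na.rm = TRUE)
    }
}

from_variable <- toy_tic(intensities, stored_tic)
on_the_fly <- toy_tic(intensities, stored_tic, initial = FALSE)
```

The two results can differ (as for the second toy spectrum here), for example when the stored value was reported by the instrument and the peaks were modified afterwards.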
3) isCentroided: I would rather have default implementations for MsBackend than for Spectra. Yes, in this particular case it would work to have it for Spectra.
Subsetting with [: yes, an out-of-bounds error should be thrown. We usually use MsCoreUtils::i2index to check the i input parameter - I will update the example accordingly (and maybe add some dedicated unit tests for that too). Note also that [integer()] should also work and should return an empty backend (with 0 spectra).
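The expected subsetting behaviour can be summarised in a small base R sketch. check_index and toy_subset are hypothetical stand-ins written for this illustration (a character vector plays the role of the backend); they only handle positive numeric indices, whereas a real backend would typically delegate the check to MsCoreUtils::i2index.

```r
# Hypothetical stand-in for the index check: positive indices only, error on
# anything outside 1..n.
check_index <- function(i, n) {
    i <- as.integer(i)
    if (anyNA(i) || any(i < 1L | i > n))
        stop("subscript out of bounds")
    i
}

# Toy "backend": a character vector of spectrum identifiers.
toy_subset <- function(x, i) x[check_index(i, length(x))]

x <- c("s1", "s2", "s3")
dup <- toy_subset(x, c(1, 1, 1))      # duplication is allowed
empty <- toy_subset(x, integer())     # empty index gives an empty result
oob <- try(toy_subset(x, 5), silent = TRUE)  # out-of-bounds is an error
```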
Regarding $: yes, agree. I will update the examples and add some tests to the test suite.
And generally, I really appreciate these questions! They help to improve the documentation and also to identify issues I overlooked. So, thanks a lot!
Closing the issue now. Please re-open if some additional documentation/information should be added.
Hi,
I understand that this is a big wish, but: it would be great if there was a more formal introduction how to build a backend from zero.
In my existing work, I mostly started with MsBackendDataFrame and bent things around until it did what I wanted. The problem is that this leads me to simply copy stuff into a dataframe-like structure and reuse MsBackendDataFrame functionality rather than natively accessing the data in the best way possible.

Since Spectra aspires to be infrastructure and encourages developers to contribute backends, it would be great to have a more structured explanation:

- [...] peaksData, spectraData, peaksData<-, spectraData<-, $, [[, $<-, [[<- and so on. Right now I find myself digging in Spectra.R and other code to see how they are called.
- [...] (what is abstract in OOP thinking) and what I may override (what is virtual).
- [...] Spectra functionality depends on what MsBackend functionality.

Or what is the best resource you recommend for now?