A tutorial and specification for building backends

meowcat commented 1 year ago

Hi,

I understand that this is a big wish, but it would be great if there were a more formal introduction to building a backend from scratch.

In my existing work, I mostly started with MsBackendDataFrame and bent things around until they did what I wanted. The problem is that this leads me to simply copy data into a data-frame-like structure and reuse MsBackendDataFrame functionality rather than natively accessing the data in the best way possible.

Since Spectra aspires to be infrastructure and encourages developers to contribute backends, it would be great to have a more structured explanation:

- what I absolutely need to implement for a simple NOP backend that returns zero spectra
- step by step, what I need to add to provide data or accept writes
- some specification of what behaviour is expected of my backend. Specifically, what is expected from my peaksData, spectraData, peaksData<-, spectraData<-, $, [[, $<-, [[<- and so on. Right now I find myself digging through Spectra.R and other code to see how they are called.
- what I need to implement (i.e. what is abstract in OOP thinking) and what I may override (what is virtual)
- some kind of tree showing which Spectra functionality depends on which MsBackend functionality

Or what is the best resource you recommend for now?

jorainer commented 1 year ago

That's an excellent request indeed!

I will work on a small vignette illustrating the development of a new MsBackend. The starting point (and related documentation) should be in MsBackend.R, but I guess that's not detailed enough.

Adafede commented 1 year ago

👀 looking with much interest also

jorainer commented 1 year ago

I started writing yesterday. I might then also ask for your feedback @meowcat and @Adafede once it's progressed a bit.

jorainer commented 1 year ago

May I ask you for first feedback on the tutorial @meowcat and @Adafede ?

Please have a look at PR #265 and add comments/notes (or possibly even clone my branch and suggest changes?). The PR contains some first descriptions and implementation notes. It would be good to know from you if you are OK with the format and structure.

Also please let me know what is unclear or where more description/details are needed!

Thanks!

I will continue adding content, but it might be good if you could already start checking the content.

meowcat commented 1 year ago

Hi,

thanks for the effort already!

Some questions that pop up:

May backendMerge also work with a different backend, or shall it only work with the same type?
What are the intended semantics of subsetting and of backendMerge?
- For example, if my backend is some MsBackendDatabase() backed by an on-disk database, is backendMerge(b1, b2) supposed to append the spectra b2 into the database b1?
- From how subsetting is used in practice, I understand it is a purely "filtering" operation and shall not actually do any action? Then what is the expected outcome of
```
b1 <- backendInitialize(MsBackendDatabase(), source = "database1.db")
b2 <- backendInitialize(MsBackendDatabase(), source = "database2.db")
b1sub <- b1[1:10]
bjoin <- backendMerge(b1sub, b2)
```
- If backendMerge is not supposed to append into a database, then how is appending supposed to happen?

jorainer commented 1 year ago

> May backendMerge also work with a different backend, or shall it only work with the same type?

It is intended to be used for backends of the same type, but ultimately it's up to the developer. You can implement your method to also support merging different backends, as long as the method returns a class extending MsBackend.

- subsetting: yes, so far it's purely filtering/subsetting (although you should also allow duplication, e.g. x[c(1, 1, 1)]). But that's no different from the base R [ method.
- backendMerge should allow combining the data from different instances of a backend. It is somewhat similar to c (except that you should also allow merging of backends that provide different spectra variables).
- backendMerge(b1, b2) with b1 and b2 both being a MsBackendDatabase should ideally merge the data into a database, yes. How to do that (whether into the database from b1 or into a completely new one) can also be up to the developer. An alternative could be to change the backends from MsBackendDatabase to e.g. MsBackendMemory and then join these.

I have to admit that I did not (yet) implement a backendMerge for the MsBackendSql or any of the other SQL-backed backends (MsBackendMassbankSql or the one used in CompoundDb).
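
A minimal sketch of these semantics, using a plain data.frame as a stand-in for a backend (one row per spectrum, one column per spectra variable). `toy_merge` is a hypothetical helper written for illustration only, not a Spectra function; it assumes that variables missing in one input are filled with NA.

```r
# Toy stand-in for a backend: one row per spectrum, columns are spectra
# variables. `toy_merge` is a hypothetical helper, not a Spectra function.
toy_merge <- function(...) {
    parts <- list(...)
    # Union of all spectra variables, in order of first appearance.
    vars <- unique(unlist(lapply(parts, names)))
    filled <- lapply(parts, function(p) {
        # Add variables that this part lacks, filled with NA.
        for (v in setdiff(vars, names(p)))
            p[[v]] <- rep(NA, nrow(p))
        p[, vars, drop = FALSE]
    })
    res <- do.call(rbind, filled)
    rownames(res) <- NULL
    res
}

b1 <- data.frame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 1.4, 1.6))
b2 <- data.frame(msLevel = c(2L, 2L), precursorMz = c(150.1, 200.2))

# Subsetting is pure filtering: b1 itself is left unchanged.
b1sub <- b1[c(1, 1, 3), , drop = FALSE]   # duplication is allowed
bjoin <- toy_merge(b1sub, b2)
```

Here `bjoin` holds the three spectra selected from `b1` followed by the two from `b2`, with all three variables present. Whether a database-backed backend writes such a result into an existing database or a new one is the developer's choice, as discussed above.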

jorainer commented 1 year ago

Note: I've now finished the tutorial and merged all into the main branch - so, if you find typos or similar, please make a PR - also, please let me know if something is unclear or needs more details.

meowcat commented 1 year ago

Thanks!

Question: Parallel, BiocParallel etc.

For example in #249,

> This function supports parallel processing which reduces the memory demand (only the peaks data of the currently processed files are loaded), but some backends (such as MsBackendSql) don't support parallel processing and hence the full data will be loaded and processed at once.

The vignette right now doesn't discuss parallel processing. What is expected of a backend for it to support parallel processing? Can it "signal" to Spectra that it does or doesn't support parallel processing? (I actually have some trouble with MsBackendMassbankSql that I "solve" by SerialParam().)

jorainer commented 1 year ago

This is an excellent point, and one where I also struggle at present. How would you solve that? Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can override)?
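
A rough sketch of how such a flag could look with S4 dispatch. The generic name is taken from the suggestion above and is not an existing API; `ToyBackend`, `ToyFileBackend`, `ToyDbBackend` and `choose_mode` are hypothetical names used only for illustration.

```r
library(methods)

# Hypothetical base class and subclasses standing in for MsBackend and two
# concrete backends; none of these are the real Spectra classes.
setClass("ToyBackend", representation("VIRTUAL"))
setClass("ToyFileBackend", contains = "ToyBackend")
setClass("ToyDbBackend", contains = "ToyBackend")

# The proposed generic (name taken from the discussion, not an existing API).
setGeneric("supportsParallelProcessing",
           function(object, ...) standardGeneric("supportsParallelProcessing"))

# Default for every backend: parallel processing is supported.
setMethod("supportsParallelProcessing", "ToyBackend",
          function(object, ...) TRUE)

# A backend holding a non-sharable connection overrides the default.
setMethod("supportsParallelProcessing", "ToyDbBackend",
          function(object, ...) FALSE)

# A front end could consult the flag before choosing how to process.
choose_mode <- function(backend) {
    if (supportsParallelProcessing(backend)) "parallel" else "serial"
}

choose_mode(new("ToyFileBackend"))
choose_mode(new("ToyDbBackend"))
```

With this layout a new backend gets the default for free and only needs a one-line method if it cannot be processed in parallel.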

meowcat commented 1 year ago

> Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can overwrite)?

Something like this sounds sensible to me. But I somewhat have to return the question to you. I haven't studied the Spectra frontend (which does the parallel delegation etc) in enough detail.

- is it even acceptable for an MsBackend to not support parallel processing?
- what does the MsBackend dev need to know about the parallelism; i.e. what behaviour do I need to fulfill so that BiocParallel works?

meowcat commented 1 year ago

Accessor methods

It is my understanding that spectraData is required to return a superset of all the data accessible via accessor methods, except lengths, tic and spectraNames which are not spectraVariables (or is tic a spectraVariable?).

is this correct?
in that case, we expect that spectraData(object, "xxx")[, 1L] == xxx(object) for any object where xxx is any accessor? Is there any reason to implement an accessor method different from
```
setMethod("xxx", "MsBackendTest", function(object) {
    spectraData(object, "xxx")[, 1L]
})
```
Why is there tic and ionCount, and why is the latter optional?
If isCentroided is a heuristic approach (as opposed to centroided), why should the MsBackend implement it, rather than Spectra directly? If this is a well-defined heuristic approach, shouldn't it be the same for all backends?

meowcat commented 1 year ago

Two expected behaviours that I deduce "by example", please confirm:

Subset operator [: Based on existing examples (MsBackendMzR and MsBackendDataFrame), only integer indices within [1, length(x)] should work, and out-of-bounds indices should throw an error.
- (Note that R is not very consistent in this, for example c(2,3,4)[c(3,4)] == c(4, NA), but list(2,3)[c(2,3)] == list(3, NULL).
- Curiously, for S4Vectors::DataFrame, DataFrame(blub=c(1,2,3), bla=c(2,3,4))[c(0,1),] returns row 1 with no error, but DataFrame(blub=c(1,2,3), bla=c(2,3,4))[c(1,5),] raises an error.)

$ operator: should throw an error if the column is not in spectraVariables, for example:

```
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
sps_sciex <- Spectra(files = fls, source = MsBackendMzR())
sps_sciex@backend$rtimesdfgsdfg
```

jorainer commented 1 year ago

Regarding parallel processing: parallel processing is by default performed by file, i.e. using the dataStorage spectra variable. Thus, any backend that does not need some special non-sharable connection (such as a database connection) should work out of the box. Even database connections would work, if the backend opens and closes the connection for each process.
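
The per-file splitting can be illustrated with base R alone. This is only a sketch under assumptions: the toy data frame stands in for a backend, `process_chunk` is a hypothetical processing step, and `lapply` stands in for a parallel apply such as `BiocParallel::bplapply`.

```r
# Toy data: one row per spectrum; dataStorage names the originating file.
spectra <- data.frame(
    dataStorage = c("a.mzML", "a.mzML", "b.mzML", "b.mzML", "b.mzML"),
    tic = c(10, 20, 5, 15, 25)
)

# Split the spectrum indices by file, as the per-file default would.
chunks <- split(seq_len(nrow(spectra)), spectra$dataStorage)

# lapply stands in for a parallel apply such as BiocParallel::bplapply;
# each call only sees the spectra of one file. A backend with a database
# connection would open it at the start of this function and close it at
# the end, so no connection is shared between workers.
process_chunk <- function(idx) spectra$tic[idx] / max(spectra$tic[idx])
res <- lapply(chunks, process_chunk)

# Put the per-file results back into the original spectrum order.
normalized <- unsplit(res, spectra$dataStorage)
```

Because each chunk is self-contained, a backend works with this scheme as long as its data access does not rely on state shared across processes.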

jorainer commented 1 year ago

I will expand the documentation/descriptions based on the questions you have. I will also try to answer them here:

1) Regarding spectraData: yes, spectraData is expected to return the full data. How the individual accessor methods for spectra variables are implemented is up to the developer. Yes, you can do as you suggest and just use spectraData(object, "xxx")[, 1L], but for some variables it might be faster to access the data directly instead of first creating a DataFrame and then subsetting that data frame again with [, 1L].

2) tic and ionCount - good question. This is something historical. So, yes, tic is a spectra variable, but it also has the additional parameter initial that allows calculating the values on the fly (which is what ionCount does). ionCount is optional because there is a default implementation for MsBackend.

3) isCentroided: I would rather have default implementations for MsBackend than for Spectra. Yes, in this particular case it would also work to have it for Spectra.
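
To make points 1) and 2) concrete, here is a small sketch with a hypothetical toy class. `ToyAccessorBackend`, `toySpectraData`, `rtimeViaTable`, `rtimeDirect` and `toyTic` are illustrative names only and not part of Spectra; the storage layout is an assumption.

```r
library(methods)

# Hypothetical toy backend: variables are kept as plain vectors in a list,
# peaks as a list of intensity vectors. Not a real Spectra class.
setClass("ToyAccessorBackend",
         representation(vars = "list", intensity = "list"))

# Stand-in for spectraData(): assembles the requested variables into a table.
toySpectraData <- function(object, columns = names(object@vars)) {
    as.data.frame(object@vars[columns], stringsAsFactors = FALSE)
}

# Generic fallback accessor: build the table, then pull out one column.
rtimeViaTable <- function(object) toySpectraData(object, "rtime")[, 1L]

# Direct accessor: skips the intermediate table entirely.
rtimeDirect <- function(object) object@vars$rtime

# tic-like accessor: with initial = TRUE return the stored value, otherwise
# compute it on the fly from the peak intensities (what ionCount does).
toyTic <- function(object, initial = TRUE) {
    if (initial) object@vars$tic
    else vapply(object@intensity, sum, numeric(1))
}

be <- new("ToyAccessorBackend",
          vars = list(rtime = c(1.2, 1.4), tic = c(100, 200)),
          intensity = list(c(10, 20, 30), c(5, 5)))
```

Both rtime accessors return the same values; the direct one simply avoids constructing a table that is immediately thrown away. The stored and recomputed tic values can differ, which is why the `initial` switch exists.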

jorainer commented 1 year ago

- subsetting with [: yes, an out-of-bounds error should be thrown. We usually use MsCoreUtils::i2index to check the i input parameter. I will update the example accordingly (and maybe add some dedicated unit tests for that too). Note that [integer()] should also work and should return an empty backend (with 0 spectra).
- Regarding $: yes, agreed. I will update the examples and add some tests to the test suite.
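
The expected behaviour of both operators can be sketched with a hypothetical toy class. `ToySubsetBackend` and `checkIndex` are illustrative names; `checkIndex` is a simplified base R stand-in for an index check such as MsCoreUtils::i2index and handles positive numeric indices only.

```r
library(methods)

# Hypothetical toy backend holding spectra variables in a data.frame.
setClass("ToySubsetBackend", representation(data = "data.frame"))

# Simplified stand-in for an index check such as MsCoreUtils::i2index
# (positive integer-like indices only).
checkIndex <- function(i, n) {
    i <- as.integer(i)
    if (anyNA(i) || any(i < 1L | i > n))
        stop("index out of bounds: must be within [1, ", n, "]")
    i
}

setMethod("[", "ToySubsetBackend", function(x, i, j, ..., drop = FALSE) {
    i <- checkIndex(i, nrow(x@data))
    x@data <- x@data[i, , drop = FALSE]
    x
})

setMethod("$", "ToySubsetBackend", function(x, name) {
    if (!name %in% names(x@data))
        stop("spectra variable '", name, "' not available")
    x@data[[name]]
})

setMethod("length", "ToySubsetBackend", function(x) nrow(x@data))

be <- new("ToySubsetBackend",
          data = data.frame(msLevel = c(1L, 2L, 2L), rtime = c(1.2, 1.4, 1.6)))

length(be[c(1, 1, 1)])    # duplication is allowed
length(be[integer()])     # empty backend with 0 spectra
tryCatch(be[5], error = function(e) conditionMessage(e))
tryCatch(be$rtimesdfgsdfg, error = function(e) conditionMessage(e))
```

Failing loudly in both cases avoids the silent NA or NULL results that base R vectors and lists produce for out-of-bounds or unknown names.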

And generally, I really appreciate these questions! They help to improve the documentation and also to identify issues I overlooked. So, thanks a lot!

jorainer commented 1 year ago

Closing the issue now. Please re-open if some additional documentation/information should be added.

rformassspectrometry / Spectra

A tutorial and specification for building backends #262