rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
34 stars 24 forks source link

A tutorial and specification for building backends #262

Closed meowcat closed 1 year ago

meowcat commented 1 year ago

Hi,

I understand that this is a big wish, but: it would be great if there was a more formal introduction how to build a backend from zero.

In my existing work, I mostly started with MsBackendDataframe and bend things around until it does what I want. The problem is that this leads me to simply copy stuff into a dataframe-like structure and reusing MsBackendDataframe functionality rather than natively accessing the data in the best way possible.

Since Spectra aspires to be infrastructure and encourages developers to contribute backends, it would be great to have a more structured explanation:

Or what is the best resource you recommend for now?

jorainer commented 1 year ago

That's an excellent request indeed!

I will work on a small vignette illustrating the development of a new MsBackend. The starting point (and related documentation should be in MsBackend.R but I guess that's not detailed enough.

Adafede commented 1 year ago

👀 looking with much interest also

jorainer commented 1 year ago

I started writing yesterday. I might then also ask for your feedback @meowcat and @Adafede once it's progressed a bit.

jorainer commented 1 year ago

May I ask you for first feedback on the tutorial @meowcat and @Adafede ?

Please have a look at the PR #265 and add comments/notes (or eventually even clone my branch and suggest changes?). The PR contains some first descriptions and implementation notes. Would be good to know from you if you are OK with the format and structure.

Also please let me know what is unclear or where more description/details are needed!

Thanks!

I will continue adding content, but it might be good if you could already start checking the content.

meowcat commented 1 year ago

Hi,

thanks for the effort already!

Some questions that pop up:

jorainer commented 1 year ago

May backendMerge also work with a different backend, or shall it only work with the same type?

It is intended to be used for backends of the same type - but in the end it's also up to the developer. You can implement your method to also support merging different backends - in the end the method should return a class extending MsBackend.

I have to admit that I did not (yet) implement a backendMerge for the MsBackendSql or any of the other SQL-backed backends (MsBackendMassbankSql or the one used in CompoundDb).

jorainer commented 1 year ago

Note: I've now finished the tutorial and merged all into the main branch - so, if you find typos or similar, please make a PR - also, please let me know if something is unclear or needs more details.

meowcat commented 1 year ago

Thanks!

Question: Parallel, BiocParallel etc.

For example in #249,

This function supports parallel processing which reduces the memory demand (only the peaks data of the currently processed files are loaded), but some backends (such as MsBackendSql) don't support parallel processing and hence the full data will be loaded and processed at once.

The vignette right now doesn't discuss parallel processing. What is expected of a backend for it to support parallel processing? Can it "signal" to Spectra that it does or doesn't support parallel processing? (I actually have some trouble with MsBackendMassbankSql that I "solve" by SerialParam().)

jorainer commented 1 year ago

this is an excellent point - where I also struggle at present. How would you solve that? Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can overwrite)?

meowcat commented 1 year ago

Add a supportsParallelProcessing method to MsBackend (default TRUE, but backends can overwrite)?

Something like this sounds sensible to me. But I somewhat have to return the question to you. I haven't studied the Spectra frontend (which does the parallel delegation etc) in enough detail.

meowcat commented 1 year ago

Accessor methods

It is my understanding that spectraData is required to return a superset of all the data accessible via accessor methods, except lengths, tic and spectraNames which are not spectraVariables (or is tic a spectraVariable?).

meowcat commented 1 year ago

Two expected behaviours that I deduce "by example", please confirm:

jorainer commented 1 year ago

Regarding parallel processing: parallel processing is by default performed by file, i.e. using the dataStorage spectra variable. Thus, any backend that does not need some special non-sharable connection (such as a database connection) should work out of the box. Even database connections would work, if the backend opens and closes the connection for each process.

jorainer commented 1 year ago

I will expand the documentation/descriptions based on the questions you have. I try to answer them also here:

1) Regarding spectraData: yes, spectraData is expected to return the full data. How the individual accessor methods for spectra variables are implemented depends on the developer. Yes, you can do like you suggest and just have spectraData(object, "xxx")[, 1L], but for some variables it might be faster to directly access the data and not to create a DataFrame first and then subset that data frame again with [, 1L].

2) tic and ionCount - good question. This is something historical. So, yes, tic is a spectra variable - but has also the additional parameter initial that allows to calculate the values on the fly (which is the same what ionCount does. ionCount is optional, because there is a default implementation for MsBackend.

3) isCentroided: I would rather like to have default implementations for MsBackend then for Spectra. Yes, in this particular case it would work to have it for Spectra.

jorainer commented 1 year ago
  1. subsetting with [: yes, an out-of-bounds error should be thrown. We usually use MsCoreUtils::i2index to check the i input parameter - I will update the example accordingly (and maybe add some dedicated unit tests for that too). Note also that [integer()] should also work and should return an empty backend (with 0 spectra).

  2. Regarding $: yes, agree. I will update the examples. and add some tests to the test suite.

And generally, I really appreciate these questions! Helps to improve the documentation - and also identifying issues I did oversee. So, thanks a lot!

jorainer commented 1 year ago

Closing the issue now. Please re-open if some additional documentation/information should be added.