rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

Check if a backend supports parallel processing #271

Closed - jorainer closed this issue 1 year ago

jorainer commented 1 year ago

This is to keep track of the issue from @meowcat (https://github.com/rformassspectrometry/Spectra/issues/262#issuecomment-1403846824):

Some backends, specifically those using SQL databases, don't support parallel processing (because the SQL connection cannot be shared across threads). We need to consider such situations, since otherwise we can get unexpected errors with some backends.

The best option seems to be to add a backendBpparam method to MsBackend with the following default:

setMethod("backendBpparam", "MsBackend", function(object, BPPARAM = bpparam()) {
    BPPARAM
})

This should be called by all Spectra methods that run any BiocParallel function, to validate/check the user-defined BPPARAM parameter. A MsBackendSql backend would thus implement:

setMethod("backendBpparam", "MsBackendSql", function(object, BPPARAM = bpparam()) {
    if (!inherits(BPPARAM, "SerialParam"))
        message("Disabled parallel processing since 'MsBackendSql' does not support that")
    SerialParam()
})
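
A minimal sketch of the calling side (not actual package code; the helper name applyToChunks and the use of the backend slot are made up for illustration): every Spectra method that uses BiocParallel would first let the backend validate, and possibly downgrade, the user-supplied BPPARAM.

library(BiocParallel)
library(Spectra)

## Hypothetical caller-side helper: let the backend validate/downgrade the
## user-supplied BPPARAM before any parallel processing is started.
applyToChunks <- function(sps, FUN, BPPARAM = bpparam()) {
    BPPARAM <- backendBpparam(sps@backend, BPPARAM)
    bplapply(split(sps, f = dataStorage(sps)), FUN = FUN, BPPARAM = BPPARAM)
}
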
meowcat commented 1 year ago

Copying from the other issue:

Even database connections would work, if the backend opens and closes the connection for each process.

Do you know of any way to force a side effect when opening a new process?

In my case, I could in theory call if(!dbIsValid(db@con)) db@con <- dbConnect(...) in every function (or better, get con via an accessor function rather than from the slot). However, this would reconnect on every use of the Spectra object until it is copied / subsetted / filtered, because the resulting new object with a working con is ephemeral. Not sure if I am being clear.

(A way around it would be to additionally maintain a connection table in some global object in the background.)
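
A rough sketch of that reconnect-on-demand idea (assuming, purely for illustration, a backend with con and dbname slots and an SQLite database; the .connection() accessor is a made-up helper):

library(DBI)
library(RSQLite)

## Made-up accessor: return the stored connection, reconnecting if it is no
## longer valid (e.g. inside a fresh worker process). Because S4 objects are
## copied, the refreshed connection is not kept on the original object, which
## is the ephemerality problem described above.
.connection <- function(object) {
    if (!dbIsValid(object@con))
        object@con <- dbConnect(RSQLite::SQLite(), object@dbname)
    object@con
}
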

meowcat commented 1 year ago

Storing the con in an environment slot should work.
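
A hedged sketch of what that could look like (class and slot names are invented): because environments have reference semantics, replacing the connection inside the environment is visible to every copy of the backend within the same R process.

library(DBI)
library(RSQLite)

## Invented backend class that keeps its connection inside an environment.
setClass("DummySqlBackend", slots = c(env = "environment", dbname = "character"))

## Accessor: reconnect inside the environment if needed; all copies of the
## backend that share this environment see the new connection.
.connection <- function(object) {
    if (is.null(object@env$con) || !dbIsValid(object@env$con))
        object@env$con <- dbConnect(RSQLite::SQLite(), object@dbname)
    object@env$con
}
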

jorainer commented 1 year ago

From my old days of Java programming: what we did there was to initialize a pool of database connections from which processes could get one, or a new one was created if none was available. What I don't quite like about that idea is that we would need to store the connection information within the backend. The question really is whether there is any benefit from parallel database queries at all; some databases might not even allow that.

Regarding storing the con in an environment - are you sure about that? I guess that would not work on Windows, because there a new R process is started for each parallel worker and no memory is shared across them (so I guess even there the environment would need to be copied - but I didn't try...).

I haven't evaluated the benefits/disadvantages of keeping the connection in the backend versus opening/closing the connection for each call. Both approaches should work, and the latter would support parallel processing...

meowcat commented 1 year ago

> The question really is whether there is any benefit from parallel database queries at all; some databases might not even allow that.

There will be a benefit if there is significant client-side processing or postprocessing that drives overall computing time. The obvious example is spectral matching, where getting the records from the backend (no matter which backend) is a minor contributor to overall processing time. I have to admit that I don't recall offhand where the parallelism comes in there. But there could even be delays purely on the backend side.

> Regarding storing the con in an environment - are you sure about that?

I think it should work, but I didn't clearly explain what I was actually proposing, which I guess is similar to the Java pool idea. I'll try to make a code example.
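
For illustration only (this is not the promised example, just one possible shape of a pool-like approach, with all names invented): a package-level registry that keeps one connection per R process, created lazily and reused within that process.

library(DBI)
library(RSQLite)

## Hypothetical package-level registry: one connection per database and per
## R process, opened lazily and reused on subsequent calls in that process.
.pool <- new.env(parent = emptyenv())

.pooled_connection <- function(dbname) {
    key <- paste(dbname, Sys.getpid(), sep = "-")
    if (!exists(key, envir = .pool, inherits = FALSE) ||
        !dbIsValid(.pool[[key]]))
        .pool[[key]] <- dbConnect(RSQLite::SQLite(), dbname)
    .pool[[key]]
}
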

jorainer commented 1 year ago

Just a note on where and how Spectra uses parallel processing: most functions with parallel processing will split the Spectra (based on a parameter f = dataStorage(object)) and then call bplapply on the list of Spectra returned by the split.

compareSpectra is a bit different, because I've seen that chunk-wise processing seems to be faster (see #116).
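
A simplified illustration of that split/apply pattern (not the actual internals; sps is assumed to be an existing Spectra object):

library(BiocParallel)
library(Spectra)

## Split the Spectra by data storage, process each chunk (possibly) in
## parallel, then combine the per-chunk results; msLevel() is just a cheap
## example computation.
f <- factor(dataStorage(sps))
sps_list <- split(sps, f = f)
res <- bplapply(sps_list, FUN = msLevel, BPPARAM = bpparam())
## the real code additionally takes care of restoring the original order
res <- unlist(res, use.names = FALSE)
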

jorainer commented 1 year ago

Not that I'm against parallel processing - I'm happy to look into that again, but what I've realized over time is that parallel processing sometimes requires so much overhead (copying the objects, distributing them to the processes, collecting the results) that it does not really help unless you have computationally intense tasks.

In the previous comment I mentioned that chunk-wise loading of the data helped - maybe I can check whether we gain significantly by then performing the calculations (once we're down to simple peak matrices) in parallel... I haven't tried that yet.

jorainer commented 1 year ago

That's now implemented. Higher-level code should use BPPARAM <- backendBpparam(backend, BPPARAM), which is supposed to return SerialParam() if the backend does not support parallel processing. All functions in Spectra use this.