Closed jorainer closed 1 year ago
Copying from the other issue:
Even database connections would work, if the backend opens and closes the connection for each process.
Do you know of any way to force a side effect when opening a new process?
In my case, I could in theory if(!dbIsValid(db@con)) db@con <- dbConnect(...)
in every function (or better, get con
via an accessor function rather than from the slot). However this would reconnect on every use of the Spectra object until it is copied / subsetted / filtered, because the resulting new object with a working con
is ephemeral. Not sure if I am being clear.
(A way around it would be to additionally maintain a connection table in some global object in the background.)
Storing the con
in an environment
slot should work.
From my old days of java programming - what we did there is to initialize a pool of database connections and processes could get one from there - or a new one is created if none is available. what I don't quite like about that idea is that we would need to store the connection information within the backend. The question is really if there is any benefit from parallel database queries? some databases might not even allow that.
Regarding storing the con
in an environment
- are you sure about that? I guess that would not work on Windows because there a new R process is started for each parallel process and no memory is shared across them (thus I guess even there the environment
would need to be copied - but I didn't try...).
I haven't tried the benefits/disadvantages of keeping the connection in the backend with opening/closing the connection for each call. Both approaches should work. and the latter would support parallel processing...
The question is really if there is any benefit from parallel database queries? some databases might not even allow that.
There will be a benefit if there is significant client-side processing or postprocessing that drives overall computing time. The obvious example is spectral matching, where getting the records from the backend (no matter what backend) is a minor contributor to overall processing time. I have to admit that I don't have present where the parallelism comes in there. But there could even be delays purely on the backend side.
Regarding storing the con in an environment - are you sure about that?
I think it should work, but I didn't clearly explain what I was actually proposing, which I guess is similar to the Java pool idea. I'll try to make a code example.
Just a note on where and how Spectra
uses parallel processing: most functions with parallel processing will split
the Spectra
(based on a parameter f = dataStorage(object)
) and then call bplapply
on the list
of Spectra
returned by the split
.
compareSpectra
is a bit different, because I've seen that chunk-wise processing seems to be faster (see #116).
not that I'm against parallel processing - I'm happy to look again into that, but what I've realized over time that sometimes parallel processing requires so much overhead (copying of the objects, distributing them to the processes, collecting the results) that it does not really help unless you have computationally intense tasks.
In the previous comment I mentioned chunk-wise loading of the data that helped - maybe I can check if we gain significantly by performing then the calculations (once we're down to simple peak matrices) in parallel... haven't tried that yet.
That's now implemented. Higher level code should use BPPARAM <- backendBpparam(backend, BPPARAM)
that is supposed to return SerialParam()
if the backend does not support parallel processing. All functions in Spectra
use this.
This is to keep track of the issue from @meowcat (https://github.com/rformassspectrometry/Spectra/issues/262#issuecomment-1403846824):
Some backends, specifically those using SQL databases don't support parallel processing (because the SQL connection can not be shared across threads). We need to consider such situations since we can get unexpected errors on some backends.
What seems the best option is to add a method
backendBpparam
toMsBackend
with the following default:this should be called by all
Spectra
methods that run anyBiocParallel
method to validate/check the user definedBPPARAM
parameter. AMsBackendSql
backend would thus implement: