That's a pity that future couldn't work. It's limited to PSOCK now, right?
I made a few comments and will test a bit this afternoon.
Added some additional changes; the main one is that I changed analyse() to transfer data into the sub-processes via disk instead of via the network. I must admit I am a little unsure about this as I don't know how reliable parallel file reads are, though it seems to work fine in my testing so far ...
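To spell that out, the rough shape of the disk-based transfer is below (simplified and illustrative, not the actual internals):

```r
library(parallel)

big_object <- data.frame(x = rnorm(1e5))  # stand-in for the imputation object

cl <- makePSOCKcluster(2)

# Write the object to disk once in the parent process ...
path <- tempfile(fileext = ".rds")
saveRDS(big_object, path)

# ... then have each worker read it back from the shared filesystem
# instead of serialising it over the socket. This only works when the
# workers can see the parent's disk, hence the local-machine restriction.
clusterExport(cl, "path")
res <- clusterEvalQ(cl, {
  big_object <- readRDS(path)
  nrow(big_object)
})

stopCluster(cl)
unlink(path)
```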
I also added a .validate argument to skip the input validation tests that were taking a surprising amount of time.
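For clarity, the pattern is just a guard around the checks (hypothetical shape, not the exact rbmi signature):

```r
# Hypothetical illustration of the .validate toggle: expensive input
# checks only run when .validate is TRUE.
analyse_like <- function(data, .validate = TRUE) {
  if (.validate) {
    stopifnot(is.data.frame(data), nrow(data) > 0)
  }
  colMeans(data)
}

analyse_like(data.frame(x = 1:5))                     # checks run
analyse_like(data.frame(x = 1:5), .validate = FALSE)  # checks skipped
```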
@gravesti - It's just dawned on me that using RDS files to load data into the parallel processes limits parallelisation to local machines only. The documentation suggests that PSOCK supports running parallel processes on remote machines, which wouldn't work in this case. The RDS loading saves ~3-4 seconds, so I'm not sure whether to switch back to network loading, just put a big warning in the documentation pages, or provide a toggleable option.
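To illustrate what would break (the hostnames in the comment are hypothetical):

```r
# makePSOCKcluster() accepts worker hostnames, e.g.
#   parallel::makePSOCKcluster(c("node1.example.com", "node2.example.com"))
# Workers on remote nodes cannot read RDS files written to the parent's
# local disk, which is why the RDS approach limits us to one machine.
cl <- parallel::makePSOCKcluster(2)  # local equivalent, runs fine
parallel::stopCluster(cl)
```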
@gowerc Have we been testing on a realistic analysis size? If so, then maybe it is ok to be limited to a single machine and just have a warning in the docs.
I guess the same argument applies to not using RDS: accept the few-seconds penalty in exchange for less complexity, no restrictions, and no extra arguments.
Talking to Marcel and Alessandro, I think a realistic sample size would be on the order of 1000-2000, so our test code is likely in the right region (20-30 seconds when run sequentially). That said, I think a common use case for this will be tipping point analyses, where the same call to analyse() is run 20-40 times with slightly different offsets. This is an awkward one where the answer is really "it depends...".
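To make that concrete, the tipping point pattern is roughly the loop below (a hedged sketch: the delta_template()/analyse() usage is simplified, and imputations/vars are assumed to come from the earlier pipeline steps):

```r
library(rbmi)

# Tipping point analysis: the same analyse() call repeated over a grid
# of delta offsets applied to the imputed values. `imputations` and
# `vars` are assumed to come from prior impute()/set_vars() steps.
offsets <- seq(0, 10, by = 0.5)

results <- lapply(offsets, function(offset) {
  delta_df <- delta_template(imputations)
  delta_df$delta <- delta_df$is_missing * offset
  analyse(imputations, fun = ancova, delta = delta_df, vars = vars)
})
```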
I think my current assumption would be that people using rbmi's internal parallelisation are likely just speeding up individual runs on their local machines. If a user gets to the stage where they need to run across different nodes (say, for multiple sensitivity analyses on different sub-endpoints), they are more likely to parallelise the different runs of rbmi externally, rather than use rbmi's internal parallelisation.
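That is, something along these lines (a sketch, with a placeholder pipeline body):

```r
library(parallel)

# "External" parallelisation: each worker runs one complete rbmi pipeline
# with rbmi's internal parallelisation switched off. Placeholder body.
run_one_analysis <- function(endpoint) {
  # draws() -> impute() -> analyse() -> pool() for this endpoint would go here
  endpoint
}

# Local workers shown so this runs as-is; a character vector of hostnames
# could be passed instead to spread the runs across machines.
cl <- makePSOCKcluster(2)
results <- parLapply(cl, c("endpoint_a", "endpoint_b"), run_one_analysis)
stopCluster(cl)
```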
Given how inefficient the internal parallelisation of rbmi is for analyse(), the idea of recruiting a whole cluster of nodes seems ridiculous, so I guess I'd rather support the optimisation on a local machine.
Then again, I fear people will just assume this functionality is fully compatible with parallel's remote PSOCK setup and run it anyway, i.e. we would be breaking that interface...
I think on balance I'm inclined to leave it as is and add a warning note in the documentation.
@gowerc Will you merge the lintr branch first? Do you still need to update the docs in this PR?
@gravesti - Have merged in the lintr PR and have added an additional warning regarding the PSOCK network stuff.
Urgh, just seen there's a bunch of NOTEs that I need to resolve:
```
* checking R code for possible problems ... NOTE
analyse : <anonymous> : inner_fun: no visible binding for global
  variable ‘..rbmi..analysis..imputations’
analyse : <anonymous> : inner_fun: no visible binding for global
  variable ‘..rbmi..analysis..delta’
analyse : <anonymous> : inner_fun: no visible global function
  definition for ‘..rbmi..analysis..fun’
Undefined global functions or variables:
  ..rbmi..analysis..delta ..rbmi..analysis..fun
  ..rbmi..analysis..imputations
```
Ok I think that last commit should address the NOTE
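For anyone hitting the same thing: the usual pattern for silencing these NOTEs is utils::globalVariables(), though the actual commit may have fixed it differently:

```r
# Declare variables that only exist at run time on the worker processes
# so R CMD check stops flagging them (typically placed in R/zzz.R).
utils::globalVariables(c(
  "..rbmi..analysis..imputations",
  "..rbmi..analysis..delta",
  "..rbmi..analysis..fun"
))
```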
Closes #370
Some general notes:

* I decided against using future() in the end; I just ran into edge case after edge case that made it too difficult to use and test.
* Added make_rbmi_cluster() to make this setup process as smooth as possible (see the sketch after this list).
* Benchmarked with lm, but currently I am only seeing a 30% gain when using large samples, e.g. >2000; for anything smaller it often takes longer to run in parallel. This is mostly due to the IO exchange by the looks of it. R's PSOCK clusters just transfer data exceedingly slowly :(
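Roughly, the intended usage of make_rbmi_cluster() looks like the sketch below (argument names, and how the cluster is handed to analyse(), may not match the merged interface exactly):

```r
library(rbmi)
library(parallel)

# Spin up 4 pre-configured workers once and reuse them across calls.
# Whether analyse() takes the cluster via `ncores` (or another argument)
# is an assumption here, not a confirmed part of the interface.
cl <- make_rbmi_cluster(4)

# `imputations` is assumed to come from a prior impute() call.
anl <- analyse(imputations, fun = ancova, ncores = cl)

stopCluster(cl)
```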