Parallelisation support for analyse

gowerc commented 1 month ago

Closes #370

Some general notes:

I gave up on future() in the end. I just ran into edge case after edge case that made it too difficult to use and test.
I opted for just using parallel but allowing uses to provide their own clusters which can have their functions / libraries already loaded
I then created a helper function make_rbmi_cluster() to make this setup process as smooth as possible
This also allows for the re-use of clusters between runs to avoid having to repeatedly setup and tear down processes which is very expensive
Unfortunately the performance gain was pretty underwhelming, I'm sure it will be better with analysis functions that are more expensive than lm but currently I am only seeing a 30% gain when using large samples e.g. >2000 for less than that it often takes longer to run in parallel. This is mostly due to the IO exchange by the looks of it. R's PSOCK clusters just transfer data exceedingly slowly :(

gravesti commented 1 month ago

That's a pity that future couldn't work. It's limited to PSOCK now, right? I made a few comments and will test a bit this afternoon.

gowerc commented 1 month ago

Added some additional changes, main one here is that I change analyse to transfer data into the sub-processes via disk instead of via network. I must admit I am a little unsure on this as I'm not sure how reliable parallel file reads are, though seems to work fine in my testing so far ...

I also added a .validate argument to skip the input validation tests that were taking a surprising amount of time

gowerc commented 1 month ago

@gravesti - It's just dawned on me that the usage of RDS files to load data into the parallel processes means we are limiting parallelisation to only function on local machines. I think (at least from the documentation) that PSOCK supports running parallel processes on remote machines which wouldn't work in this case. The RDS loading saves ~3/4 seconds so I'm not sure wether to switch back to network loading or to just put a big warning in the documentation pages or provide a toggle-able option.

gravesti commented 1 month ago

@gowerc Have we been testing on a realistic analysis size? If so, then maybe it is ok to be limited to a single machine and just have a warning in the docs.

I guess the same argument applies to not using RDS. Accept the few seconds penalty for less complexity and no restrictions or extra arguments.

gowerc commented 1 month ago

Talking to Marcel and Alessandro I think a realistic sample size would be in the order of 1000-2000 e.g. our test code is likely in the right region (e.g. 20-30 seconds in sequential). Though I think a common use case for this will be in tipping point analyses where the same call to analyse() is run 20-40 times with slightly different offsets. This is an awkward one where the answer is really "it depends...".

I think my current assumption would be that people who are using rbmis internal paralisation are likely just speeding up individual runs on their local machines. If a user is getting to the stage where they need to run it across different nodes (say for multiple sensitivity analyses on different sub-endpoints) they are more likely going to be not using rbmi's internal parallisation and instead be parallelising the different runs of rbmi externally to rbmi.

Given how inefficient the internal parallelisation of rbmi is for analyse() the idea of recruiting a whole cluster of nodes seems ridiculous so I guess I'd rather support the optimisation on a local machine.

Then again I fear people will just assume this functionality is fully compatible with parallels remote PSOCK setup and run it anyway, e.g. we are breaking that interface...

I think on balance I'm inclined to leave it as is and add a warning note in the documentation.

gravesti commented 1 month ago

@gowerc Will you merge the lintr branch first? Do you still need to update the docs in this PR?

gowerc commented 1 month ago

@gravesti - Have merged in the lintr PR and have added an additional warning regarding the psock network stuff.

gowerc commented 1 month ago

Urgh, just seen theres a bunch of notes that I need to resolve:

* checking R code for possible problems ... NOTE
analyse : <anonymous> : inner_fun: no visible binding for global
  variable ‘..rbmi..analysis..imputations’
analyse : <anonymous> : inner_fun: no visible binding for global
  variable ‘..rbmi..analysis..delta’
analyse : <anonymous> : inner_fun: no visible global function
  definition for ‘..rbmi..analysis..fun’
Undefined global functions or variables:
  ..rbmi..analysis..delta ..rbmi..analysis..fun
  ..rbmi..analysis..imputations

gowerc commented 1 month ago

Ok I think that last commit should address the NOTE

insightsengineering / rbmi

Parallelisation support for analyse #435