mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

new feature making clustermq "pipeable" #318

Open wds15 opened 8 months ago

wds15 commented 8 months ago

Hi!

First, clustermq is really great; it powers a lot of what I do. Today I wrote a small utility function that makes the "Q" functions compatible with the pipe syntax, which is used a lot in R workflows. Could something like this be implemented in clustermq directly?

library(brms)
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## which does not work with Q_rows as Q_rows sends the individual
## columns as arguments to the function. Thus the function below does
## nest things in a way so that clustermq can be applied directly
## here:

Q_rows_nested <- function(data, fun, arg, ...) {
    data |>
        dplyr::mutate(.row=1:dplyr::n()) |>
        tidyr::nest(data=-.row) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}

## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, newdata, const=list(object=fit1))

The above pays off mostly for huge simulations and fits. A nice addition would be chunking, so that the data is sent in bigger pieces rather than row by row; that should be easy to add.

This is just a feature suggestion as I think this could be useful for many others as well.

wds15 commented 8 months ago

Here is an improved version. It defaults the argument name to the first formal argument of the function, and it does chunking, which can speed things up a lot:

library(brms)
library(tidybayes)
library(dplyr)
library(tidyr)

fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## which does not work with Q_rows as Q_rows sends the individual
## columns as arguments to the function. Thus the function below does
## nest things in a way so that clustermq can be applied directly
## here:

Q_rows_nested <- function(data, fun, arg, chunk_size=1, ...) {
    if(missing(arg)) {
        arg <- rlang::sym(names(formals(fun))[1])
    }
    data |>
        dplyr::mutate(.chunk=sort(rep(seq_len(ceiling(dplyr::n()/chunk_size)), length.out=dplyr::n()))) |>
        tidyr::nest(data=-.chunk) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}

## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, const=list(object=fit1), pkgs="tidybayes", n_jobs=6)

mschubert commented 7 months ago

Thanks for the idea and great to hear that the package is working well for you!

The way I understand it, you want to pass a row or a number of rows of a data frame as one combined argument to a function.

Instead of nesting the data, I would go about it like this:

with_rvars = clustermq::Q(
    tidybayes::add_predicted_rvars,
    newdata = split(epilepsy, seq_len(nrow(epilepsy))),
    const = list(object=fit1),
    n_jobs = 6
) |> dplyr::bind_rows()

That looks fairly straightforward to me. clustermq will chunk bigger data for you; you could also add chunking manually if calling tidybayes::add_predicted_rvars once per row adds too much overhead.
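The manual chunking mentioned above could look roughly like this. It is a minimal sketch, not part of clustermq: `split_rows` is a hypothetical helper name, and the built-in `mtcars` data stands in for `epilepsy` so the snippet runs without a fitted model or a cluster.

```r
# Hypothetical helper (not part of clustermq): split a data frame into
# consecutive chunks of at most `chunk_size` rows each.
split_rows <- function(df, chunk_size = 1) {
    chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
    split(df, chunk_id)
}

# mtcars is only a stand-in data set for illustration
chunks <- split_rows(mtcars, chunk_size = 10)
rows_per_chunk <- vapply(chunks, nrow, integer(1))

# Each chunk would then become one function call, e.g. (not run here):
# with_rvars <- clustermq::Q(
#     tidybayes::add_predicted_rvars,
#     newdata = split_rows(epilepsy, chunk_size = 50),
#     const = list(object = fit1),
#     n_jobs = 6
# ) |> dplyr::bind_rows()
```

With `chunk_size = 1` this reduces to the one-row-per-call split shown earlier.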

I'm not sure if adding a new concept like Q_rows_nested will make the package easier to use overall. Rather, I'd prefer to only add new functionality if a task can't be (easily) done with the existing API.

What do you think?

wds15 commented 7 months ago

Nice alternative version. However, it is not "pipeable": the user cannot pipe a data frame into a Q-powered call.
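For illustration, a data-first wrapper around the split-based approach might look like the sketch below. Everything named here is an assumption for the example and not part of clustermq: `Q_pipe`, `rows_per_call`, the `runner` argument, the stand-in `local_runner` (used so the snippet runs without a cluster) and the toy function `add_scaled`.

```r
# Hypothetical pipeable wrapper (not part of clustermq): the data frame comes
# first, is split into chunks, and the chunks are passed under the name of
# the first formal argument of `fun`.
Q_pipe <- function(data, fun, rows_per_call = 1, ..., runner = clustermq::Q) {
    arg <- names(formals(fun))[1]
    chunks <- split(data, ceiling(seq_len(nrow(data)) / rows_per_call))
    args <- c(list(fun), stats::setNames(list(chunks), arg), list(...))
    do.call(runner, args)
}

# Sequential stand-in for clustermq::Q, only so this sketch runs locally;
# it handles a single iterated argument plus `const`.
local_runner <- function(fun, ..., const = list()) {
    iter <- list(...)
    lapply(iter[[1]], function(x) {
        do.call(fun, c(stats::setNames(list(x), names(iter)[1]), const))
    })
}

# Toy function standing in for tidybayes::add_predicted_rvars
add_scaled <- function(newdata, factor) {
    newdata$mpg_scaled <- newdata$mpg * factor
    newdata
}

res <- mtcars |>
    Q_pipe(add_scaled, rows_per_call = 8,
           const = list(factor = 2), runner = local_runner)
out <- do.call(rbind, unname(res))
```

Leaving `runner` at its default would hand the same arguments to `clustermq::Q`, so options such as `n_jobs` or `pkgs` could be passed through `...`.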

The other day it occurred to me that this should probably be refined into a "Q_mutate" function, which would spare the user from defining intermediate functions; those are currently needed to operate on multiple columns at once.

I totally agree with not bloating a package with unnecessary code. How about we leave this issue open for a while so that we can collect better ideas for the above function, and eventually include it in some form in the documentation, for example as a section on the pkgdown site?

mschubert commented 7 months ago

Happy to leave this open for a while and see what we come up with!