Parallelise styling - Githubissues

lorenzwalthert commented 7 years ago

Functions like style_pkg() could be parallelised well, at least in principle - and speed up styling much.

For pre-commit hook, we should probably disable this.

Edit: a related issue is caching of results to the end of improving speed. See #320.

lorenzwalthert commented 6 years ago

Reference: #370.

krlmlr commented 6 years ago

Could be on by default, works for me except for the output of incomplete lines (to indicate progress). I think we can solve this by printing complete lines after completion in the multicore case.

We need to evaluate if it's worth to enable multiprocess on Windows.

Check if purrr::map() already supports parallel execution when implementing this: https://github.com/tidyverse/purrr/issues/121#issuecomment-371446577.

pat-s commented 5 years ago

Can you give me a heads-up on the current status here? I was planning to create a macro for tic (running on Travis CI for every build) that uses multicore support.

I guess no one knows when purrr will have a native integration so using furrr seems to be a good alternative right now?

lorenzwalthert commented 5 years ago

Yes, progress bars are an issue. We could first also implement a verbose argument as suggested in #375 and turn verbose off for multicore support.

We need to evaluate if it's worth to enable multiprocess on Windows.

Why does the operating matter here and would you prefer future::multicore (forked process, not supported on Windows) over future::multisession?

Using furrr sounds like a good plan to me. The only thing I am not sure about is how to choose the strategy. Because in the help file, it says:

Please refrain from modifying the future strategy inside your packages / functions, i.e. do not call plan() in your code. Instead, leave the control on what backend to use to the end user. This idea is part of the core philosophy of the future framework - as a developer you can never know what future backends the user have access to. Moreover, by not making any assumptions about what backends are available, your code will also work automatically with any new backends developed after you wrote your code.

If you think it is necessary to modify the future strategy within a function, then make sure to undo the changes when exiting the function. This can be done using:

oplan <- plan() on.exit(plan(oplan), add = TRUE) [...]

So not calling plan in styler code would mean that the user has to call it if he wants to turn on parallel processing, right? Is that the preferred option I guess?

lorenzwalthert commented 5 years ago

I did some more investigation on this.

furrr (which uses future under the hood) might indeed be a good solution, but we can also just use https://github.com/HenrikBengtsson/future.apply and https://github.com/HenrikBengtsson/progressr for progress bars.
However, there is a bigger problem. The usage of the strategy multicore is not possible on windows, leaving only multisession and cluster as parallel options for windows users. However, they both don't play with withr::with_dir(). Essentially, they don't respect the working directory. Do you think we should file an issue in the future repo?

get_wd_from_temp_dir <- function() {
  withr::with_dir(tempdir(), {
    future.apply::future_sapply(1:2, function(x) {
      print(getwd())
    })
  })
}

future::plan(future::sequential)
get_wd_from_temp_dir() # works
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [2] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
future::plan(future::multisession)
getwd()
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
get_wd_from_temp_dir() # does not work
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [2] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"

^{Created on 2019-08-03 by the reprex package (v0.3.0)}

lorenzwalthert commented 5 years ago

We could probably work around that by converting paths to absolute paths at some point and not use withr::with_dir(), but I am not sure this is a good way to solve the problem. The working directory can be set in low-level function makeClusterPSOCK() but I don't know how to pass values there when just using plan(multisession).

RichardJActon commented 5 years ago

I was going to say have you considered plan(multiprocess) which runs multi-core or multithreaded where available and multi-session where not, but I ran the below to check on the behaviour for passing variables to the child process and apparently it does not do this in Rstudio but falls back on multisession.

If you can define a working dir variable in package scope it should get passed to a child process in which it is referenced by default with some caveats it can also be done explicitly see

tmpDir <- tempdir()

get_wd_from_temp_dir <- function() {
    future.apply::future_sapply(1:2, function(x) {
        withr::with_dir(tmpDir, {
            print(getwd())
        })
    })
}

future::plan(future::sequential)

get_wd_from_temp_dir()
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1" "/tmp/RtmpfOVKI1"

future::plan(future::multiprocess)
#> Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
#> in future (>= 1.13.0) when running R from RStudio, because it is
#> considered unstable. Because of this, plan("multicore") will fall
#> back to plan("sequential"), and plan("multiprocess") will fall back to
#> plan("multisession") - not plan("multicore") as in the past. For more
#> details, how to control forked processing or not, and how to silence this
#> warning in future R sessions, see ?future::supportsMulticore

get_wd_from_temp_dir()
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1" "/tmp/RtmpfOVKI1"

^{Created on 2019-08-25 by the reprex package (v0.3.0)}

lorenzwalthert commented 4 years ago

I opened an issue: https://github.com/HenrikBengtsson/future.apply/issues/50

pat-s commented 4 years ago

Regarding parallel backends: Consider the callr backend. It solves many issues which multicore/multisession face.

HenrikBengtsson commented 4 years ago

As @RichardJActon suggests in https://github.com/r-lib/styler/issues/277#issuecomment-524662969, you could do:

withr::with_dir(tempdir(), {
  pwd <- getwd()
  future.apply::future_sapply(1:2, function(x) {
    setwd(pwd)
    print(getwd())
  })
})

This will work with all future backends that run on the localhost (e.g. multicore, multisession, future.callr::callr, future.batchtools::batchtools_local) and should produce an error other types of backends that don't have access to the local filesystem of the main R session. To robustify it further, test also for localhost, e.g.

You can robustify this and provide more informative error messages by using:

get_wd <- function() {
  structure(getwd(), hostname = Sys.info()[["nodename"]])
}

set_wd <- function(pwd) {
  target_hostname <- attr(pwd, "hostname")
  if (is.null(target_hostname)) {
    stop("Cannot change directory. Argument 'pwd' lacks attribute 'hostname'.")
  }
  hostname <- Sys.info()[["nodename"]]
  if (!identical(target_hostname, hostname)) {
    stop(sprintf("Cannot change directory on %s. Target directory %s is on another machine (%s).", sQuote(hostname), sQuote(pwd), target_hostname))
  }
  if (!utils::file_test("-d", pwd)) {
    stop(sprintf("No such directory on %s: %s", sQuote(hostname), sQuote(pwd)))
  }
  setwd(pwd)
}

and then

withr::with_dir(tempdir(), {
  pwd <- get_wd()
  future.apply::future_sapply(1:2, function(x) {
    set_wd(pwd)
    print(getwd())
  })
})

e.g.

Error in set_wd(pwd) : 
  Cannot change directory on ‘remote.machine.org’. Target directory '/tmp/hb/Rtmp8JlYCf' is on another machine ('alice-laptop').

For a full explanation, see https://github.com/HenrikBengtsson/future/issues/363#issuecomment-595446071.

lorenzwalthert commented 4 years ago

Thanks for taking time fur such a detailed answer @HenrikBengtsson. As of now, style_dir() and style_pkg() work like this:

1) change to root directory to style. 2)

# simplified
for (file in files_to_style) {
  # read content
  # style content
  # write content
  # print status of styling
}

To make styling work with remote workers, I think we had to change this to:

1) change to root directory to style. 2) read all content before calling a future backend

out <- future.apply::future_sapply(file_contents, 
  function() {
    # style content
    # communicate before writing back? or only later
})

3) writing back contents, potentially communicate styling results.

I think I don't like the fact that we can write files only at the very end of the process (please correct me if this assumption is wrong). I think it's nice when you style many files and you see progress (on the console and with git diff) as you go. Also, I don't know who would want to distribute a task like styling files on a cluster and it's almost always done interactively 😄. So, the simplest solution would probably be to:

Only allow strategies to resolve on localhost. Is there a way to check if the currently active strategy is going to be resolved on the localhost or not except from enumerating (which is dangerous because in the future there might be more backends that are local we won't add them to our list)? I only found future:::is_localhost().
If checking the above is easy: If the current strategy is not going to resolve futures on to localhost, temporarily change the strategy to sequential (Is this a good fallback?), emit a warning, proceed and set the initial strategy on exit as described in ?plan.

This essentially would boil down to what @HenrikBengtsson suggested in https://github.com/r-lib/styler/issues/277#issuecomment-595879622.

Just one edge case: What if a remote machine has the same node name and the directory to change to exists, will it fail or not? Because I think it should fail. Otherwise, it will have side effects on the remote machine (i.e. styling the files there instead of the localhost I think).

HenrikBengtsson commented 4 years ago

Is there a way to check if a strategy is going to be resolved on the localhost or not?

No. The reason is that such features need to come in at the design level (as I mention in the corresponding future issue). That is major work to solve. I don't want to open up for half-baked thing before because then we're ending up with too many hacks and a dead end in the long run.

I found future:::is_localhost()

That's completely different and definitely not meant for others to use.

Just one edge case: What if a remote machine has the same node name and the directory to change to exists, will it fail or not?

You'd need to add even more protections, e.g. write a unique file on master and verify that the worker sees it.

RichardJActon commented 4 years ago

I just tried a quick implementation of this in my fork https://github.com/RichardJActon/styler passing the directory to the workers and changing this line: https://github.com/RichardJActon/styler/blob/4a28e70617aaafbe9713c465983e562d9da8ee5f/R/transform-files.R#L23 to the furrr version which seems to work fine. The progress only shows up in chunks as the workers return though.

lorenzwalthert commented 4 years ago

Nice. Interested in a PR? Yeah I think that's not so desirable. Any idea how to fix it? Also, why furrr and not future.apply directly?

pat-s commented 4 years ago

Also, why furrr and not future.apply directly?

Just my 2 cents while watching: {furrr} hasn't seen a commit in 2 years and I'd rely on {future.apply}, especially for pkg use. There is nothing that {future.apply} cannot do what {furrr} can.

RichardJActon commented 4 years ago

Using future.apply instead is fine, I just use furrr a fair bit day-to-day for the drop-in purrr::map* replacements. I'm not really sure where to start on getting progress to show as individual files are completed - I suspect that it would be pretty complex. I could do a PR, i'm a first time contributor here so anything in particular I need to do before I open one? Also should I drop an example of using the parallelised versions in the examples for the style_pkg & style_dir docs something like:

library(future)
plan(multisession, workers = 4)
style_pkg()

lorenzwalthert commented 4 years ago

Some new updates to {progressr} that are relevant here too: https://twitter.com/henrikbengtsson/status/1260336782421323777

lorenzwalthert commented 3 years ago

@krlmlr any progress on that with the {callr} approach?

krlmlr commented 3 years ago

Not yet.

r-lib / styler

Parallelise styling #277