ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

zombies #116

Closed CarlBoneri closed 7 years ago

CarlBoneri commented 7 years ago

Not sure if you've had issues with zombie processes being left behind on Linux systems, but per our Stack Exchange thread, I thought this might be useful for the package. Note that it requires the inline package to be installed.

#' Parallel cleanup of zombie process
#'
#'
#'
#' \code{fork.kill_zombies}
#'
#'
#' After running one or more concurrent forks of a process, 'zombie'
#'  processes can be left behind on the server's kernel side. Although
#'  these are not harmful, they can create overlap or debugging issues when
#'  working within R. This function is meant to be called after the parallel
#'  call has finished or exited, and is nothing more than a clean-up.
#'  The upside is that all zombie processes created within the loop
#'  are then removed in place.
#'
#'
#'
#' @family Parallel processing
#'
#'
#' Credit where credit is due:
#' @references
#'
#' \strong{stackoverflow} \emph{r-parallel-computing-and-zombie-processes}
#'
#' @examples
#'
#' gend_dir <- "/home/*/data/gender/by_year"
#'
#' gend_files <- list.files(gend_dir, full.names = TRUE,
#'                          recursive = TRUE,
#'                          pattern = "[0-9]{4}\\.txt")
#'
#' > system.time({
#'  gender_year_data <<- parallel::mclapply(
#'        gend_files, gendr.year_parser2,
#'        mc.cores = 18L
#'      )
#'   })
#'
#' user  system elapsed
#' 5.008   2.856   0.924
#'
#' ## This fork leaves behind 9 zombies
#'
#' > system('ps -u *', intern = F)
#'  PID TTY          TIME CMD
#'  4029 ?        00:01:40 rsession
#'  4919 ?        00:00:00 rsession <defunct>
#'  4920 ?        00:00:00 rsession <defunct>
#'  4921 ?        00:00:00 rsession <defunct>
#'  4922 ?        00:00:00 rsession <defunct>
#'  4923 ?        00:00:00 rsession <defunct>
#'  4924 ?        00:00:00 rsession <defunct>
#'  4926 ?        00:00:00 rsession <defunct>
#'  4933 ?        00:00:00 rsession <defunct>
#'  4934 ?        00:00:00 rsession <defunct>
#'  4937 ?        00:00:00 sh
#'  4938 ?        00:00:00 ps
#'
#'  > fork.kill_zombies()
#'  > system('ps -u *', intern = F)
#'
#'  # ALL GONE!
#'  PID TTY          TIME CMD
#'  4029 ?        00:01:43 rsession
#'  5000 ?        00:00:00 sh
#'  5001 ?        00:00:00 ps
#'
#'
#'
fork.kill_zombies <- function(...){

  includes <- '#include <sys/wait.h>'
  code <- 'int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};'

  wait <- inline::cfunction(
    body = code,
    includes = includes,
    convention='.C'
  )

  invisible(wait())
}
wlandau-lilly commented 7 years ago

@CarlBoneri great idea. When my tests fail, I sometimes do get warnings in R about zombie processes. However, I hesitate to build this in right away because I am worried about platform dependence. Here is the result of running fork.kill_zombies() on Windows 7 with 32-bit R-devel (r73342, Rtools34.exe).

file1b7473216171.cpp:3:22: fatal error: sys/wait.h: No such file or directory
 #include <sys/wait.h>
                      ^
compilation terminated.
make: *** [file1b7473216171.o] Error 1

ERROR(s) during compilation: source code errors or compiler configuration errors!

Program source:
  1: #include <R.h>
  2: 
  3: #include <sys/wait.h>
  4: 
  5: extern "C" {
  6:   void file1b7473216171 (  );
  7: }
  8: 
  9: void file1b7473216171 (  ) {
 10: int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};
 11: }
Error in compileCode(f, code, language, verbose) : 
  Compilation ERROR, function(s)/method(s) not created! file1b7473216171.cpp:3:22: fatal error: sys/wait.h: No such file or directory
 #include <sys/wait.h>
                      ^
compilation terminated.
make: *** [file1b7473216171.o] Error 1
In addition: Warning message:
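One way around the compilation failure above would be to guard the call so that it does nothing on platforms without `sys/wait.h`. The following is only a sketch under stated assumptions: the name `reap_zombies_safely()` is hypothetical, and it assumes that checking `.Platform$OS.type` and the availability of the inline package is enough to decide whether compiling is worth trying.

```r
# Hypothetical guarded wrapper around the waitpid() loop from this thread.
# Returns TRUE (invisibly) if the reaping code ran, FALSE otherwise.
reap_zombies_safely <- function() {
  # sys/wait.h is a POSIX header, so skip non-Unix platforms such as Windows.
  if (.Platform$OS.type != "unix") {
    return(invisible(FALSE))
  }
  # The inline package is only an optional dependency here.
  if (!requireNamespace("inline", quietly = TRUE)) {
    return(invisible(FALSE))
  }
  ok <- tryCatch(
    {
      reap <- inline::cfunction(
        body = "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};",
        includes = "#include <sys/wait.h>",
        convention = ".C"
      )
      reap()
      TRUE
    },
    # A missing or misconfigured compiler should not abort the caller.
    error = function(e) FALSE
  )
  invisible(ok)
}
```

With a guard like this, a call on Windows would return `FALSE` quietly instead of stopping with a compilation error.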
wlandau-lilly commented 7 years ago

Not sure, but processx may be able to help too.

wlandau-lilly commented 7 years ago

I had a look at the forums, beginning with this SO thread and the ones linked at the top. I have decided that I do not want drake to be too opinionated on this topic. After all, the needs of a good cleanup depend highly on the parallel backend. What works for mclapply will most certainly not work for SLURM. However, I have tried to address the issue in two ways.

  1. Document zombie processes in the caution and parallelism vignettes.
  2. Make sure the PSOCK cluster in the parLapply backend always cleans up, even if make() fails.

Given that drake needs to work on all platforms and all parallel backends, I think this is the best we can do.
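For illustration, the second point could look roughly like the sketch below. This is not drake's actual implementation; `run_on_cluster()` and its arguments are hypothetical names, and the only idea shown is registering `stopCluster()` with `on.exit()` before any work starts.

```r
# Hypothetical helper: create a PSOCK cluster, run the work, and always
# stop the workers, whether parLapply() succeeds or throws an error.
run_on_cluster <- function(x, fun, workers = 2L) {
  cl <- parallel::makePSOCKcluster(workers)
  # Registered immediately, so it runs on normal return and on error.
  on.exit(parallel::stopCluster(cl), add = TRUE)
  parallel::parLapply(cl, x, fun)
}

# Usage: the cluster is stopped in both cases.
squares <- run_on_cluster(1:3, function(i) i^2)
failed <- tryCatch(
  run_on_cluster(1:2, function(i) stop("boom")),
  error = function(e) "failed"
)
```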

@CarlBoneri thank you for bringing up zombies.

CarlBoneri commented 7 years ago

No problem. I don't think there are zombie processes on Windows?

kendonB commented 7 years ago

@CarlBoneri there are pseudo-zombie processes on Windows. It's not completely predictable, but it often happens when parLapply fails or the user assigns a second cluster object to the same object name (as in cl <- makeCluster(2); cl <- makeCluster(2)).

In my experience, they always die with the parent process (hence the pseudo).
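A minimal sketch of how to avoid the reassignment case, assuming the lingering workers come from the first cluster never being stopped: call `stopCluster()` on the old object before binding the name to a new cluster. The variables `first_pids` and `second_pids` are only there for illustration.

```r
# Stop the existing cluster before reusing the name `cl`.
cl <- parallel::makeCluster(2)
first_pids <- unlist(parallel::clusterCall(cl, Sys.getpid))
parallel::stopCluster(cl)       # shut down the first set of workers

cl <- parallel::makeCluster(2)  # the name now refers to a fresh cluster
second_pids <- unlist(parallel::clusterCall(cl, Sys.getpid))
parallel::stopCluster(cl)
```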

wlandau-lilly commented 7 years ago

Zombies from a parLapply() failure clean up easily with on.exit(stopCluster()), which is why I was glad to see this issue on the tracker. I did not know the bit about cl <- makeCluster(2); cl <- makeCluster(2).

Cleanup might be easier with a function like with_cluster():

with_cluster <- function(cl, code){
  withCallingHandlers(
    code,
    error = function(e){
      parallel::stopCluster(cl)
      stop(e)
    }
  )
}

See r-lib/withr#59. Thanks for the idea, @kendonB.
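Note that with_cluster() as written only stops the cluster when the code fails. A possible variant that stops it on success as well is sketched below; `with_cluster_finally()` is a hypothetical name and not part of withr or drake.

```r
# Hypothetical variant: always stop the cluster when the function exits.
with_cluster_finally <- function(cl, code) {
  force(cl)  # evaluate the cluster argument before registering cleanup
  on.exit(parallel::stopCluster(cl), add = TRUE)
  code       # lazily evaluated here; its value is returned
}

# Usage: `cl` is stopped after the result is computed, or after an error.
cl <- parallel::makeCluster(2)
out <- with_cluster_finally(
  cl,
  parallel::parLapply(cl, 1:2, function(i) i + 1L)
)
```

The `on.exit()` approach trades the explicit error handler for a single cleanup path, which may be easier to reason about when the cluster is only needed for one block of code.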