r-lib / R6

Encapsulated object-oriented programming for R
https://R6.r-lib.org

R6 instances are copied using FORK shared memory parallelization #115

Closed kajomano closed 7 years ago

kajomano commented 7 years ago

Hello,

I'm trying to use your package with a parLapply() call inside one of the member functions. I'm using a FORK cluster, which shares memory between processes: if a variable is only read (and not modified) inside the parLapply() call, it should not be copied into the local memory of the child processes. However, the R6 object instance is copied in full to the children every time parLapply() is called. Here is a reproducible example (note that FORK clusters only work on Unix-like operating systems):

library( "R6" )
library( "parallel" )

# Create test data large enough so that it easily shows up in the system monitor
n.row <- 10^6
n.col <- 250
test.data <- matrix( rnorm( n.row * n.col ), ncol = n.col )
tracemem( test.data )

# Creating an R6 class
test.class <- R6Class(
  "test",
  public = list(
    x = NULL,
    cl = NULL,

    initialize = function( x ){
      self$x <- x
    },

    start.parallel = function(){
      self$cl <- makeForkCluster( 3, useXDR = FALSE )
    },

    run.parallel = function(){
      parLapply( self$cl, 1:3, function(node){
        return()
      })
    },

    stop.parallel = function(){
      stopCluster( self$cl )
    }
  )
)

# Create test instance
test.instance <- test.class$new( test.data )

# Create cluster
test.instance$start.parallel()

# Run the parallel computation, watch the memory consumed in the system monitor
# This call consumes a huge amount of memory, even though tracemem() does not
# report the 'test.data' variable being copied
test.instance$run.parallel()

# Stop the cluster
test.instance$stop.parallel()

(screenshot: parallel_bug) As you can see, a lot of memory gets allocated even though the variables weren't touched at all. This is the same problem reported here: stackoverflow. I consider this a bug, since it defeats the purpose of shared memory.

wch commented 7 years ago

A quick note: I wouldn't expect tracemem to report when an object is copied in a forked process.
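
To be concrete, tracemem() only reports copy-on-modify duplications made in the current R process, so a copy that happens inside a forked worker never shows up in the master session. A minimal sketch (the objects here are just throwaway examples):

x <- rnorm(10)
tracemem(x)

y <- x        # no message: x and y still share the same data
y[1] <- 0     # tracemem prints here: modifying y forces a real copy in this process

# An allocation made inside a forked worker (for example when it unserializes
# the function and its environments) happens in the child process, so it never
# produces a tracemem message here.
untracemem(x)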

The problem here isn't with R6 specifically, but with the way environments are used with parLapply. Here's a version of your code that doesn't use R6 at all, but instead uses the environments created when a function is called:

library(parallel)

# Create test data large enough so that it easily shows up in the system monitor
n.row <- 1e6
n.col <- 250
test.data <- matrix( rnorm( n.row * n.col ), ncol = n.col )

make_instance <- function(data = NULL) {
  x <- data
  cl <- NULL

  run <- function() {
    parLapply(cl, 1:3, function(node) { NULL })
  }
  start <- function(){
    cl <<- makeForkCluster(3, useXDR = FALSE)
  }
  stop <- function(){
    stopCluster(cl)
  }

  # Return the environment created when this function was invoked
  environment()
}

instance <- make_instance(test.data)
instance$start()
instance$run()  # Slow, and consumes a lot of memory
instance$stop()

The call to instance$run() takes a long time and results in a lot of memory allocation, similar to your R6 version above.

The cause of the problem is described here: http://stackoverflow.com/a/35857490/412655. The function that you pass to parLapply is serialized and sent to the workers, and when serialized, the function includes its ancestor environments. In my version of the code, the x object is in an ancestor environment of the function and is therefore copied. (Because an R6 object is essentially a set of environments, the same thing happens with R6.) You can see this if you change the function to return its set of parent environments using pryr::parenvs():

make_instance <- function(data = NULL) {
  x <- data
  cl <- NULL

  run <- function() {
    parLapply(cl, 1:3, function(x) pryr::parenvs())
  }
  start <- function(){
    cl <<- makeForkCluster(3, useXDR = FALSE)
  }
  stop <- function(){
    stopCluster(cl)
  }

  environment()
}

instance <- make_instance(test.data)
instance$start()
instance$run()
# [[1]]
#   label                      name
# 1 <environment: 0x102c08840> ""
# 2 <environment: 0x102c08878> ""
# 3 <environment: 0x102c088b0> ""
# 4 <environment: R_GlobalEnv> ""
# 
# [[2]]
#   label                      name
# 1 <environment: 0x102c65ed8> ""
# 2 <environment: 0x102c65f10> ""
# 3 <environment: 0x102c65f48> ""
# 4 <environment: R_GlobalEnv> ""
# 
# [[3]]
#   label                      name
# 1 <environment: 0x102cac1b8> ""
# 2 <environment: 0x102cac1f0> ""
# 3 <environment: 0x102cac228> ""
# 4 <environment: R_GlobalEnv> ""
instance$stop()

If you were to inspect those environments, you would find your test data (stored as x) somewhere in them.
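
You can also check this from the master process, without touching the workers. A minimal sketch, assuming the instance created above is still in the session: run() was defined inside make_instance(), so its enclosing environment is the environment that make_instance() returned, and that environment holds the big matrix as x.

# run()'s enclosing environment is the environment returned by make_instance(),
# which is where the data lives
env <- environment(instance$run)
ls(env)
# [1] "cl"    "run"   "start" "stop"  "x"
format(object.size(env$x), units = "auto")
# roughly "1.9 Gb"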

To avoid this, your function should be defined outside, so that the data does not end up in one of its ancestor environments:

myfunc <- function(x) { NULL }

make_instance <- function(data = NULL) {
  x <- data
  cl <- NULL

  run <- function() {
    parLapply(cl, 1:3, myfunc)
  }
  start <- function(){
    cl <<- makeForkCluster(3, useXDR = FALSE)
  }
  stop <- function(){
    stopCluster(cl)
  }

  environment()
}

instance <- make_instance(test.data)
instance$start()
instance$run()    # Runs fast now
instance$stop()

You might have noticed that the global environment is an ancestor of all of the environments here, and since it contains the test.data object, serializing it should also take a long time. Apparently parLapply doesn't serialize the global environment for each worker, though. I don't know enough about the parallel package to say why it's implemented that way. See this for a bit more: http://stackoverflow.com/a/18037825/412655
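
You can see both effects directly with serialize() in a minimal sketch (the helpers below, make_closure() and friends, are just made-up names for illustration): a closure drags its non-global ancestor environments along, while a function whose enclosing environment is the global environment serializes to almost nothing, because the global environment is written out as a reference rather than by value.

big <- rnorm(1e6)                   # ~8 MB object sitting in the global environment

f_global <- function(i) NULL        # enclosing environment is the global env

make_closure <- function() {
  captured <- rnorm(1e6)            # ~8 MB stored in this call's environment
  function(i) NULL                  # enclosing environment is this call's env
}
f_closure <- make_closure()

length(serialize(f_global, NULL))   # tiny: the global env (and 'big') is only a reference
length(serialize(f_closure, NULL))  # about 8e6 bytes: 'captured' is written out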

kajomano commented 7 years ago

Thank you for the thorough answer; you spared me at least a day of research. Very helpful!