Closed kajomano closed 7 years ago
A quick note: I wouldn't expect tracemem
to report when an object is copied in a forked process.
The problem here isn't with R6 specifically, but with the way environments are used with parLapply
. Here's a version of your code that doesn't use R6 at all, but instead uses the environments created when a function is called:
library(parallel)
# Create test data large enough so that it easily shows up in the system monitor
n.row <- 1e6
n.col <- 250
test.data <- matrix( rnorm( n.row * n.col ), ncol = n.col )
make_instance <- function(data = NULL) {
x <- data
cl <- NULL
run <- function() {
parLapply(cl, 1:3, function(node) { NULL })
}
start <- function(){
cl <<- makeForkCluster(3, useXDR = FALSE)
}
stop <- function(){
stopCluster(cl)
}
# Return the environment created when this function was invoked
environment()
}
instance <- make_instance(test.data)
instance$start()
instance$run() # Slow, and consumes a lot of memory
instance$stop()
The call to instance$run()
take a long time, and it results in a lot of memory allocation, similar to your R6 version above.
The cause of the problem is described here: http://stackoverflow.com/a/35857490/412655. The function that you are passing to parLapply
is serialized and sent to the workers. When serialized, the function includes its ancestor environments. In my version of the code, the x
object is in an ancestor environment of the function, and therefore is copied. (Because an R6 object is essentially a set of of environments, the same thing happens with R6.) You can see this if you change the function to return the set of parent environments using pryr::parenvs()
:
make_instance <- function(data = NULL) {
x <- data
cl <- NULL
run <- function() {
parLapply(cl, 1:3, function(x) pryr::parenvs())
}
start <- function(){
cl <<- makeForkCluster(3, useXDR = FALSE)
}
stop <- function(){
stopCluster(cl)
}
environment()
}
instance <- make_instance(test.data)
instance$start()
instance$run()
# [[1]]
# label name
# 1 <environment: 0x102c08840> ""
# 2 <environment: 0x102c08878> ""
# 3 <environment: 0x102c088b0> ""
# 4 <environment: R_GlobalEnv> ""
#
# [[2]]
# label name
# 1 <environment: 0x102c65ed8> ""
# 2 <environment: 0x102c65f10> ""
# 3 <environment: 0x102c65f48> ""
# 4 <environment: R_GlobalEnv> ""
#
# [[3]]
# label name
# 1 <environment: 0x102cac1b8> ""
# 2 <environment: 0x102cac1f0> ""
# 3 <environment: 0x102cac228> ""
# 4 <environment: R_GlobalEnv> ""
instance$stop()
If you were to inspect those environments, you would find your test data (stored as x
) somewhere in them.
To avoid this, your function should be defined outside, so it doesn't inherit the data in one of its ancestor environments:
myfunc <- function(x) { NULL }
make_instance <- function(data = NULL) {
x <- data
cl <- NULL
run <- function() {
parLapply(cl, 1:3, myfunc)
}
start <- function(){
cl <<- makeForkCluster(3, useXDR = FALSE)
}
stop <- function(){
stopCluster(cl)
}
environment()
}
instance <- make_instance(test.data)
instance$start()
instance$run() # Runs fast now
instance$stop()
You might have noted that the global environment is an ancestor environment to all of the environments here, and because it contains the test.data
object, serializing it should also take a long time. Apparently parLapply
doesn't serialize the global environment for each worker. I don't know enough about the parallel package to say why it's implemented that way. See this for a bit more: http://stackoverflow.com/a/18037825/412655
Thank you for the exhaustive answer, you spared me at least a day of research. Very helpful!
Hello,
I'm trying to use your package with a parLapply() call inside one of the member functions. I'm using a FORK cluster that implements shared memory between the processes, meaning that if a variable is not modified and only read in the parLapply() call, it should not be copied to the local memory of the child processes. However, the R6 object instance is hard copied to the children whenever you do something with them (every time you call parLapply()). Here is a reproducible example (watch out, FORK clusters only work on unix-like operating systems):
As you can see a lot of memory gets allocated, even though the variables weren't even touched. This problem is the same as reported in here: stackoverflow. I consider this as a bug as it's against what shared memory was designed for.