r-lib / R6

Encapsulated object-oriented programming for R
https://R6.r-lib.org

Serializing R6 classes? #157

Open kenarab opened 5 years ago

kenarab commented 5 years ago

Hello. cc @leobelen We are developing a package, Rpolyhedra, a database of polyhedra scraped from publicly available internet sources. It builds polyhedron R6 objects from the scraped sources. Both the ledger of the scraping process and the database itself are R6 objects.

We then store everything in an RDS file, for speed, regression testing, shipping a pre-scraped full version of the database, and other features.

@wch asked us not to include an R6 object in an RDS file, because of incompatibilities between the different versions of R6 installed on end users' computers.

We think the solution is to serialize the R6 object when saving it to RDS, so that users with different R6 versions can access the same RDS file without risk of incompatibilities.

I read the R6 code, and it is not easy for me to work out how to use metaprogramming to access the fields (state) of the objects, let alone how to guarantee that serialization/deserialization is safe.

We have started doing this with our own serialization methods, but we wonder whether there is consensus on, and a roadmap for, the package itself providing this feature.

I saw there is an as.list.R6 method (#91), which internally uses get_nonfunctions, that could be useful. But we are not sure we would be able to propose an elegant solution that satisfies the community. Maybe this issue should be solved by a developer well versed in the R6 philosophy, if there is consensus on the value of the proposed feature.

wch commented 5 years ago

R6 takes special care to make it so that the stored R6 object should function fine even on a computer with a different version of R6. However, the objects themselves are not guaranteed to be identical in structure across versions of R6.

The problem in Rpolyhedra (qbotics/Rpolyhedra#21) is that you are comparing a stored R6 object to one that is generated dynamically. The stored object may have been created on a different machine with a different version of R6, and so the resulting object may not be identical to the dynamic one, even if all the inputs are the same. In that issue, the clone() method from R6 changed, which is what caused the comparison to fail.

I think you need to be specific in what you expect from this serialization. The serialize() function will serialize the object (it is what saveRDS() uses), but it isn't suitable for your purposes for the reason I described above.
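One way to be specific about it is to store and compare only the data fields, never the R6 object itself. The sketch below is a minimal illustration under stated assumptions: the `Polyhedron` class, its fields, and the helper `polyhedron_from_state()` are hypothetical stand-ins, not code from Rpolyhedra or R6.

```r
library(R6)

# Hypothetical class standing in for a polyhedron object.
Polyhedron <- R6Class(
  "Polyhedron",
  public = list(
    name = NULL,
    vertices = NULL,
    initialize = function(name, vertices) {
      self$name <- name
      self$vertices <- vertices
    },
    # Export only the data fields: no methods, no R6 internals.
    get_state = function() {
      list(name = self$name, vertices = self$vertices)
    }
  )
)

# Hypothetical helper: rebuild an object from a stored state, using
# whichever version of R6 is installed on the current machine.
polyhedron_from_state <- function(state) {
  Polyhedron$new(name = state$name, vertices = state$vertices)
}

tetra <- Polyhedron$new(
  "tetrahedron",
  matrix(c(0, 0, 0,  1, 0, 0,  0, 1, 0,  0, 0, 1), ncol = 3, byrow = TRUE)
)

# The RDS file holds a plain list, so it does not depend on R6 at all.
path <- tempfile(fileext = ".rds")
saveRDS(tetra$get_state(), path)

restored <- polyhedron_from_state(readRDS(path))

# Compare the states, not the R6 objects themselves.
states_match <- identical(restored$get_state(), tetra$get_state())
states_match
```

Because the comparison only looks at the exported list, a change in how R6 builds objects (such as a different clone() method) should not affect it.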

d-sharpe commented 4 years ago

@wch I've been working on an R6-based package, which requires me to save the 'state' of the objects at certain points. Some of the objects are quite nested, with lists of R6 objects inside others. When created, the in-memory size is measured in 10s of MB. When saved (via save, saveRDS, or serialize) and then reloaded, the memory requirements explode to GBs. Debugging, profiling, and pryr::inspect suggest (to me at least) that the functions attached to each R6 object instance are being re-created for each instance. I believe this is related to ropensci/drake#383. Any ideas on a better save method, or a workaround, for R6 classes?

kenarab commented 4 years ago

@wch we considered your observations at the time and implemented a solution that bypasses the problem.

@d-sharpe I'm sorry but I don't know how to help you.

Let's wait for @wch's answer.

wch commented 4 years ago

@d-sharpe If you can provide a small reproducible example, that would help me understand exactly what you're facing.

d-sharpe commented 4 years ago

@wch Thanks for the response. The example below seems to illustrate what I'm seeing. I see more expansion with more functions attached to the classes and more complexity in those functions.

library(R6)

classA <-
  R6Class(
    classname = "classA",
    class = TRUE,
    cloneable = FALSE,
    public = list(
      initialize = function(n = 100) {
        private$collectionOfB <-
          lapply(seq_len(n), function(x) {
            classB$new(n = n)
          })
      },
      getInstanceOfB = function(item) {
        return(private$collectionOfB[[item]])
      },
      getNumberOfInstances = function(...) {
        return(length(private$collectionOfB))
      },
      fun1 = function(...) {
        return(runif(1))
      }
    ),
    private = list(collectionOfB = list())
  )

classB <-
  R6Class(
    classname = "classB",
    class = TRUE,
    cloneable = FALSE,
    public = list(
      initialize = function(n = 100) {
        private$collectionOfC <-
          lapply(seq_len(n), function(x) {
            classC$new(n = n)
          })
      },
      getInstanceOfC = function(item) {
        return(private$collectionOfC[[item]])
      },
      getNumberOfInstances = function(...) {
        return(length(private$collectionOfC))
      },
      fun1 = function(...) {
        return(runif(1))
      }
    ),
    private = list(collectionOfC = list())
  )

classC <-
  R6Class(
    classname = "classC",
    class = TRUE,
    cloneable = FALSE,
    public = list(
      initialize = function(n = 20) {
        private$values <- rnorm(n)
      },
      getValues = function(...) {
        return(private$values)
      },
      fun1 = function(...) {
        return(runif(1))
      },
      fun2 = function(...) {
        return(runif(1))
      },
      fun3 = function(...) {
        return(runif(1))
      },
      fun4 = function(...) {
        return(runif(1))
      },
      fun5 = function(...) {
        return(runif(1))
      }
    ),
    private = list(values = numeric(0L))
  )

x <- classA$new()

library(pryr)

object_size(x)
## 25.6 MB

x_copy <-
  unserialize(serialize(x, connection = NULL))

object_size(x_copy)
## 146 MB

wch commented 4 years ago

I think the size increase probably happens because serialize and unserialize aren't smart enough to deduplicate the components of the functions that are the same (that is, the body and formals). Here's a simple example (without R6) that illustrates:

x <- lapply(1:1000, function(i) {
  function() i
})

object_size(x)
#> 355 kB

x_copy <- unserialize(serialize(x, version = 3, connection = NULL))
object_size(x_copy)
#> 2.07 MB

d-sharpe commented 4 years ago

Thanks for looking at this @wch. It seems to apply to non-functions too:

x <- rnorm(1e5)

list_of_x <-
  list(x, x)

pryr::object_size(x)
#> 800 kB

pryr::object_size(list_of_x)
#> 800 kB

list_of_x_copy <- unserialize(serialize(list_of_x, version = 3, connection = NULL))

pryr::object_size(list_of_x_copy)
#> 1.6 MB
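
For contrast, here is a base-R sketch of the one kind of sharing that serialize() does keep: references to the same environment within a single call. The `holder` environment is an illustrative name, and wrapping data in an environment is shown only to make the difference visible, not as a recommended fix.

```r
x <- rnorm(1e5)

# Plain list: both elements point at one vector in memory, but the
# serialized stream contains the data twice.
plain <- list(x, x)
n_plain <- length(serialize(plain, connection = NULL))

# Environment holder: both elements are the same environment, and an
# environment is written only once per serialize() call.
# emptyenv() as parent keeps unrelated objects out of the stream.
holder <- new.env(parent = emptyenv())
holder$x <- x
wrapped <- list(holder, holder)
n_wrapped <- length(serialize(wrapped, connection = NULL))

# The wrapped stream is roughly half the size of the plain one.
c(plain = n_plain, wrapped = n_wrapped)

# After a round trip the two elements are still one environment.
wrapped_copy <- unserialize(serialize(wrapped, connection = NULL))
identical(wrapped_copy[[1]], wrapped_copy[[2]])
#> [1] TRUE
```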

I wrote a set of workaround functions, pickleR (still pretty basic). It gets the state of R6 instances in conjunction with #197, keeps track of object memory addresses as it traverses the object chain, and restores them while maintaining references.

list_of_x_pickle <-
  pickleR::unpickle(pickleR::pickle(list_of_x, connection = NULL))

pryr::object_size(list_of_x_pickle)
#> 800 kB

x <- lapply(1:1000, function(i) {
  function() i
})

pryr::object_size(x)
#> 367 kB

x_copy <- unserialize(serialize(x, version = 3, connection = NULL))
pryr::object_size(x_copy)
#> 2.08 MB

x_pickle <-
  pickleR::unpickle(pickleR::pickle(x, connection = NULL))

pryr::object_size(x_pickle)
#> 232 kB
# I believe the size is smaller because the pickle functions
# only take the immediate enclosing environment of the function
# and reconstitute it with an emptyenv() as its parent.

Works against R6 classes too (smaller version of my previous R6 example):

pryr::object_size(R6_x)
#> 1.28 MB

R6_x_copy <-
  unserialize(serialize(R6_x, connection = NULL))

pryr::object_size(R6_x_copy)
#> 17.8 MB

R6_x_pickle <-
    pickleR::unpickle(pickleR::pickle(R6_x, connection = NULL))

pryr::object_size(R6_x_pickle)
#> 1.28 MB

It is at least an order of magnitude slower to 'pickle' the nested classes, but that's not a critical blocker for me at the minute.

wch commented 4 years ago

@d-sharpe I took a quick look at your pickleR package. I think you could write your own serialization/deserialization functions that could work on any object, not just R6 objects that have been customized with the get_state and set_state functions.

The lobstr package's size-computing code is implemented in C++ and might provide some useful guidance: https://github.com/r-lib/lobstr/blob/master/src/size.cpp
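
As a rough illustration of the address-tracking idea discussed above, the sketch below deduplicates the top-level elements of a list before serializing and re-links them afterwards. It assumes the lobstr package is installed and uses its `obj_addrs()` and `obj_size()` functions; `pack_shared()` and `unpack_shared()` are hypothetical names, not part of pickleR, R6, or lobstr, and the sketch does not recurse into nested lists, environments, or functions.

```r
library(lobstr)  # assumed to be installed

# Store each distinct top-level element once, plus an index that says
# which stored element belongs at each position of the original list.
pack_shared <- function(lst) {
  addrs <- obj_addrs(lst)   # memory address of each element
  keys <- unique(addrs)
  list(
    pool  = unname(lst[match(keys, addrs)]),
    index = match(addrs, keys),
    names = names(lst)
  )
}

# Rebuild the list; repeated index values point at the same object again.
unpack_shared <- function(packed) {
  out <- packed$pool[packed$index]
  names(out) <- packed$names
  out
}

x <- rnorm(1e5)
list_of_x <- list(x, x)

packed_bytes <- serialize(pack_shared(list_of_x), connection = NULL)
restored <- unpack_shared(unserialize(packed_bytes))

# Both elements of the restored list share one vector again.
n_distinct <- length(unique(obj_addrs(restored)))
n_distinct
#> [1] 1

obj_size(restored)
```

A general-purpose version would need to walk nested structures and attributes in the same way, which is where most of the cost would likely come from.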