Team discussion on OOP system in R

yuchuan2016 commented 7 years ago

@gentzkow wants us to put our heads together to discuss which OOP system we want to use to create our gslab_r packages. Our ultimate goal is to migrate the MATLAB classes gslab_model, gslab_mle and gslab_mde to R packages. The discussion should get everyone familiar with the OOP system in R, and decide which system we want to use to create our packages.

I will briefly summarize what I have learned regarding to different systems in R and my current implementation of gslab_model in my next comment. We should look at the whole gslab_model, gslab_mle, and gslab_mde system and think about which features will be hard to implement in one or the other.

yuchuan2016 commented 7 years ago

There are at least three different approaches to object oriented programming in R: S3, S4, and RC (Reference classes, or R5). Some useful references for these systems are below:

S3 and S4 are similar, in the sense that methods belong to generic functions, instead of classes. In S3, you first create a class by defining a function, where you specify a list of fields and assign the object a class name. Next you create a generic function, and then different methods based on the class of its inputs. For example, plot is a S3 generic function. When you type methods(plot), it will display different methods of plot:

Then when you call plot(object), it will first determine the class of the object, and then use method dispatch to call the right method plot.xx associated with the generic function. See this gist as an example of the class ModelData in S3 system.

S4 system has more formal definitions of fields and inheritance structure. It uses setClass to define a class, instead of creating a function and assigning the object a class name. It has slots (parallel to properties in MATLAB) and contains (parallel in < in MATLAB) in class definition. But methods also belong to generic functions. You need to add signature in each specific method to tell the generic function that this method is called if and only if the input object belongs to the particular class. See this as an example of the class ModelData in S4 system.

RC system is more similar to other OOP languages, where methods belong to classes. Another major difference is that RC objects are mutable, so if you set b=a and modify b, a will also be modified. And if you call some method, even if the method throws some error, the underlying object can be already modified. See this as an example of the class ModelData in RC system.

There is actually another system R6, but @gentzkow thinks it sounds like not heavily used / maintained. Between S3 and S4, I personally prefer S4, since it has more formal structures and not substantively more code, and I'm a little worried about the inheritance in S3. I have created the three basic classes in gslab_model (ModelData, Model, and ModelEstimationOutput) and the corresponding unit tests in both S4 and RC systems, currently in the issue2-S4 and issue2-RC branches.

Some principles to pick a system is here. @arosenbe @stanfordquan @M-R-Sullivan @Shun-Yang , when you have time, can you go over my brief introduction and other references, as well as the MATLAB gslab* classes and R classes I created in two branches, to see if you have any thoughts on the pros/cons of these systems? Particularly, since we want to maintain our previous structure, one possible comparison would be which features will be hard to implement in one or the other. Thanks!!

@gentzkow , do you want take a look at above to see if it's what in your mind?

gentzkow commented 7 years ago

Thanks @yuchuan2016 et al.

To be clear, we want to focus on considering what will be best for porting over the GSLabModel framework not what will be best in the abstract. So the focus should be on thinking through the way those classes work and evaluating what will allow us to reproduce that functionality as seamlessly and elegantly as possible.

M-R-Sullivan commented 7 years ago

@yuchuan2016 - My preference would be S4, although I think S3 could also work. I like S4 because:

RC's shallow copying could lead to mistakes and frustration, especially when we program in the functional style that I think is best suited for these model classes;
Inheritance is more explicit in S4 than S3. The GSLab model framework could get a bit confusing if we didn't fully enumerate class hierarchies as it seems S4 would force us to do.
S4 provides a good framework for checking object validity. The MATLAB implementation has functions for this purpose that we could re-write as S4 validity checks quite easily.
S4 generally seems a lot safer than S3.

qlquanle commented 7 years ago

From my limited knowledge of the systems above, I prefer S4 over S3. The slackness in S3 over object types seems quite dangerous. In the example of BLP, it seems like something along the line of the pseudocode below would be possible with S3:

blp_demand_data <- stuff

#by accident we set
class(blp_demand_data) <- "supply_data"

#the following may or may not crash the code, but you may never know
estimate_demand(blp_demand_data)

Also, multiple dispatch in S4 seems very useful. e.g. for the Sensitivity paper we can have pseudocode:

blp_demand_data <- stuff
blp_supply_data   <- more_stuff

class(blp_demand_data) <- c("blp", "demand")
class(blp_supply_data) <- c("blp", "supply")

# these two lines can dispatch different methods in S4 but they only see BLP in S3
estimate_sensitivity(blp_demand_data)
estimate_sensitivity(blp_supply_data)

let me know if I'm not making sense!

arosenbe commented 7 years ago

I'm slightly in favor of RC only because it's more pythonic. I agree that S3 doesn't provide enough structure to use the models in the way we like to think of them. So my comparison below is between RC and S4.

I started out by rigging up a representative procedure in the two systems. Assign a matrix to input, multiply all the values by 2 and flatten to a vector, and assign the vector to output.

The RC system sets up the class object similarly to python. The object is assigned a name, a list of fields (attributes), and a list of methods. Fields can be assigned when the object is instantiated, overwritten later on, and modified by methods through the <<- operator.

# RC implementation
rc <- setRefClass('rc', 
        fields = list(
          input = "matrix", 
          output = "matrix"
        ), 
        methods = list(
          estimate = function(x){
            output <<- input * 2 # Biased and inconsistent for theta
          }
        )
      )

The S4 implementation feels different. The object has a name and a list of slots (attributes). Slots can be filled by the same actions as for RC objects. The big difference, to me, is that methods belong to functions, and classes are only "told" that they have access to specific methods. They're not build in.

# S4 implementation
s4 <- setClass('s4', 
        slots = list(
          input = "matrix", 
          output = "matrix"
        )
      )
setGeneric('estimate', function(x){
  standardGeneric('estimate')
})
setMethod('estimate', 
          c(x = 'matrix'), 
          function(x){
            out <- x * 2
            return(out)
          }
        )

The other big difference between the systems is that RC edits objects in place. This is a very pythonic approach to objects. It can be a little confusing, but is faster, all else equal. However, I don't think either aspect of in place editing should have particularly outsized effect on our decision since self-sufficient model objects should never need to (i) point at one another or (ii) be redundantly instantiated multiple times within a script.

Confusion would occur in the following (pseudo-coded) situation because object b only points to whatever is stored in object a.

a = 1
b = a 
a = 2
print(b) # 2

Note that this is only an issue between objects a and b, not for attributes within an object.

a.a = 1
a.b = a.a
a.a = 2
print(a.b) # 1

In terms of speed, the two seem comparable, S4 might even be a little faster. Here's what happens when I feed in a large matrix.

library(microbenchmark)

# Sample data
big <- runif(5000)
big <- rbind(rep(big, 100000)) #3.7 GB

# RC implementation
rc_big <- rc$new(input = big)
microbenchmark(rc_big$estimate(rc_big$input))
#  min       lq       mean     median     uq      max   neval
#  1.957736 2.102043 2.519555 2.22633 3.012913 3.878841   100

# S4 implementation
s4_big <- new('s4', input = big)
microbenchmark(s4_big@output <- estimate(s4_big@input)) 
#  min       lq       mean     median     uq       max   neval
#  1.954399 2.034408 2.412718 2.345522 2.695866 3.450331   100

I feel more strongly about (the concept behind) RC after reading the second resource @yuchuan2016 links to below.

Often I have objects that represent model/data combinations for which the parameter estimates are to be determined by optimizing a criterion. In those cases it makes sense to me to use reference classes because the state of the object can be changed by a method. I want to update the parameters in the object and evaluate the estimation criterion without needing to copy the entire object.

The quote is from Douglas Bates, a member of the R core team since its inaugural year. He continues:

If you try to perform some kind of update operation on an S4 object and not cheat in some way (i.e. adhere to strict functional programming semantics) you need to create a new instance of the object each time you update it. When the object is potentially very large you find yourself worrying about memory usage if you take that route. I found that my code started to look pretty ugly because conceptually I was updating in place but the code needs to be written as replacements.

I've found some ways to minimize side effects with RC objects after spending a bit of time with them. The key aspect is that the RC implementation is an S4 class with a slot reserved for a "cached" environment. This is what makes the RC objects mutable: they operate on and call from their unique locations in memory. (It's also why methods belong to objects and not generic functions in the global environment.)

Consider the class below. It has two fields: raw_data and output. Its scale method takes in a scalar_value and fills the output field with the scaled value of raw_data.

C <- setRefClass('C', 
                  fields = list(
                    raw_data = "numeric", 
                    output = "numeric"
                  ), 
                  methods = list(
                    scale = function(scalar_value){
                      raw_data <- raw_data * scalar_value # Don't assign globally 
                      output <<- raw_data # Do assign globally
                    }
                  )
                )

The key behavior here is that the scaled value of raw_data will not overwrite the value of raw_data in the object when the scale method is called. This is because the scaled value is assigned locally within the function environment. It is not assigned globally to the environment reserved for an object of this class.

ob <- C$new(raw_data = 1)

ob$raw_data # 1
ob$output # numeric(0)

ob$scale(10)

ob$raw_data # 1
ob$output # 10

The local assignment tact only guarantees safe assignments within an object's methods. Problems can still arise when the object is passed to a function. Even within a function environment, values for an object's fields are stored in the environment reserved for the object.

# continued
fun <- function(ob){
  ob$raw_data <- 7
  ob$scale(10)
  return(1)
}

a <- fun(ob)

ob$raw_data # 7
ob$output # 70

This is the exact same way that python works, except we have finer control over which fields can be overwritten by a method through the distinction between the <<- and <- operators.

yuchuan2016 commented 7 years ago

It seems we have reached the consensus that we won’t use S3. I do not have a strong preference over S4 and RC. One reason for S4 is the in place change might cause some confusion when coding interactively. Also RC seems to be used less common in base R (partly due to the fact it was introduced in 2011). There are 400 questions in StackOverFlow under [r]+[s4] tags, and 134 questions under [r]+[reference-class]. One reason for RC is this answer, which seems to be similar to our situation.

Just for reference, setRefClass is defined using setClass.

setRefClass (click to expand)

``` function (Class, fields = character(), contains = character(), methods = list(), where = topenv(parent.frame()), inheritPackage = FALSE, ...) { fields <- inferProperties(fields, "field") info <- refClassInformation(Class, contains, fields, methods, where) superClasses <- refSuperClasses <- fieldClasses <- fieldPrototypes <- refMethods <- NULL for (what in c("superClasses", "refSuperClasses", "fieldClasses", "fieldPrototypes", "refMethods")) assign(what, info[[what]]) classFun <- setClass(Class, contains = superClasses, where = where, ...) classDef <- new("refClassRepresentation", getClassDef(Class, where = where), fieldClasses = fieldClasses, refMethods = as.environment(refMethods), fieldPrototypes = as.environment(fieldPrototypes), refSuperClasses = refSuperClasses) .setObjectParent(classDef@refMethods, if (inheritPackage) refSuperClasses else NULL, where) assignClassDef(Class, classDef, where) generator <- new("refGeneratorSlot") env <- as.environment(generator) env$def <- classDef env$className <- Class .declareVariables(classDef, where) value <- new("refObjectGenerator", classFun, generator = generator) invisible(value) } ```

@M-R-Sullivan @arosenbe , if either of you has been convinced/changed you preference, please update :grinning:

@gentzkow , based on our discussion during lunch yesterday and the above comments, do you have any preference on S4 and RC now?

gentzkow commented 7 years ago

Before I weigh in, why don't you guys agree on a summary recommendation (S4 vs RC) and a compiled list of the pros / cons.

--

Matthew Gentzkow Professor of Economics Stanford University

M-R-Sullivan commented 7 years ago

@yuchuan2016 - My thoughts:

Summary: I would prefer S4 if we plan on faithfully porting the structure of the MATLAB model libraries, and RC if we want to make classes that envelop the current data, estimation output, estimation options, etc. classes that exist for each type of model.
In the current set up, we conduct estimation with lines like est = model.Estimate(data, estopts). S4 seems better suited for this functional style. Also, S4's treatment of methods outside the class definition seems appropriate for our uses here, as our model classes are effectively estimation method definitions outside class definitions. If we decide to use S4, I would support replacing the model classes with various Estimate methods for the different data classes. Multiple dispatch could be useful here if we call these methods on instances of various data and estimation option classes.
In an alternative set-up, we could store data, estimation output, and estimation options as attributes of a single class. In this case, it would make more sense to use RC and to "fill up" instances of this class with method calls like model.estimate(), model.simulate(), etc.. Making copies like model <- model.estimate(), as we would do under S4, seems a lot less natural under this alternative set-up.
I would rather keep the current set-up and use S4, mainly because I'm a fan of the functional paradigm, but my preference isn't that strong. I think similarity to Python could be a point in favour of RC and that a larger user base is a point in favour of S4.

yuchuan2016 commented 7 years ago

@M-R-Sullivan @stanfordquan @arosenbe @Shun-Yang , thanks for all your valuable input!! I have combined your comments above. Please take a look to see if there is anything that you disagree with or missing from the summary. Feel free to edit. Then we can send to Matt as the recommendation summary. Thanks a lot!

Evaluate Object-oriented programming systems in R

We want to build our gslab_r packages in R. There are three major OOP systems in R: S3, S4 and RC. S3 is pretty slack over object types and does not have a formal definition of inheritance, which seems somewhat dangerous and doesn't provide enough structures to use the models in the way we like. So this document compares the rest two systems: S4 and RC.

Methods:

S4: Methods belong to generic functions, which is unique in R. Methods are dispatched based on classes of input.
RC: Methods belong to classes similar to other OOP languages.

Dispatch:

S4: Methods can be dispatched based on classes of multiple arguments.
RC: Methods work on the classes they belong to.

Edit:

S4: Objects are static and cannot be modified by methods. The object needs to be copy-assigned like obj <- edit(obj). It's safer in the sense that the underlying object won't be modified when methods are called, but a new instance of the object needs to be created each time it is updated, which seems less natural and can cause potential speed and memory issues when the object is large.
RC: Objects can be edited directly using methods like obj$edit, which is a very pythonic approach to objects. The attributes of the object can be updated without copying the entire object. The mutability can cause some unanticipated confusion/trouble, but can be minimized by correctly using local assignment <- and field change <<-.

Performance:

Similar in the speed of handling large datasets.

Usage:

More users use S4, but it may come from the fact that RC is only introduced in 2011.

Summary:

Based on above, we propose different ways under two systems

If we use RC, we create one class for each type of model, envelop data/estimation output/options as attributes in this class, and fill up instances of this class by calling model$estimate(), pretty similar to what we did in Python. To minimize the side effect of inplace change, we use local assignment <- when we do not want to modify some fields.
If we use S4, we create different data classes and estimation output classes, and create various methods in the Estimation function to perform on different data classes using est = Estimate(data, estopts). Multiple dispatch could be useful if we call these methods on instances of different data and estimation option classes. This different structure seems more in spirit with the fact that the model classes are effectively estimation definitions outside the class definitions.

yuchuan2016 commented 7 years ago

@gentzkow , see the above comment for the recommendation summary. More details can be found in Adam, Michael and Quan's comments.

gentzkow commented 7 years ago

Thanks @yuchuan2016, @arosenbe, @M-R-Sullivan, @stanfordquan. This is super clear.

I would vote that we settle on using RC as our default, and in particular that we use it for migrating the GSModel libraries. I suspect the benefits of consistency w/ Python and Matlab OOP will outweigh the potential disadvantages.

@yuchuan2016: When we do this migration, I suspect we will no want to combine data and estimation output into a single class (as suggested in the comment above). By default we should keep the structure identical to the Matlab version. If/when we are working on this and you see places you think we should deviate from the Matlab structure we can discuss.

I will go ahead and close this issue.

gslab-econ / gslab_r