mkao006 / r_style_fao

This is the repository for the R programming guideline of the Food and Agricultural Organization of the United Nations.

plyr instead of for? #3

Open malexan opened 10 years ago

malexan commented 10 years ago

To achieve more elegant code, I would encourage users to use plyr's functions instead of for loops.

An additional bonus from plyr:

mkao006 commented 10 years ago

I am not sure which function in plyr you are referring to; maybe you mean the foreach package?

Hadley's packages are great tools for beginners, as they are simple and generate consistent output. However, the drawback is that they are generally inefficient.

malexan commented 10 years ago

Yes, I meant foreach: any **ply function from the plyr package can run in parallel with foreach as the backend.
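For the record, this is roughly how plyr's .parallel switch is used with a foreach backend (a minimal sketch with a toy data frame; the cluster size and data here are just placeholders):

```r
## Sketch: parallel ddply via the foreach/doParallel backend.
library(plyr)
library(doParallel)  # attaches foreach and parallel

cl <- makeCluster(2)     # two worker processes (arbitrary choice)
registerDoParallel(cl)

## toy data: any data.frame with a grouping column works
df <- data.frame(g = rep(letters[1:4], each = 25), x = rnorm(100))

## .parallel = TRUE dispatches the per-group work through foreach
res <- ddply(df, .(g), summarise, total = sum(x), .parallel = TRUE)

stopCluster(cl)
```

For small groups like this the parallel overhead usually outweighs the gain; it pays off when the per-group function is expensive.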

I don't know about the inefficiency of the plyr package because I haven't compared. But some components of plyr are written in C.

mkao006 commented 10 years ago

Yes, unfortunately there is some work that can't be parallelised, and for loops are essential there. An example is an implementation of the EM-algorithm, which cannot be split apart since it is sequential. We are looking into writing some of this using Rcpp.
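To illustrate the point (with a made-up update rule, not a real EM step): each iterate depends on the previous one, so the iterations cannot be distributed across workers the way independent **ply groups can.

```r
## Toy sketch of an inherently sequential fixed-point loop.
theta <- 0                             # initial guess (hypothetical)
for (i in 1:100) {
    theta.new <- 0.5 * (theta + 3)     # stand-in for the E+M update
    if (abs(theta.new - theta) < 1e-8) break   # converged
    theta <- theta.new                 # next iterate needs this value
}
theta   # converges toward the fixed point 3
```

Because theta.new is a function of theta from the previous iteration, there is no independent work to hand to foreach.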

Here is a toy benchmarking example I did a while ago, in which data.table is around 4 times the speed of ddply. There are also other benchmarks around the web, and one of Hadley's goals is to make **ply as fast as data.table.

I would still, however, train the staff with **ply and data.frame first.

########################################################################
## Title: Benchmark applying sum functions
## Date: 2013-02-16
########################################################################

library(data.table)
library(plyr)
library(FAOSTAT)
library(rbenchmark)    # provides benchmark()

test.df = getFAO(name = "Value")
test.dt = data.table(test.df)
setkey(test.dt, FAOST_CODE, Year)

## Sum the Value column by Year, four ways:
dtSum = function(dt){
    dt[, j = sum(Value), by = "Year"]
}

aggSum = function(df){
    aggregate(df$Value, by = list(df$Year), FUN = sum)
}

plyrSum = function(df){
    ddply(.data = df, .variables = .(Year),
          .fun = function(x) sum(x$Value))
}

tapSum = function(df){
    tapply(df$Value, df$Year, sum)
}

benchmark(dtSum(test.dt), aggSum(test.df), plyrSum(test.df),
          tapSum(test.df), order = "relative", replications = 100)

mkao006 commented 10 years ago

Also, data.table is extremely efficient in memory usage.
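Part of that memory efficiency comes from updating by reference with :=, which adds or modifies a column in place instead of copying the whole table the way df$col <- ... on a data.frame can. A minimal sketch with toy data:

```r
## Sketch: data.table by-reference update (no full-table copy).
library(data.table)
dt <- data.table(Year = rep(2010:2012, each = 2), Value = 1:6)
dt[, Total := sum(Value), by = Year]   # new column added in place
dt
```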

mkao006 commented 10 years ago

http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/

Here is a slightly more comprehensive review. I think it would be great if you could train them in **ply and Hadley's tools, since they are generally more elegant and consistent, then move to other tools when efficiency is required.

malexan commented 10 years ago

Yes, we can adopt plyr, reshape2, stringr, lubridate, and httr.
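A quick taste of two of those packages, with toy inputs, to show the kind of consistency they bring:

```r
## stringr: vectorised string matching with predictable output
library(stringr)
str_detect(c("wheat", "rice"), "ea")   # TRUE FALSE

## lubridate: readable date parsing and arithmetic
library(lubridate)
ymd("2013-02-16") + months(1)          # "2013-03-16"
```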

The testthat package also looks very promising for reliable code, but I don't have enough experience with it yet.
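For reference, a minimal testthat sketch (the function name and data here are hypothetical, just to show the test shape):

```r
## Sketch: a unit test with testthat.
library(testthat)

## hypothetical helper: sum Value within each Year
yearSum <- function(df) tapply(df$Value, df$Year, sum)

test_that("yearSum sums Value within each Year", {
    df <- data.frame(Year = c(2010, 2010, 2011), Value = c(1, 2, 3))
    expect_equal(as.numeric(yearSum(df)), c(3, 3))
})
```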