matloff / TidyverseSkeptic

An opinionated view of the Tidyverse "dialect" of the R language.

vague discussion of speed with dplyr v. data.table #12

Open ljanda opened 5 years ago

ljanda commented 5 years ago

Your post repeatedly states that dplyr is slower than data.table, with qualifiers such as "much, much" slower, but you never define what "much slower" means or explain when dplyr is actually slower. You link out to a couple of examples, but since this is a large part of your argument, that information should be given clearly within the post itself.

You do pull some numbers from the H2O.ai benchmark site, but your table is vague: it doesn't show which operations are being timed, the size of the dataset, or that the values are in seconds, and you don't consistently pull the smallest number.

dplyr   data.table
 37.3         9.07
 95.5         9.20
496          11.9

I would redo it this way. Here are a couple of examples of grouped operations, run on the largest dataset tested at the link above. All timings are in seconds:

Dataset used: 1,000,000,000 rows x 9 columns (50GB)
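
(For anyone who wants to try these at a smaller scale, here is a rough sketch of how a stand-in dataset with the columns used below (id1, id2, id3, v1, v3) could be generated. The row count and group cardinalities are placeholders, not the actual benchmark generation script.)

library(data.table)
library(dplyr)

# Rough stand-in for the benchmark data; sizes are placeholders
n  <- 1e7                                       # the real benchmark uses 1e9 rows
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:100), n, replace = TRUE),       # low-cardinality group
  id2 = sample(sprintf("id%03d", 1:100), n, replace = TRUE),
  id3 = sample(sprintf("id%07d", 1:(n / 100)), n, replace = TRUE), # high-cardinality group
  v1  = sample(1:5, n, replace = TRUE),
  v3  = round(runif(n, max = 100), 6)
)
DF <- as_tibble(DT)   # identical data for the dplyr timings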

function 1:

# data.table
DT[, .(v1 = sum(v1)), by = id1]

# dplyr
DF %>%
  group_by(id1, .drop = TRUE) %>%
  summarise(sum(v1))

function 2:

# data.table
DT[, .(v1 = sum(v1)), by = .(id1, id2)]

# dplyr
DF %>%
  group_by(id1, id2, .drop = TRUE) %>%
  summarise(sum(v1))

function 3:

# data.table
DT[, .(v1 = sum(v1), v3 = mean(v3)), by = id3]

# dplyr
DF %>%
  group_by(id3, .drop = TRUE) %>%
  summarise(sum(v1), mean(v3))

groups                                                      dplyr (s)   data.table (s)
100 ad hoc groups of 10m rows, result 100 x 2                    35.4             9.07
10,000 ad hoc groups of 100,000 rows, result 10,000 x 3          97.5             9.20
10m ad hoc groups of 100 rows, result 10m x 3                   496              11.9

Note: the smaller datasets tested (10,000,000 rows x 9 columns, 0.5 GB, and 100,000,000 rows x 9 columns, 5 GB) also ran faster with data.table, but the difference was often only a few seconds.

I often use large-ish datasets (over a million rows, though usually just a few million, with tens of variables) and have had very limited speed issues with dplyr. It is problematic to simply state that one should use data.table because it is faster, without quantifying that claim. The data.table speed v. syntax tradeoff may not be worthwhile, especially for the many users who work at sizes where the speed difference is negligible. In some cases data.table is an excellent option (I especially love fread()), but the post doesn't demonstrate the gains clearly or explain when it is most useful.
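
To make the "quantify it" point concrete, here is a minimal sketch of the kind of timing I have in mind, on a made-up table of a few million rows (the column names and sizes are placeholders; actual numbers will depend on hardware and package versions):

library(dplyr)
library(data.table)
library(microbenchmark)

n  <- 5e6                                  # "a few million rows"
DF <- tibble(id = sample(1:1000, n, replace = TRUE),
             v  = runif(n))
DT <- as.data.table(DF)

# Time the same grouped sum both ways and report actual seconds,
# rather than an unqualified "much, much slower"
microbenchmark(
  dplyr      = DF %>% group_by(id) %>% summarise(total = sum(v)),
  data.table = DT[, .(total = sum(v)), by = id],
  times = 10L
)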

Finally, the graph you include after the table is fairly inscrutable (though it looks dramatic, since it slopes up quickly). I would recommend rerunning it and plotting seconds rather than the confusing ratio. I went ahead and grabbed the code from the paper you cited and edited it to produce the suggested graph; see below. I'd also point out that this example is odd/niche, since it selects one variable out of an increasing number of variables, up to ~100,000 of them (always with only 5 rows), and it is rare to work with such a large number of variables. The time for dplyr::select() only exceeds one second at 10,000 variables, which would not matter in many cases.

library("dplyr")
library("ggplot2")
library("rqdatatable")
library("cdata")
library(microbenchmark)
packageVersion("dplyr")
R.version

f <- function(k) {
  d <- rep(list(1:5), k)
  names(d) <- paste0("col_", seq_len(k))
  d <- data.frame(d)
  rownames(d) <- NULL
  d <- tbl_df(d)
  gc()
  tm <- microbenchmark::microbenchmark(
    select(d, col_1),
    d[, "col_1", drop = FALSE],
    times = 3L
  )
  td <- data.frame(tm)
  td$ncol <- ncol(d)
  td
}

times <- lapply(2^(0:17), f)
times <- data.frame(data.table::rbindlist(times))
times$seconds <- times$time/1e9

d1 <- 
  times %>% 
  mutate(b = ifelse(str_detect(expr, "select"), "dplyr", "base"), 
         b = factor(b))

ggplot(data = d1, aes(x = ncol, y = seconds, color = b)) +
  geom_point() + 
  geom_smooth(se = FALSE) + 
  scale_x_log10() + 
  scale_y_log10() +
  scale_color_brewer(palette = "Dark2") +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Number of columns", y = "Seconds", 
       title = "Time to extract first column, dplyr::select() over base R [, ]") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank()) 

[Figure: time to extract the first column, dplyr::select() vs. base R [, ], plotted in seconds against number of columns (log-log scale)]

If you adjust this example to increase the number of rows instead of columns, and bump the maximum up to over ten million rows, which is a much more realistic scenario, the speed gain from base R is negligible, even at 10 million rows:

# Same comparison, but growing the number of rows (5 columns, k rows)
f <- function(k) {
  d <- rep(list(1:k), 5)
  d <- data.frame(d)
  names(d) <- paste0("col_", seq_len(ncol(d)))
  gc()
  tm <- microbenchmark::microbenchmark(
    select(d, col_1),
    d[, "col_1", drop = FALSE],
    times = 3L
  )
  td <- data.frame(tm)
  td$nrow <- nrow(d)
  td
}

# Run for 1 to ~16.8 million rows and convert nanoseconds to seconds
times <- lapply(2^(0:24), f)
times <- data.frame(data.table::rbindlist(times))
times$seconds <- times$time / 1e9

# Label each timing as dplyr or base R
d1 <-
  times %>%
  mutate(b = ifelse(str_detect(as.character(expr), "select"), "dplyr", "base"),
         b = factor(b))

ggplot(data = d1, aes(x = nrow, y = seconds, color = b)) +
  geom_point(alpha = 0.5, size = 2) +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Number of rows", y = "Seconds",
       title = "Time to extract first column, dplyr::select() vs. base R [, ]") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())

[Figure: time to extract the first column, dplyr::select() vs. base R [, ], plotted in seconds against number of rows]

matloff commented 5 years ago

I agree that the H2O presentation is incomplete. But there are various other analyses, and it is a generally known fact that even RStudio concedes.

ljanda commented 5 years ago

The issue still holds: you are hand-waving and posting unclear tables and graphs. The time differences are often negligible at the dataset sizes many people work with. Also, R has challenges with "big data" regardless of the package used (which is why people turn to parallel-computing packages, Spark with R, etc.). It is irresponsible to cherry-pick, obscure information, and post numbers and graphs without the relevant details.
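
To be concrete about the "Spark with R" point: when data really outgrows a single machine, this kind of grouped summary is typically pushed to a cluster through something like sparklyr rather than run locally in dplyr or data.table. A minimal sketch, assuming a local Spark installation and using the nycflights13 data purely as a placeholder:

library(sparklyr)
library(dplyr)

# Placeholder local connection; a real deployment would point at a cluster
sc <- spark_connect(master = "local")

# Copy a small example table up (real workflows would read data on the cluster)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# The familiar dplyr verbs are translated to Spark SQL and run remotely
flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)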

pmarchand1 commented 5 years ago

Does the comparison rely on the column being in the first position?

ljanda commented 5 years ago

@pmarchand1 Yes, the first example above (which @matloff cited and I pulled the code from) does; in the code it is "col_1". Link: https://github.com/WinVector/Examples/blob/master/dplyr/select_timing.Rmd
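
If anyone wants to check the position dependence directly, here is a minimal variation on the same benchmark (the 10,000-column width is an arbitrary choice) that times extraction of the last column instead of the first:

library(dplyr)
library(microbenchmark)

k <- 10000                               # arbitrary width for the check
d <- rep(list(1:5), k)
names(d) <- paste0("col_", seq_len(k))
d <- data.frame(d)

# Same comparison as in the timing code above, but on the last column
microbenchmark(
  select(d, col_10000),
  d[, "col_10000", drop = FALSE],
  times = 3L
)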

matloff commented 5 years ago

See my earlier comment.