matloff / TidyverseSkeptic

An opinionated view of the Tidyverse "dialect" of the R language.

simplistic example of base R v. dplyr #10

Open · ljanda opened this issue 5 years ago

ljanda commented 5 years ago

You write:

The Tidyverse also makes heavy use of magrittr pipes, e.g. writing the function composition h(g(f(x))) as

f(x) %>%  g() %>% h()

Again, the pitch made is that this is "English," in this case reading left-to-right. But again, one might question just how valuable that is, and in any event, I personally tend to write such code left-to-right anyway, without using pipes:

a <- f(x)
b <- g(a)
h(b)

This simplistic example does not demonstrate the pain point of stopping and assigning rather than piping, nor the improved readability that piping brings, as demonstrated below:

library(tidyverse)
library(knitr)
library(kableExtra)

data(diamonds)

# tidyverse

diamonds %>%
  group_by(cut) %>%
  summarise(Q1 = round(quantile(price, 1/4), 2),
            Median = round(median(price), 2),
            Mean = round(mean(price), 2),
            Q3 = round(quantile(price, 3/4), 2),
            Max = round(max(price), 2)) %>%
  kable(format = "html", format.args = list(big.mark = ','),
        col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max")) %>%
  kable_styling(full_width = FALSE, position = "left")

# base R - there are several ways to do this; this is a shorter one

diamonds_split <- split(diamonds, f = list(diamonds$cut))

result <- do.call(rbind, lapply(diamonds_split, function(x) {
  data.frame(Q1 = round(quantile(x$price, 1/4), 2),
             Median = round(median(x$price), 2),
             Mean = round(mean(x$price), 2),
             Q3 = round(quantile(x$price, 3/4), 2),
             Max = round(max(x$price), 2))
}))

result <- data.frame(cut = row.names(result), result)

k1 <- kable(result, format = "html", format.args = list(big.mark = ','),
          col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max"))

kable_styling(k1, full_width = FALSE, position = "left")

As you can see, the base R approach requires a deeper understanding of functions, a tolerance for less clear syntax, and repeated stopping and assigning rather than piping.

matloff commented 5 years ago

You've never heard of tapply()?

wbuchanan commented 5 years ago

I haven’t heard of tapply() and also fail to see how it is a relevant response, given that you don’t seem to use it in the example referenced above. Maybe providing a counterexample would be more pedagogically useful than responding with a question about an obscure function?

ljanda commented 5 years ago

The issue that you used a trivial example and did not represent piping well still holds. Instead of addressing the issue, you're trying to make me feel bad for not using a different approach. For what it's worth, I used tapply() before I had even heard of the tidyverse, and I clearly stated that there are multiple solutions in base R.

Here is a link to many examples of base R and tidyverse code comparisons, in general you can see that the code is more readable and pipes are useful (which becomes even more apparent when combining several functions to clean a dataset): https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html

Also, you state that debugging is harder with pipes - this is not true, since you can easily run smaller parts of piped code.
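For instance, a minimal sketch (reusing the diamonds pipeline from above) of running just a prefix of a chain:

# select and run only the first steps of the pipeline, stopping before
# the kable() calls, to inspect the intermediate grouped summary
diamonds %>%
  group_by(cut) %>%
  summarise(Median = round(median(price), 2))
# once the output looks right, run the full pipeline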

DavidArenburg commented 5 years ago

As I see this, @matloff was only addressing the extensive use of pipes rather than comparing the whole tidyverse vs. base R (in this particular example at least), while you provided a specific example containing grouping operations - which, I think most of us will agree, is not base R's strongest side. This is (probably) one of the reasons data.table was created, too. I think your concern about "simplicity" should be addressed to the above comparison of data.table vs. dplyr.

On the other hand, if we stick with dplyr (tidyverse) vs. base R, we could also bring up many examples where base R is much simpler than the dplyr idiom - you just conveniently picked one that matches the point you are trying to make, @ljanda.

wbuchanan commented 5 years ago

@DavidArenburg, it would be useful to provide examples to support your case rather than speaking in generalities. You claim that there are many examples where base R functions are much simpler than the dplyr idiom, but you provide no examples of such cases, nor any description of what qualifies something as simpler from your perspective. Essentially, you are picking convenient phrases with overly general terms to support a non-falsifiable claim. It also isn’t clear which above comparison between data.table and dplyr you are referencing, as you seem to be the only one who has mentioned data.table in this issue; perhaps provide a reference to the comment/thread/issue you have in mind?

ljanda commented 5 years ago

@DavidArenburg my example doesn't just include grouping operations - it also has the kable and kableExtra styling to render a nice table, showing that you can pipe the output of the grouping into the table-styling functions, whereas without pipes you have to stop and assign multiple times. My point was that @matloff used a trivial example rather than something meatier that actually shows a difference between the tidyverse and base R. I could have given even more complicated examples using the full suite of dplyr functions and piping (e.g. selecting a few variables, mutating them, grouping, then summarizing, without having to stop to assign once), but here I gave a fairly simple one.

matloff commented 5 years ago

@ljanda: Your point about debugging is exactly what I am saying: It's better to break things up as in my example.

DavidArenburg commented 5 years ago

@ljanda I don't see anything special about making intermediate assignments. And I don't think pipes are really tied to the tidyverse anyway - you can pipe base R and data.table too if you really want to. Nor do I think (my own opinion) that pipes are bad; sometimes they are even useful. But after spending about 5 years seeing all kinds of questions and answers on StackOverflow, I see that, in general, pipes are abused by tidyverse users all the time.

For instance, I find this absolutely ridiculous. I mean, dataframe %>% select(text) %>% unlist() %>% .[4]? Seriously? Is plain dataframe[4, "text"] not cool anymore? I see this nonsense all over the place.
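To make the comparison concrete, a minimal sketch (df and its text column are made up purely for illustration) showing that the two expressions return the same value:

library(dplyr)

df <- data.frame(text = letters[1:5], stringsAsFactors = FALSE)

df %>% select(text) %>% unlist() %>% .[4]  # the piped version: "d"
df[4, "text"]                              # plain base R indexing: "d"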

matloff commented 5 years ago

@ljanda, thanks for the Tavares reference. The example is indeed one in which tapply is much clearer, more compact and more straightforward. I've added it to my essay.
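A minimal sketch of the tapply() style being referred to, applied to the diamonds summary from earlier in this thread (the helper name price_stats is made up for illustration; the columns and rounding mirror ljanda's example):

library(ggplot2)  # the diamonds data ships with ggplot2

# five summary statistics for a numeric vector, rounded as above
price_stats <- function(p) round(c(Q1 = unname(quantile(p, 1/4)),
                                   Median = median(p),
                                   Mean = mean(p),
                                   Q3 = unname(quantile(p, 3/4)),
                                   Max = max(p)), 2)

# tapply() applies price_stats() to price within each level of cut and
# returns a list of named vectors; rbind stacks them into one matrix
result <- do.call(rbind, tapply(diamonds$price, diamonds$cut, price_stats))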

ljanda commented 5 years ago

With debugging you are breaking things up regardless of whether you're running part of a pipe or parts of unpiped code.

matloff commented 5 years ago

Exactly! You have to revert to base-R to debug. Why not stay there? You'd still get the "read left to right" benefit.

ljanda commented 5 years ago

You don't have to revert to base R, you can just run part of the pipe.

wbuchanan commented 5 years ago

@DavidArenburg There definitely can be differences between pipes and multiple assignments. The pipes are an abstraction layer over the syntax, while multiple assignment means consuming more memory to store additional objects, which may or may not be necessary to keep (temporarily or otherwise). I agree that the example you posted is completely ridiculous, without a doubt. However, I've also encountered cases where analysts create several copies of essentially the same object, or repeatedly overwrite the same object:

masterDF <- merge(Student_Teacher_Link, Student_Attrib, by = c("STUDENT_ID"))
masterDF <- merge(masterDF, Core_Courses,  by = c("TID", "CID", "SCHOOL_YEAR", "SCHOOL_NAME", "SCHOOL_CODE"))
masterDF <- merge(masterDF, Student_Sch_Yr, by = c("STUDENT_ID", "SCHOOL_YEAR"))
names(masterDF)[names(masterDF) == "S_GRADE_CODE"] <- "GRADE_CODE"

################################ Subset Elementary teachers...they are in both math and reading data sets #############
EL <- masterDF[which(masterDF$GRADE_CODE == "01" | masterDF$GRADE_CODE == "02" |
                       masterDF$GRADE_CODE == "03" | masterDF$GRADE_CODE == "04" |
                       masterDF$GRADE_CODE == "05"), ]

######################################## Subset teachers with math/ELA (reading) indicators ########################
forMath <- masterDF[which(masterDF$MATH == "1" & masterDF$CORE == "Yes"), ]
forRead <- masterDF[which(masterDF$ELA == "1" & masterDF$CORE == "Yes"), ]

############################### Stack math/reading with elementary teachers remove dups ##########################
forMath <- rbind(EL, forMath)
forRead <- rbind(EL, forRead)

forRead <- merge(forRead, Student_Scores, by = c("STUDENT_ID", "SCHOOL_YEAR"))
forMath <- merge(forMath, Student_Scores, by = c("STUDENT_ID", "SCHOOL_YEAR"))

That's an example from someone I work with. It isn't representative of the population, but it also highlights an issue with users who aren't terribly versed in programming.

ljanda commented 5 years ago

@wbuchanan this is a great example, and a really good point about multiple assignment consuming more memory.

@DavidArenburg I agree the example you gave is not a great use of pipes and that they can be abused, but the pipe is part of the magrittr package, which is technically part of the tidyverse.

Also, I think it is worth pointing out that people often build up their pipes - I usually add line by line and check outcomes along the way (which actually helps with debugging).

DavidArenburg commented 5 years ago

@ljanda magrittr wasn't originally part of the tidyverse - it was basically contributed to it: https://github.com/tidyverse/magrittr/commit/cf2e33f946ba35ec81f1678e0f2712e521a1c4eb

Regarding your debugging strategy, it basically means that you need to rerun your whole code over and over after adding each line, which will probably be time- and memory-consuming.

DavidArenburg commented 5 years ago

@wbuchanan I think in your example it is better to persist each step, like your co-worker did, instead of piping it all up, which would probably hit an out-of-memory error.

Also, if you work with data.table, each merge can update the data in place, which saves both time and memory and avoids piping.
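A sketch of that in-place idea, reusing the (hypothetical) table names from the example above - strictly speaking it is data.table's update join with := that modifies by reference, while merge() itself still copies:

library(data.table)

setDT(masterDF)        # converts in place, no copy made
setDT(Student_Sch_Yr)

# update join: pull S_GRADE_CODE from Student_Sch_Yr into masterDF by
# reference, instead of allocating a new merged copy of masterDF
masterDF[Student_Sch_Yr, on = .(STUDENT_ID, SCHOOL_YEAR),
         GRADE_CODE := i.S_GRADE_CODE]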

Finally, if someone piped all these joins and then wanted to pipe additional steps, they would need to rerun all of the joins each time they added a step, which would be a time/memory mess.

All in all (if we ignore code cleanliness), piping would probably make it worse (in my opinion, at least).

karoliskoncevicius commented 5 years ago

Feels like this discussion is missing a few things...

First - even the example by @ljanda is quite simplistic and can be achieved with base in an easier way:

# base
result <- aggregate(price ~ cut, data=diamonds, FUN=function(x) round(summary(x)[-1],2))
result <- kable(result, format = "html", format.args = list(big.mark = ','),
                col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max"))
kable_styling(result, full_width = FALSE, position = "left")

Second - if pipe-like syntax (left to right) is more readable, this too can be achieved within base:

aggregate(price ~ cut, data=diamonds, FUN=function(x) round(summary(x)[-1], 2)) ->.
kable(., format = "html", format.args = list(big.mark = ','),
      col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max")) ->.
kable_styling(., full_width = FALSE, position = "left")

This would also allow you to stop in the middle of the "pipeline" and continue from where you left off.
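For instance, a small sketch of that stop-and-continue workflow:

aggregate(price ~ cut, data=diamonds, FUN=function(x) round(summary(x)[-1], 2)) ->.
.             # inspect the intermediate result stored in `.`
kable(.) ->.  # then continue the "pipeline" from where you left off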

So the way I see it, the discussion about the advantages and disadvantages of the pipe could be framed as a comparison with this style of syntax instead - especially because all the advantages proposed so far seem to be about readability.

wbuchanan commented 5 years ago

@KKPMW Thanks for the alternate example. The only problem I would see is overriding the value of ., but since R uses fairly different operators for method calls on objects it may not be terrible.

That said, there are still some potential overhead differences from reassigning values to the existing object in memory. While I don’t agree with everything @DavidArenburg mentioned above, I do agree that there are definitely cases where data.table is the right solution. What I’m less certain about is whether the same memory benefit is achieved if the object grows in memory consumption along the way. For example, if the data set were arbitrarily small and the aggregation result several times larger (say, something analogous to a multidimensional cube in the world of relational databases), would it still perform as well, or would it run into memory-corruption issues or the overhead associated with reallocating memory, since the pointers would no longer provide access to the necessary amount of memory?

karoliskoncevicius commented 5 years ago

@wbuchanan

That said, there is still some potential overhead differences from reassigning values to the existing object in memory.

Based on a few benchmarks, I am finding that with small objects the ->. assignment is a lot faster than the pipe, and when the object size is large they converge to the same speed.

Small object:

library(magrittr)        # provides %>%
library(microbenchmark)

x <- 1:10
microbenchmark(pipe={x %>% log %>% sqrt}, base={x ->.; log(.) ->.; sqrt(.)}, times=1000)

Unit: nanoseconds
 expr   min      lq      mean  median    uq    max neval
 pipe 49933 52920.0 57798.409 55523.5 61429 167158  1000
 base   572   708.5   929.804   834.0   946  55760  1000

Large object:

x <- matrix(abs(rnorm(1000000*100)), ncol=100)
microbenchmark(pipe={x %>% log %>% sqrt}, base={x ->.; log(.) ->.; sqrt(.)}, times=10)

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 pipe 2.003351 2.033280 2.057402 2.047359 2.081823 2.125832    10
 base 1.983885 2.016597 2.065985 2.065186 2.102157 2.143859    10

Of course, I haven't tested this thoroughly. But a few advantages of ->. that come to mind are: 1) no dependencies; 2) easier to "get" what is going on behind the scenes; 3) you can stop at any step, inspect the result ., then continue with the next step without recomputing the whole pipeline; 4) faster (probably).

DavidArenburg commented 5 years ago

@KKPMW try benching with bench::mark() or bench::press() as it also tests memory allocation.

wbuchanan commented 5 years ago

@KKPMW I think it is just as easy to step through the code regardless of the convention being used, but definitely interesting to see the differences in performance.

karoliskoncevicius commented 5 years ago

@DavidArenburg

Tried bench::mark and memory allocation only differed for very small objects (in favour of ->.).

library(magrittr)  # provides %>%
library(bench)     # provides mark()

# small object
x <- rnorm(20)+100
mark(pipe = {x %>% log %>% head(10) %>% sqrt},
     base = {x ->.; log(.) ->.; head(., 10) ->.; sqrt(.)},
     iterations=10)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 pipe       124.29µs    142µs     6389.      488B        0    10     0
2 base         9.35µs   10.6µs    80682.      208B        0    10     0

# larger object
x <- rnorm(1000000)+100
mark(pipe = {x %>% log %>% head(10) %>% sqrt},
     base = {x ->.; log(.) ->.; head(., 10) ->.; sqrt(.)},
     iterations=10)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 pipe         7.93ms   8.63ms      113.    7.63MB        0    10     0
2 base         7.66ms   8.29ms      120.    7.63MB        0    10     0

drag05 commented 3 years ago

@ljanda: Had you considered using the switch() function in your base R example, it would have worked against your argument. Better yet (and leaving knitr::kable() out, because it is outside the scope of the discussion):

> require(data.table) 

> dt = as.data.table(diamonds)

> unique(
+   dt[, c('Median', 'Mean', 'Q3', 'Max') := .(median(price), mean(price), quantile(price, 3/4), max(price)), keyby = cut],
+   by = 'cut')

    carat       cut color clarity depth table price    x    y    z Median     Mean      Q3   Max
1:  0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49   3282 4358.758 5205.50 18574
2:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31   3050 3928.864 5028.00 18788
3:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48   2648 3981.760 5372.75 18818
4:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31   3185 4584.258 6296.00 18823
5:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43   1810 3457.542 4678.50 18806

No group_by(), no summarize(), no round(). A simple line of code that clearly states the intent and is efficient.

Even simpler (and still no piping!):

> dt[, .(Median = median(as.numeric(price)), Mean = mean(price), Q3 = quantile(price, 3/4), Max = max(price)), keyby = cut]

         cut Median     Mean      Q3   Max
1:      Fair 3282.0 4358.758 5205.50 18574
2:      Good 3050.5 3928.864 5028.00 18788
3: Very Good 2648.0 3981.760 5372.75 18818
4:   Premium 3185.0 4584.258 6296.00 18823
5:     Ideal 1810.0 3457.542 4678.50 18806