dan-reznik opened 5 years ago
Sorry, I don't consider the Tidyverse any more aesthetically pleasing than base R. If so, I would have written my own Tidy-like wrappers long ago.
@dan-reznik The tidyverse isn't optimal for data wrangling by a long shot. I agree with @matloff that the main reason for the tidyverse being popular is simply a bandwagon effect among newbies. The real solution is to learn data.table. The benefits are tremendous. On top of learning data.table for R, the H2O team is building a replica of data.table for Python, so you should be able to transition easily between programming languages (and avoid Pandas, which is also terrible).
Imagine dplyr w data.table as the backend
Here's a pretty thorough writeup by both Hadley and the data.table team: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly
Also, I'm pretty sure you can use pipes with data.table, but chaining is built in as well. For example,
data[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL]
You can also use anonymous functions should you desire
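A minimal, runnable sketch of both styles — chaining and magrittr pipes — assuming a toy table whose column names just mirror the one-liner above:

```r
library(data.table)
library(magrittr)

dt <- data.table(Group = c("a", "b", "c"), Random = c(1, 3, 5))

# Chaining: each [] returns the data.table, so calls string together
dt[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL]

# The same steps with magrittr pipes; the dot stands in for the table,
# and := still updates it by reference at each step
dt2 <- data.table(Random = c(1, 3, 5))
res <- dt2 %>%
  .[, temp := mean(Random)] %>%
  .[, RandomRes := Random - temp] %>%
  .[, temp := NULL]
```

Both forms leave the same RandomRes column (here -2, 0, 2) and drop the helper column temp.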
Good comments, Adrian and Dan.
I've used data.table[] whenever speed is at a premium; most of the time, however, expressivity is.
Imagine dplyr w data.table as the backend
Do you mean something like the dtplyr package? I don't use it personally, but it generates data.table calls from dplyr syntax.
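For reference, a small dtplyr sketch (the data frame and column names are invented for illustration): lazy_dt() wraps the data, dplyr verbs build up a translation, and collecting the result triggers the actual data.table computation.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(g = rep(c("a", "b", "c"), each = 2), x = 1:6)

# Nothing is computed yet; dtplyr only records the verbs
lazy <- lazy_dt(df) %>%
  group_by(g) %>%
  summarise(m = mean(x))

# show_query(lazy) would print the generated data.table call
res <- as_tibble(lazy)   # collection runs the data.table code
```

The group means here come out as 1.5, 3.5, and 5.5 for groups a, b, and c.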
I've used data.table[] whenever speed is at a premium; most of the time, however, expressivity is.
I would argue base R + data.table is more expressive, since you can accomplish essentially all of the Tidyverse data-wrangling operations (dplyr, tibble, readr, tidyr, purrr) with very few functions in base R and data.table (mostly just [.data.table and :=).
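To illustrate the claim, here is roughly how a few common dplyr verbs map onto those two functions (toy data and column names are made up):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# filter + select:      dplyr: filter(dt, x > 1) then select(g, x)
sub <- dt[x > 1, .(g, x)]

# mutate, by reference: dplyr: mutate(dt, z = x + y)
dt[, z := x + y]

# group_by + summarise: dplyr: group_by(g) %>% summarise(m = mean(x))
agg <- dt[, .(m = mean(x)), by = g]
```

Each line is one [.data.table call; only the mutate step needs :=.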
@dan-reznik
Imagine dplyr w data.table as the backend
The dtplyr package does a good job at that - I use it anytime I need to do operations over a large amount of groups and it's pretty seamless
It really becomes a matter of personal taste. I believe Hadley has said that dtplyr can never be as fast as data.table, because it does not modify objects in place. But you may like the dplyr syntax so much that that becomes the deciding factor for you.
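The in-place point is easy to check directly: := adds a column to an existing data.table without allocating a new object. A small sketch using data.table's address() (variable names are arbitrary):

```r
library(data.table)

dt <- data.table(x = 1:3)
before <- address(dt)

dt[, y := x * 2]        # := updates dt by reference: no copy is made
same <- identical(address(dt), before)

dt2 <- copy(dt)[, z := x + 1]   # explicit copy(): a distinct object
moved <- !identical(address(dt2), before)
```

After the := step, same is TRUE (the table kept its memory address), while the copy()-based version lives at a different address, which is the overhead a translation layer like dtplyr has to accept when it avoids mutating its input.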
@matloff I actually went ahead and did a short simulation study (see blog post here). It looks like, at least for operations over many groups, dtplyr comes pretty close.
Thanks for the update. I'm not familiar with the internal workings of dtplyr in its translation to data.table, so I can't comment, but it's good to know that you found it worked well.
@AdrianAntico
Here's a pretty thorough writeup by both Hadley and the data.table team: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly
Also, I'm pretty sure you can use pipes with data.table, but chaining is built in as well. For example,
data[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL]
You can also use anonymous functions should you desire
See also this Q & A on StackOverflow about chaining with data.table:
- the data.table way of chaining (my answer)
- magrittr pipes (other answer)
Yes, there could be 60+ functions, but rectangling data is complex. I am also sure it is long-tailed, i.e., 5-10 will cover 80-90% of your cases. Also, moving away from base R will help you address your very first point: syntax aesthetics.