matloff / R-vs.-Python-for-Data-Science


Nice writeup, but you should learn the Tidyverse #18

Open dan-reznik opened 5 years ago

dan-reznik commented 5 years ago

Yes, there could be 60+ functions, but rectangling data is complex. I'm also sure usage is long-tailed, i.e., 5-10 of them will cover 80-90% of your cases. Also, moving away from base R will help you address your very first point: syntax aesthetics.

matloff commented 5 years ago

Sorry, I don't consider the Tidyverse any more aesthetically pleasing than base R. If I did, I would have written my own Tidy-like wrappers long ago.

AdrianAntico commented 5 years ago

@dan-reznik The tidyverse isn't optimal for data wrangling by a long shot. I agree with @matloff that the main reason for the tidyverse being popular is simply a bandwagon effect among newbies. The real solution is to learn data.table. The benefits are tremendous. On top of learning data.table for R, the H2O team is building a replica of data.table for Python, so you should be able to transition easily between programming languages (and avoid Pandas, which is also terrible).

dan-reznik commented 5 years ago

Imagine dplyr w data.table as the backend

AdrianAntico commented 5 years ago

Here's a pretty thorough writeup by both Hadley and the data.table team: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly

Also, I'm pretty sure you can use pipes with data.table, but you can also chain bracket calls directly. For example,

data[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL] 

You can also use anonymous functions should you desire
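To make the chain above concrete, here is a minimal runnable sketch, assuming a numeric column named `Random` as in the one-liner (the toy data is invented for illustration):

```r
library(data.table)

set.seed(1)
data <- data.table(Random = rnorm(10))

# Each [] returns the data.table, so the calls chain left to right:
# compute a temp column, subtract it, then drop it
data[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL]

# Equivalent one-step version, with no temporary column at all
data[, RandomRes2 := Random - mean(Random)]
```

Both approaches produce the same residuals; the chained form just makes each step explicit.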

matloff commented 5 years ago

Good comments, Adrian and Dan.

dan-reznik commented 5 years ago

I've used data.table[] whenever speed is at a premium; however, most of the time expressivity is.

ToeKneeFan commented 5 years ago

> Imagine dplyr w data.table as the backend

Do you mean something like the dtplyr package? I don't use it personally, but it generates data.table calls from dplyr syntax.
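For anyone who hasn't seen dtplyr, a minimal sketch of how it's used (the data and column names here are invented for illustration):

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap a data.table in a lazy translation layer
dt <- lazy_dt(data.table(g = c("a", "a", "b"), x = 1:3))

# Write ordinary dplyr; nothing runs until the result is collected
res <- dt %>%
  group_by(g) %>%
  summarise(mx = mean(x)) %>%
  as_tibble()   # evaluation happens only at this step
```

Calling `show_query()` on the pipeline (instead of `as_tibble()`) prints the data.table code it generates, which is a nice way to learn the translation.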

> I've used data.table[] whenever speed is at a premium, however most of the time expressivity is.

I would argue base R + data.table is more expressive, since you can accomplish essentially all of the Tidyverse data-wrangling operations (dplyr, tibble, readr, tidyr, purrr) with very few functions in base R and data.table (mostly just [.data.table and :=).
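To illustrate the claim, here is a rough mapping of common dplyr verbs onto `[.data.table` and `:=` (a sketch with invented toy data, not an exhaustive correspondence):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

filtered <- dt[x > 1, .(g, x)]              # ~ filter() + select(): i and j arguments
dt[, z := x + y]                            # ~ mutate(), done by reference
summed   <- dt[, .(total = sum(x)), by = g] # ~ group_by() + summarise()
sorted   <- dt[order(-x)]                   # ~ arrange(desc(x))
```

One function signature covers filtering, column selection, grouped aggregation, and ordering, which is the expressivity argument in a nutshell.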

IyarLin commented 4 years ago

@dan-reznik

> Imagine dplyr w data.table as the backend

The dtplyr package does a good job at that. I use it any time I need to do operations over a large number of groups, and it's pretty seamless.

matloff commented 4 years ago

It really becomes a matter of personal taste. I believe Hadley has said that dtplyr can never be as fast as data.table, due to not modifying objects in place. But you may like the dplyr syntax so much that it becomes the prime issue for you.
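The in-place behavior at issue can be observed directly with data.table's exported `address()` helper (a small sketch with invented data):

```r
library(data.table)

dt <- data.table(x = 1:3)

a1 <- address(dt)
dt[, y := x * 2L]   # := adds the column by reference, without copying dt
a2 <- address(dt)

identical(a1, a2)   # TRUE: the object was modified in place
```

A copying translation layer must instead allocate a new object for each step, which is the overhead being discussed.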

IyarLin commented 4 years ago

@matloff I actually went ahead and did a short simulation study (see blog post here). Looks like, at least for operations over many groups, dtplyr comes pretty close.

matloff commented 4 years ago

Thanks for the update. I'm not familiar with the internal workings of dtplyr in its translation to data.table, so I can't comment, but it's good to know that you found it worked well.

jaapwalhout commented 4 years ago

@AdrianAntico

> Here's a pretty thorough writeup by both Hadley and the data.table team: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly
>
> Also, I'm pretty sure you can use pipes with data.table but you can also chain with data.table. For example,
>
> `data[, temp := mean(Random)][, RandomRes := Random - temp][, temp := NULL]`
>
> You can also use anonymous functions should you desire

See also this Q & A on StackOverflow about chaining with data.table:

  1. the data.table way of chaining (my answer)
  2. chaining with magrittr-pipes (other answer)