matloff / R-vs.-Python-for-Data-Science

Tidyverse vs 'base' R (Language Unity) #13

Open pablox-cl opened 5 years ago

pablox-cl commented 5 years ago

I have been learning R using a mix of base R and the tidyverse (mostly dplyr). I got concerned about Language Unity, because I was only able to start using R thanks to the tidyverse (just my experience, of course); before using dplyr it was just too hard. Have you ever written in detail about this issue so I can read more?

matloff commented 5 years ago

I don't think of dplyr as tidyverse.

BobMuenchen commented 5 years ago

The tidyverse does make many things easy, but it also adds complexity for even simple things like printing: https://www.r-bloggers.com/the-tidyverse-curse/.

matloff commented 5 years ago

Thanks for the link to the "curse" comment. I once saw a posted writeup of the "proper" way to do "Hello world!" in Java, and it was something like 20 lines long!

BobMuenchen commented 5 years ago

Your comment about unity got me wondering what the ratio of base to tidyverse functions might be. Summing length(getNamespaceExports("package_name")) over base, stats, utils, methods, and graphics gives 2,519 functions. Doing the same on the tidyverse packages gives 1,162 functions. So the tidyverse is nearing half the size of the main R installation (I skipped Autoloads and grDevices, guessing that they're shared by both).
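For anyone who wants to repeat the count, here is a minimal sketch of one way to do it. The helper name count_exports and the use of tidyverse::tidyverse_packages() to decide which packages count as "the tidyverse" are assumptions of this sketch, not necessarily what was run above. Exports also include non-function objects, and totals change with R and package versions, so the numbers will likely differ from the ones quoted.

```r
# Sum the number of exported objects across a set of package namespaces.
# Exports can include non-function objects, so treat this as an upper bound
# on the number of exported functions.
count_exports <- function(pkgs) {
  sum(vapply(pkgs, function(p) length(getNamespaceExports(p)), integer(1)))
}

# The five packages named in the comment above.
base_pkgs <- c("base", "stats", "utils", "methods", "graphics")
base_total <- count_exports(base_pkgs)

# Assumption: the tidyverse meta-package is installed. If it is not,
# tidy_total stays NA and only the base count is reported.
tidy_total <- NA_integer_
if (requireNamespace("tidyverse", quietly = TRUE)) {
  tidy_pkgs <- tidyverse::tidyverse_packages()
  # Keep only the member packages that are actually available locally.
  available <- vapply(tidy_pkgs, requireNamespace, logical(1), quietly = TRUE)
  tidy_total <- count_exports(tidy_pkgs[available])
}

cat("base exports:", base_total, "\n")
cat("tidyverse exports:", tidy_total, "\n")
```

Swapping in a different package list for tidy_pkgs is enough to test other definitions of "the tidyverse".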

pablox-cl commented 5 years ago

@BobMuenchen thanks for the link!

@matloff oh, then I'm more lost. I understand that the tidyverse is a collection of tools, dplyr being one of the most important, and dplyr also uses pipes a lot; that's why I made the link :). Could you clarify when the issue arises?

matloff commented 5 years ago

Correct me if I am wrong, but I don't think the original dplyr used pipes, and moreover, it certainly would not have to.

stevekm commented 5 years ago

Don't overlook the issue of unnecessary dependencies. Every time you include a non-base R library, you have to drag along and version control that library everywhere you want to use your code. This is not trivial, and is the source of many problems when working on teams or shared projects across multiple systems. Years of experience have shown that you should simply learn the base-R methods of doing most things (easily findable on Stack Overflow) and avoid dplyr, tidyverse, etc. The few edge cases that can't be adequately handled with base R are generally solved with reshape2, data.table, and the like, barring things like ggplot2, knitr, etc., which are non-base essentials.
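To make the version-control burden concrete, here is a rough sketch of the bookkeeping each non-base dependency implies. The dependency list deps and the helper record_versions are hypothetical names for illustration; only requireNamespace() and utils::packageVersion() are real base-R calls.

```r
# Hypothetical dependency list for a project: two packages that ship with R
# and one that has to be installed and tracked separately.
deps <- c("stats", "utils", "dplyr")

# Record whether each package is available and, if so, which version.
record_versions <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  version <- vapply(
    pkgs,
    function(p) {
      if (installed[[p]]) as.character(utils::packageVersion(p)) else NA_character_
    },
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(
    package = pkgs,
    installed = unname(installed),
    version = version,
    stringsAsFactors = FALSE
  )
}

dep_table <- record_versions(deps)
print(dep_table)
```

Every row beyond the packages bundled with R is something each collaborator and each machine has to match.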

pablox-cl commented 5 years ago

@matloff no, you don't need to. But if you look at the documentation for dplyr, the magrittr pipe (%>%) appears ubiquitously. That's why I made the connection.

@stevekm I understand the problem of unnecessary dependencies. I suppose in the end, it will always depend on the compromises you need to make for the sake of your project.

Edit: I don't know how old this other article is, but I believe it answers the issue I was asking about perfectly: https://github.com/matloff/TidyverseSkeptic Thanks!

matloff commented 5 years ago

Yes, please see my TidyverseSkeptic page, which I update constantly.

There is nowhere, including in dplyr, where pipes are necessary.
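As a small illustration of that point, the sketch below computes the same summary three ways: with dplyr and the magrittr pipe, with dplyr and plain nested calls, and with base R alone. The choice of the built-in mtcars data and the variable names are assumptions made for the example; the dplyr part runs only if dplyr is installed.

```r
# Base R only: mean mpg of the 4-cylinder cars in the built-in mtcars data.
base_mean <- mean(mtcars$mpg[mtcars$cyl == 4])

same_result <- NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  # dplyr re-exports the magrittr pipe; bind it locally without attaching dplyr.
  `%>%` <- dplyr::`%>%`

  # Piped style, as commonly seen in the dplyr documentation.
  piped <- mtcars %>%
    dplyr::filter(cyl == 4) %>%
    dplyr::summarise(mean_mpg = mean(mpg))

  # The same dplyr verbs written as ordinary nested function calls.
  nested <- dplyr::summarise(dplyr::filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

  # Both dplyr versions should agree with each other and with base R.
  same_result <- isTRUE(all.equal(piped, nested)) &&
    isTRUE(all.equal(piped$mean_mpg, base_mean))
}

print(base_mean)
print(same_result)
```

The pipe changes only how the calls are written, not what they compute.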

The issue of UNNECESSARY dependencies is a serious one. Not only is it a nuisance, but it can also cause trouble in a shared system situation, e.g. a school.