merely-useful / talk

A talk about the "Merely Useful" books and what we've learned writing them.

Comparing and contrasting the Python and R books #3

Open DamienIrving opened 3 years ago

DamienIrving commented 3 years ago

For the Python and R RSE books we essentially had to document the logical progression of steps involved in data processing / package development in a generic, teachable manner. One of the main things we need to do in the talk (I think) is succinctly describe, compare and contrast that logical progression in Python and R.

I've had a go at doing the first part (succinctly describe) below - my naive attempt at the R description is missing a bunch of steps and I don't know anything about many of the tools mentioned, so if an R person could provide a more complete description that would be great.

Once we've got a succinct description for Python and R, I'd love to hear people's thoughts on the most noteworthy similarities and differences between the Python and R approaches.

Task

Word count analysis/package to confirm Zipf's Law.
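For readers unfamiliar with the task, here is a minimal sketch of what "confirming Zipf's Law" can look like in Python. It assumes the law is stated as frequency being proportional to rank raised to the power of minus alpha, with alpha close to 1, and it estimates alpha with a plain least-squares fit on log-log axes. The function names `count_words` and `zipf_exponent` are hypothetical, and this is not necessarily the fitting method used in either book.

```python
import math
import re
from collections import Counter


def count_words(text):
    """Return a Counter of lowercase word frequencies in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def zipf_exponent(counts):
    """Estimate alpha in frequency ~ rank**(-alpha) via a log-log least-squares fit.

    Assumption: a simple straight-line fit is good enough for illustration.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    # The fitted slope is negative for a decreasing distribution; alpha is its magnitude.
    return -slope
```

Calling `zipf_exponent(count_words(text))` on a long text should give a value near 1 if the text roughly follows Zipf's Law; short texts will give noisy estimates.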

Python

R

DamienIrving commented 3 years ago

cc'ing @k8hertweck, @cwickham, @joelostblom and/or @lwjohnst86 to look over and improve the dot points summarising the R approach to a word count analysis/package to confirm Zipf's Law (see above for my naive first attempt).

DamienIrving commented 3 years ago

An alternative way to look at this information...

| Task | Python | R |
| --- | --- | --- |
| Adopt a directory structure consistent with ... | Python packaging | R packaging |
| Conduct basic file management ... | at the command line | with fs |
| Prototype code with ... | Jupyter notebook or an IDE | RStudio |
| Write code in a modular, reusable manner using ... | functions (that can be stored in .py module files) | functions (stored in .R files) |
| Version code using ... | Git at the command line | Git via RStudio integration and gert/usethis functions |
| Collaboratively develop code using ... | GitHub | GitHub and usethis PR helpers |
| Test code using ... | assertions and unit tests (pytest) | expectations and unit tests (testthat) |
| Automatically run tests using ... | Travis CI (future: GitHub Actions) | GitHub Actions |
| Handle program reporting and errors with ... | logging and exceptions | logging and stop function for errors |
| Automate data processing workflows using ... | command line scripts and Make | scripts and targets package |
| Configure workflows using ... | function arguments, command line arguments and YAML files | function arguments |
| Capture workflow provenance by archiving ... | scripts, conda environment and Makefile/s on Zenodo | Zenodo, GitHub tags/releases |
| Generate reproducible documents with ... | Pweave, jupyter book | rmarkdown |
| Build and distribute a package using ... | pip and PyPI | devtools and R-hub, CRAN |
| Document a package and make a website using ... | docstrings, sphinx and ReadTheDocs | vignettes, roxygen2 and pkgdown |
| Create and distribute a data package by ... | datapackage library to create, store at Figshare/Zenodo/wherever | storing and exporting the data object in an R package |

Italics indicate things that aren't covered in detail in our books but are included in the table for completeness.
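To make a few of the Python-column entries concrete (command line arguments for configuration, logging for reporting, exceptions for errors), here is a rough sketch of a word-count script using only the standard library. The logger name `countwords`, the option names and the function names are all hypothetical; this is an illustration of the pattern rather than code taken from the book.

```python
import argparse
import logging
from collections import Counter

logger = logging.getLogger("countwords")  # hypothetical logger name


def parse_args(argv=None):
    """Read configuration from command line arguments."""
    parser = argparse.ArgumentParser(description="Count words in text files.")
    parser.add_argument("infiles", nargs="+", help="input text files")
    parser.add_argument("-n", "--num", type=int, default=10,
                        help="number of most common words to report")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="log progress messages")
    return parser.parse_args(argv)


def count_file(path):
    """Count whitespace-separated, lowercased words in one file."""
    with open(path, encoding="utf-8") as handle:
        return Counter(handle.read().lower().split())


def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.WARNING)
    total = Counter()
    for path in args.infiles:
        try:
            total += count_file(path)
            logger.debug("processed %s", path)
        except FileNotFoundError:
            # Report the problem and carry on rather than stopping the whole run.
            logger.error("skipping missing file %s", path)
    return total.most_common(args.num)
```

Calling `main(["dracula.txt", "-n", "5"])` (with a hypothetical input file) returns the five most common words; adding the usual `if __name__ == "__main__":` guard that prints the result of `main()` turns it into a script that Make can call.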

k8hertweck commented 3 years ago

This is so useful! Thanks @DamienIrving !

I edited the table above for the Git in R section, as that's a section I've worked on and could update off the top of my head. The outline of the R book is available here for future reference.

The ??? in the R section have me thinking about how the overall purposes of the books are slightly different. I think this is because there are already excellent resources for R package development, but the gap in the R community is specifically package development for data analysis. This means the background knowledge we expect folks to have when starting the book is slightly different, and as a result, the order of topics and specific emphasis differs fairly substantially.

I'm thinking the table showing equivalency of tasks and which tools are used for each is really useful. For topics we're not covering in R, though, it makes sense to talk about the differences in philosophy of creating packages in each language. For example, the use of reproducible reports in R vs. workflow provenance in Python.

Other folks may have more ideas to add here!

DamienIrving commented 3 years ago

Thanks, @k8hertweck.

I should have mentioned - don't worry about the order of the tasks in the table. I'm not trying to make the order match the books.

You make a good point about the fact that some of the tasks in the table aren't a focus for the R book. While most of the content of the table should obviously be things that we cover in the books, for completeness I don't think it's a problem if some of the things aren't in the book. I've edited the table so that things that aren't in the book are indicated in italics (e.g. we don't spend time in the Python book discussing the features of the Jupyter notebook or IDEs for prototyping code, but for completeness I mention that in the table).

Are there widely used/accepted tools that R people use for data processing workflow management and coordination (i.e. execution order, logging, configuration) that could be listed in italics even if that topic isn't covered in the book?

I'm also happy for more tasks to be added to the table (e.g. there might be some topics covered in the R book but not Python). Perhaps "automate report generation using..." is a task we should include?

lwjohnst86 commented 3 years ago

Sorry for the late addition to this, June has been super busy for me. @DamienIrving really really nice table, super useful! I've made some edits to it to fill in some of the spaces.

DamienIrving commented 3 years ago

Thanks, all. The table is looking fairly complete, so it's time to consider what it says about the similarities and differences between how R people and Python people do research software engineering (in a generic, best practice sense, as defined by us). Here are some initial thoughts:

When you break things down into the core tasks associated with data processing and software package creation (i.e. research software engineering), it becomes clear that both Python and R can do the job (i.e. in both cases the tools exist to do the tasks). Having said that, the subtle differences between those tools and the tasks/tools we chose not to include in our books speak to some interesting differences:

Side question: Are the Python and R experiences converging? If you do all your data processing in the Jupyter notebook (which I'm sure is true for a growing number of Python users) then things become more self-contained. Things like make and command line arguments for configuration aren't used and logging becomes less of a need, since you're seeing the output from each command as you execute it. Jupyter book could also make reproducible documents much more popular with Python people.

(I'm certainly not advocating for a move to Jupyter for more than code prototyping - it actually concerns me greatly and is one of the reasons why I think our Python book is important.)

mbonsma commented 3 years ago

I love the table and I agree with your summary of differences @DamienIrving.

As someone who uses Jupyter for almost all analysis IRL, I would agree that the Python and R experiences are converging. The R experience seems intentional and well-supported, while in Python the convergence seems like a coincidence and/or a response to the R ecosystem.

joelostblom commented 3 years ago

Sorry this became long...

Agree with what others have said, super useful comparison table, thanks @DamienIrving ! The only major point that I don't agree with is that reproducible documents are not a big thing in Python. I believe both Jupyter notebooks and R Markdown/Notebooks encourage creation of reproducible reports with commentary and outputs in one place, and that they both have widespread uptake in their communities, to the point that they have changed how most people conduct programmatic data analysis.

In terms of the emphasis on reproducibility in the notebook interface, I think the main reproducibility advantage for R Notebooks is that they have to be run in order when knitting to another format such as HTML, whereas Jupyter notebooks can be exported with out-of-order execution. Since R Notebooks don't store output, the most common way to share the results of your analysis is to make sure it runs from top to bottom, which encourages this behavior to a higher degree than in Jupyter notebooks (especially since ipynb is rendered on GitHub). If you are not knitting, however, but using the R Notebook interface in RStudio to view output, it is still fully possible to run your cells in whatever order you want, save the notebook Rmd file, and then have it not work when you open it in a new session and try running it from top to bottom (please correct me if there is some preventative mechanism for this that I have missed).

Although Jupyter notebooks can be exported out of order, they recently added a visual indicator when cells are edited after they are run, and there are packages that keep track of the execution order of cells as well as the definition order of variables (which I think would be one of the best solutions if it could be integrated into the core notebook interfaces in both languages in a non-intrusive way). A simpler mechanism that I think would be great is a brief warning or visual indicator in notebooks that encourages you to run all cells in order before quitting a session.
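As a rough illustration of what such a check could look like, the sketch below reads the `execution_count` that the nbformat 4 JSON layout stores for each code cell and tests whether the non-empty code cells were run top to bottom as 1, 2, 3, and so on. The function names are hypothetical and the check is deliberately strict: it assumes a fresh "restart and run all", and it cannot tell whether a cell was edited after it was run.

```python
import json


def load_notebook(path):
    """Load an .ipynb file as a plain dictionary (it is JSON on disk)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def execution_counts(notebook):
    """Return execution counts of non-empty code cells, in document order.

    A cell's source may be a string or a list of strings; joining handles both.
    """
    return [cell.get("execution_count")
            for cell in notebook.get("cells", [])
            if cell.get("cell_type") == "code"
            and "".join(cell.get("source", "")).strip()]


def ran_in_order(notebook):
    """True if every non-empty code cell was executed in order: 1, 2, 3, ..."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))
```

Something like `ran_in_order(load_notebook("analysis.ipynb"))` (with a hypothetical file name) could be run before committing a notebook, or as part of a test suite.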

I agree that the languages are converging on many solutions, and I believe it would be possible to write a Python book that has more of the same angle as the R book or vice versa. So while I think it is important to contrast the differences between the book strategies and motivate our choices, I think reproducible documents could be part of a project in either language depending on the nature of the project, while makefiles, packages, or other similar mechanisms are always beneficial to include once the project reaches a certain size.

In the table, I believe Jupyter notebooks are a more suitable equivalent to R Markdown/R Notebooks, and that JupyterBook is more like bookdown/distill in R, which I see as a tool for documentation and multi-page reports rather than a single reproducible analysis document (but I agree that these are more reproducible, since the only way they work is if everything runs from top to bottom). Also, I think Pweave is not maintained anymore; using Jupytext with notebooks is a much more popular alternative for most of that functionality.