Review Ticket for Data Wrangling and Management in R

ianmilligan1 commented 7 years ago

The Programming Historian has received the following tutorial on 'Data Wrangling and Management in R' by Nabeel Siddiqui (@nabsiddiqui). This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/data_wrangling_and_management_in_R

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudsperson (Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

rogorido commented 7 years ago

There is a small inaccuracy in the text. It says:

In R, you will find that looping is becoming increasingly less common due to the pipe operator.

This is not correct: looping is not becoming less used because of the pipe operator. Looping has never been a common practice in R because it uses vectorization which is much faster.

rogorido commented 7 years ago

Only one small point: I think (if possible) you should always write dplyr with small letters. In R capital and small letters are of course different and if you try

library(Dplyr)

you will get an error.

rogorido commented 7 years ago

Sorry, me again. I think you should also link the very good cheatsheet which can be found here.

ianmilligan1 commented 7 years ago

Thanks @rogorido for your comments!

Just a heads up that I have lined up a peer reviewer. Just finding a second.

greebie commented 7 years ago

Hi there,

A couple of initial comments upon first reading.

The mtcars tutorial for a data-frame is pretty common -- I wonder if it just makes sense to show how to make a dataframe in your own dataset rather than re-doing the original tutorial?
Please be careful of broad-range judgements that are not backed up. For instance, in line 6 you say "The most common data type in R is a data frame." Technically, data frames are objects and not data types and certainly not more commonly used as vectors.
I think the tutorial could benefit from some editing for clarity. For example, line 15 seems is a bit of a non sequitur (What if the reader has no clue what Zotero is? Does it matter?) and could be cut without costing you anything on clarity, not to mention that line 16 is a great paragraph that gets right to the point of the section. Then line 17 gets into corpuses without defining them.
Since you promise a tutorial on dplyr, it be nice to see what it can do right off. Maybe start by showing what dplyr can do first and then break it down by explaining dataframes, pipes etc. ? I also think the current introduction does not really tell us what dplyr does. For instance, why would we use this instead of R's pretty robust notation for arranging data?
Overall, I think you do a good job of describing the main dplyr functions. I would like to see these emphasized sooner and the stuff on data.frames reduced more. I also think I'd like to see more about why dplyr would be something a historian would care about (how does this kind of data-wrangling help conducting historical analysis?)

greebie commented 7 years ago

Hello again. I hope my feedback is helpful to you! As promised, I've worked in a more detailed break-down:

Line 4. I think it would be nice to dedicate a couple of sentences on tidyverse, then suggest that you will focus on dplyr. It would just be a nice courtesy for the user who is being asked to install a package when they only want the dplyr library.

Line5. Instead of "R is not the best language for everything. It is designed specifically for exploratory data analysis or advanced statistical modeling" I would just write "R is best for exploratory data analysis and advanced statistical modelling." I don't think your reader will assume that R is the best language for everything.

"Specifically" is a word you can remove from this section. R has a lot of general purposes and is better at some things more than others. R has even been used to create video games!.

I don't think #1 is a good argument for R versus spreadsheets.

Lines 6-14: Since your tutorial is about tidying data, I don't think you should have an extended description of data frames. However, noticing that the R-Basics tutorial uses a data frame without saying "data frame" I can understand your conundrum.

Lines 9-26: I like the pipe tutorial. That we are using "magrittr" all of the sudden (I thought this was about dplyr) might throw the reader. Again, letting people know about all the packages would be great. Especially because they have such wonky names. Knowing that this package is after René Magritte (of "Ceci n'est pas une pipe" fame) might help them remember what the package is for. This is not a pipe -- get it? :)

As discussed in my earlier comment, I think it'd be more readable to see you using pipes to solve real historical problems rather than the same old mtcars tutorial.

Line 27-31: Since this is an intermediate tutorial, I'd really like to see the historical datasets early on.

Line 33: I'm not sure that the section on conditional operators help, especially since you've recommended that the reader have some familiarity with programming. As it stands, it breaks up the description of dplyr and one of the commands ("select) which is confusing.

Line 37: Instead of "pass a conditional" maybe you could write "remove data using a test requirement." Semantically, filtering is more about refining data than it is about operators.

Overall I really enjoyed the last section. I think there could be more detail included about how tidyverse helps organize data and save time -- something that almost every researcher could benefit from. Management issues with data are very frustrating and can waste loads of time. Something like tidyverse can really help people avoid the usually pitfalls of working with wide datasets.

ianmilligan1 commented 7 years ago

Wonderful, thanks for your review @greebie.

@nabsiddiqui – I have lined up another peer reviewer. It's a busy time of term so I have assigned them a deadline of 17 May. Once I receive both reviews, I'll close discussion and sum up the reviews rec'd to date.

nabsiddiqui commented 7 years ago

Thanks @greebie, @rogorido, and @ianmilligan1. I will begin working on the changes that are already posted, and I look forward to the aditional comments on May 17.

nolauren commented 7 years ago

Thank you for the tutorial. It offers an introduction to data frames, tidy data, and then the kinds of exploratory data analysis possible with dplyr. While the tutorial offers needed areas of coverage for the Programming Historian, it could benefit from a clearer audience and structure as well as the building out of several areas. The following feedback is provided to help with these areas.

Audience: It is unclear the exact audience for this tutorial. It moves back and forth between assuming the user is a beginner and intermediate programmer. The tutorial should assume just one audience.

Structure: It seems that the tutorial is trying to do three things. First, it is trying to explain the concept of data frames since it is not covered in the R Basics tutorial. Then, it is attempting to introduce the idea of tidy data followed by doing exploratatory data analysis with dplyr. For structure, I'd recommend the tutorial begin by explaining the goal and purpose of tidy data and dplyr and give a quick example as stated by @greebie. Next, I would introduce the idea of data frames, because they are necessary and aren't covered in the R Basics tutorial. After, I'd move into the details of how to use tidy data and dplyr.

Build Out:

In the current version, the tutorial doesn't explain the theory or ideas behind tidy data or dplyr. These should be introduced at the beginning and reiterated throughout. The initial example should also be historical so the user knows immediately why this method might help.
When covering the details of tidy data and how to use dplyr, the tutorial needs to further explain the key concepts and make it clear why one might want to use them. For example, what can we learn if we filter or mutate the data?

The changes suggested are to help the tutorial meet the needs of historians and humanists more broadly. Otherwise, the tutorial risks being a brief and unclear version of Hadley Wickham's R for Data Science (Section 5: Data Transformation): http://r4ds.had.co.nz/transform.html#introduction-2. I look forward to the next version of this tutorial!

ianmilligan1 commented 7 years ago

Wonderful, thanks so much @nolauren. As always it is extremely appreciated, especially given it's such a busy time of the year.

@nabsiddiqui – we now have two reviews. I'll shortly post my own editorial comments, but I think we've got a great pathway forward.

nabsiddiqui commented 7 years ago

Thanks @nolauren and everyone else. I look forward to your comments @ianmilligan1, and I hope to update the article soon thereafter.

ianmilligan1 commented 7 years ago

Thanks again for your tutorial, and to @greebie and @nolauren for their thoughtful reviews (and to @rogorido's points as well).

I think we've got some good overall pointers from the two reviews. I think it is open to whether you want it to be a beginner or intermediate level. My own gut, given that it requires some prior knowledge and it complements our other introductory R offerings, is that this might be worth pitching at an intermediate level.

In terms of structure, I think both reviewers flagged wanting a bit more introductory work. A bit more on dataframes, dplyr, pipes, and emphasizing the dplyr functions (while keeping it all solidly on track to be relevant to historians). I think by combining the comments, @nolauren provides a good structural suggestion which can be fleshed out with @greebie's particular points.

The build-out comments and specific comments are all worth attending to, and I think can be done without too much work.

What's a feasible turnaround for a new version incorporating these suggestions? Would two or three weeks be feasible, @nabsiddiqui?

nabsiddiqui commented 7 years ago

Dear Ian,

I am a little bit booked until the end of this month, and then I will be in HILT for the first week of June. I was hoping to start revising sometime there after. Unfortunately, that means an end of June time frame. If that is too late, let me know and I'll see if I can move some stuff around.

ianmilligan1 commented 7 years ago

Sure thing! I completely understand the constraint on our time. Late June works well. Could we commit to getting this submitted by 4 July?

nabsiddiqui commented 7 years ago

Sounds good. I will aim for that and let you know if anything comes up to delay it.

ianmilligan1 commented 7 years ago

Just an update: I've chatted with the author and we'll be looking at a late July revision. Looking forward to it!

nabsiddiqui commented 7 years ago

Hello all,

I now revised the submissions and noted my changes below. Thank you again for the initial feedback, and please let me know if there is anything else you would like to see revised further.

Based on requests by @nolauren :

Assumed a more intermediate and consistent audience
Removed the material on data frames, although it was mentioned that it should be kept by @nolauren and removed by @greebie . I decided to do this mainly due to the first point. I realized that data frames are a little off topic in this tutorial given its focus on dplyr. I will be happy to place the information back in if people prefer that it stay, but I am hoping that either someone will cover it in a different tutorial or the audience has enough of a background to know what data frames are.
Added an example to the front using historical census data to give the reader a "preview" of what dplyr can do
Added a paragraph in the introduction that expand on why tidy data should be used and some common ways that data is messy.
Added a small bit of information before introducing the five verbs that details why someone would use them rather than the R base package.

Based on requests by @greebie :

Removed mtcars data frame to include a historical data set
Removed broad-range judgements as I found them
Removed statement about Zotero
Added small information about why we would use dplyr rather than R base packages
Removed data frame material
Added early paragraph on tidyverse and some of the packages in them
Removed comparison between R and Excel
Removed part on conditional operators

Based on comments by @ianmilligan1 :

Used an example where the reader is importing historical data to see provide users a glimpse of how to import their own data
Removed section comparing to Excel based on requests by @nolauren to have more consistent audience and statements by @greebie that they are not well fleshed out.

ianmilligan1 commented 7 years ago

Hi @nabsiddiqui –

Thanks so much for your thoughtful revisions, which I think work really well. In particular, I think adding in your own important function works well; cleaning up some of the prose based on the peer review requests, and I think after working through the lesson again it works without the data frames section. I think the dplyr focus works well. You did an excellent job of talking about tidy data, and I think the focus on R is natural.

I've gone through the lesson and here are some quick comments inline:

Para 9: Maybe just provide a quick reminder to users about how to run a line in R Studio (pressing enter generally does the trick?). I.e. you could note "paste the following code into your R Studio prompt and press enter."
Para 11: Because I know the inevitable questions... they should provide the path to introductory_state_example.csv, right? Maybe make that explicit.
Para 13: I wonder if a screenshot showing the R Studio output of the California vs. New York graph would make sense, just so folks know they're doing it right (plus you get some pretty pictures!).

The rest all worked for me. The one addition I might suggest would be.. is there any simply graph you could use based on the early colleges data? In your introductory import examples we make some neat population graphs. Any quick way to maybe see the number of colleges over time? Or highlight the founding dates? That way we'd come around to the opening example in the early colleges one as well.

I think once you attend to this we'll be ready to begin the final stages. Let me know if you have any questions, comments, or think any of my suggestions are off base.

Again, thanks for your thoughtful revisions and looking forward to working with you on this!

nabsiddiqui commented 7 years ago

@ianmilligan1

Paragraph 9-I will go ahead and add a line stating the way to enter RStudio commands.
Paragraph 11- The paragraph currently reads "Since the data is in a csv file, we are going to use the read_csv() command in tidyverse’s readr package, which takes the path of a file we want to import from as a variable." Should I make it more explicit in the code itself or reword this?
Paragraph 13-There is supposed to be an image of RStudio's output, but I think I messed up when converting it to Markdown compared to RMarkdown. I will get this fixed.
I will add a small example using a bar graph to look at if the colleges before the US War of 1812 were secular or non-secular. I figured we already did a line graph so a different graph would be better. I'll make sure the output from RStudio is included also.

ianmilligan1 commented 7 years ago

Great. I think it should be a bit more explicit in the code, and reword to be more explicit – we often get feedback from readers when something doesn't work, even if they've been told to change the path. So as clear as possible is always best.

And this all sounds great. Let me know when it's ready, and feel free to push up to the repo or e-mail for me and I can do it. Whatever is easiest on your end!

ianmilligan1 commented 7 years ago

OK @nabsiddiqui ,I've updated the code.

A few more things:

can you do a final proof, please let me know any errors you catch;
we'll need a little image. Either of these: wrangling or soap stand out to you? Alternatively, feel free to suggest anything you might find in the four repositories found here.
Can you send me a short bio, 2-4 sentences?

Please ping me here!

nabsiddiqui commented 7 years ago

@ianmilligan1

It looks good to me. I really like the soap image actually so lets go ahead and go with that. As for the bio:

"Nabeel Siddiqui is a doctoral candidate in American Studies at the College of William and Mary, where he is also an inaugural fellow of the Digital Equality Lab. His research focuses broadly on the relationship between the humanities and computation, with a particular expertise on the cultural history of media, new media studies, digital humanities, information studies, and the history of science and technology. Currently, he is completing his dissertation entitled "Byting Out the Public," which looks at the historical relationship between the personal computer and the private sphere."

ianmilligan1 commented 7 years ago

OK great, thanks @nabsiddiqui – I'll start migrating it, probably over the weekend or early next week – we'll do one final check-through once it's on the production site, and then we'll be off to the races. Talk soon.

ianmilligan1 commented 7 years ago

Turns out our infant's nap time is especially productive this afternoon. Here's a preview of it over on the live site!

https://programminghistorian.org/lessons/data_wrangling_and_management_in_R

Let me know thoughts @nabsiddiqui, and then we'll start publicizing it on Monday. 🎉

nabsiddiqui commented 7 years ago

Looks awesome @ianmilligan1 ! Let me know if you need me to do anything further. Thanks again for all your help.

ianmilligan1 commented 7 years ago

Actually @nabsiddiqui – a quick thing. We caught this on the build on the main site, a broken link for https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html. Any thoughts on a sustainable replacement?

mdlincoln commented 7 years ago

(excuse my chiming in) This was the replacement vignette in a recent major version change for dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html

mdlincoln commented 7 years ago

PS I am super excited that we now have a proper dplyr lesson on PH 💯

ianmilligan1 commented 7 years ago

Awesome, thanks @mdlincoln – I'll update the link (let me know if that's not ok @nabsiddiqui!). And yes, I'm super excited too. 😄

nabsiddiqui commented 7 years ago

Sounds good to me. Thanks @mdlincoln for the kind words and the updated link.

ianmilligan1 commented 7 years ago

And hazzah, we're now published.

Thanks @nabsiddiqui for the lesson, and thanks to @nolauren and @greebie for their wonderful peer review comments. I'll begin the publicity, @nabsiddiqui, and any signal boosts much appreciated.

mdlincoln commented 7 years ago

FYI there's an issue with one of the CSVs for this lesson (they should all be hosted within our repo anyway)

https://github.com/programminghistorian/jekyll/issues/543

ianmilligan1 commented 7 years ago

Yes. I am working on this.

programminghistorian / ph-submissions

Review Ticket for Data Wrangling and Management in R #60

Anti-Harassment Policy