Closed ianmilligan1 closed 7 years ago
There is a small inaccuracy in the text. It says:
In R, you will find that looping is becoming increasingly less common due to the pipe operator.
This is not correct: looping is not becoming less used because of the pipe operator. Looping has never been a common practice in R because it uses vectorization which is much faster.
Only one small point: I think (if possible) you should always write dplyr with small letters. In R capital and small letters are of course different and if you try
library(Dplyr)
you will get an error.
Sorry, me again. I think you should also link the very good cheatsheet which can be found here.
Thanks @rogorido for your comments!
Just a heads up that I have lined up a peer reviewer. Just finding a second.
Hi there,
A couple of initial comments upon first reading.
The mtcars tutorial for a data-frame is pretty common -- I wonder if it just makes sense to show how to make a dataframe in your own dataset rather than re-doing the original tutorial?
Please be careful of broad-range judgements that are not backed up. For instance, in line 6 you say "The most common data type in R is a data frame." Technically, data frames are objects and not data types and certainly not more commonly used as vectors.
I think the tutorial could benefit from some editing for clarity. For example, line 15 seems is a bit of a non sequitur (What if the reader has no clue what Zotero is? Does it matter?) and could be cut without costing you anything on clarity, not to mention that line 16 is a great paragraph that gets right to the point of the section. Then line 17 gets into corpuses without defining them.
Since you promise a tutorial on dplyr, it be nice to see what it can do right off. Maybe start by showing what dplyr can do first and then break it down by explaining dataframes, pipes etc. ? I also think the current introduction does not really tell us what dplyr does. For instance, why would we use this instead of R's pretty robust notation for arranging data?
Overall, I think you do a good job of describing the main dplyr functions. I would like to see these emphasized sooner and the stuff on data.frames reduced more. I also think I'd like to see more about why dplyr would be something a historian would care about (how does this kind of data-wrangling help conducting historical analysis?)
Hello again. I hope my feedback is helpful to you! As promised, I've worked in a more detailed break-down:
Line 4. I think it would be nice to dedicate a couple of sentences on tidyverse, then suggest that you will focus on dplyr. It would just be a nice courtesy for the user who is being asked to install a package when they only want the dplyr library.
Line5. Instead of "R is not the best language for everything. It is designed specifically for exploratory data analysis or advanced statistical modeling" I would just write "R is best for exploratory data analysis and advanced statistical modelling." I don't think your reader will assume that R is the best language for everything.
"Specifically" is a word you can remove from this section. R has a lot of general purposes and is better at some things more than others. R has even been used to create video games!.
I don't think #1 is a good argument for R versus spreadsheets.
Lines 6-14: Since your tutorial is about tidying data, I don't think you should have an extended description of data frames. However, noticing that the R-Basics tutorial uses a data frame without saying "data frame" I can understand your conundrum.
Lines 9-26: I like the pipe tutorial. That we are using "magrittr" all of the sudden (I thought this was about dplyr) might throw the reader. Again, letting people know about all the packages would be great. Especially because they have such wonky names. Knowing that this package is after René Magritte (of "Ceci n'est pas une pipe" fame) might help them remember what the package is for. This is not a pipe -- get it? :)
As discussed in my earlier comment, I think it'd be more readable to see you using pipes to solve real historical problems rather than the same old mtcars tutorial.
Line 27-31: Since this is an intermediate tutorial, I'd really like to see the historical datasets early on.
Line 33: I'm not sure that the section on conditional operators help, especially since you've recommended that the reader have some familiarity with programming. As it stands, it breaks up the description of dplyr and one of the commands ("select) which is confusing.
Line 37: Instead of "pass a conditional" maybe you could write "remove data using a test requirement." Semantically, filtering is more about refining data than it is about operators.
Overall I really enjoyed the last section. I think there could be more detail included about how tidyverse helps organize data and save time -- something that almost every researcher could benefit from. Management issues with data are very frustrating and can waste loads of time. Something like tidyverse can really help people avoid the usually pitfalls of working with wide datasets.
Wonderful, thanks for your review @greebie.
@nabsiddiqui – I have lined up another peer reviewer. It's a busy time of term so I have assigned them a deadline of 17 May. Once I receive both reviews, I'll close discussion and sum up the reviews rec'd to date.
Thanks @greebie, @rogorido, and @ianmilligan1. I will begin working on the changes that are already posted, and I look forward to the aditional comments on May 17.
Thank you for the tutorial. It offers an introduction to data frames, tidy data, and then the kinds of exploratory data analysis possible with dplyr. While the tutorial offers needed areas of coverage for the Programming Historian, it could benefit from a clearer audience and structure as well as the building out of several areas. The following feedback is provided to help with these areas.
Audience: It is unclear the exact audience for this tutorial. It moves back and forth between assuming the user is a beginner and intermediate programmer. The tutorial should assume just one audience.
Structure: It seems that the tutorial is trying to do three things. First, it is trying to explain the concept of data frames since it is not covered in the R Basics tutorial. Then, it is attempting to introduce the idea of tidy data followed by doing exploratatory data analysis with dplyr. For structure, I'd recommend the tutorial begin by explaining the goal and purpose of tidy data and dplyr and give a quick example as stated by @greebie. Next, I would introduce the idea of data frames, because they are necessary and aren't covered in the R Basics tutorial. After, I'd move into the details of how to use tidy data and dplyr.
Build Out:
In the current version, the tutorial doesn't explain the theory or ideas behind tidy data or dplyr. These should be introduced at the beginning and reiterated throughout. The initial example should also be historical so the user knows immediately why this method might help.
When covering the details of tidy data and how to use dplyr, the tutorial needs to further explain the key concepts and make it clear why one might want to use them. For example, what can we learn if we filter or mutate the data?
The changes suggested are to help the tutorial meet the needs of historians and humanists more broadly. Otherwise, the tutorial risks being a brief and unclear version of Hadley Wickham's R for Data Science (Section 5: Data Transformation): http://r4ds.had.co.nz/transform.html#introduction-2. I look forward to the next version of this tutorial!
Wonderful, thanks so much @nolauren. As always it is extremely appreciated, especially given it's such a busy time of the year.
@nabsiddiqui – we now have two reviews. I'll shortly post my own editorial comments, but I think we've got a great pathway forward.
Thanks @nolauren and everyone else. I look forward to your comments @ianmilligan1, and I hope to update the article soon thereafter.
Thanks again for your tutorial, and to @greebie and @nolauren for their thoughtful reviews (and to @rogorido's points as well).
I think we've got some good overall pointers from the two reviews. I think it is open to whether you want it to be a beginner or intermediate level. My own gut, given that it requires some prior knowledge and it complements our other introductory R offerings, is that this might be worth pitching at an intermediate level.
In terms of structure, I think both reviewers flagged wanting a bit more introductory work. A bit more on dataframes, dplyr, pipes, and emphasizing the dplyr functions (while keeping it all solidly on track to be relevant to historians). I think by combining the comments, @nolauren provides a good structural suggestion which can be fleshed out with @greebie's particular points.
The build-out comments and specific comments are all worth attending to, and I think can be done without too much work.
What's a feasible turnaround for a new version incorporating these suggestions? Would two or three weeks be feasible, @nabsiddiqui?
Dear Ian,
I am a little bit booked until the end of this month, and then I will be in HILT for the first week of June. I was hoping to start revising sometime there after. Unfortunately, that means an end of June time frame. If that is too late, let me know and I'll see if I can move some stuff around.
Sure thing! I completely understand the constraint on our time. Late June works well. Could we commit to getting this submitted by 4 July?
Sounds good. I will aim for that and let you know if anything comes up to delay it.
Just an update: I've chatted with the author and we'll be looking at a late July revision. Looking forward to it!
Hello all,
I now revised the submissions and noted my changes below. Thank you again for the initial feedback, and please let me know if there is anything else you would like to see revised further.
Based on requests by @nolauren :
Based on requests by @greebie :
Based on comments by @ianmilligan1 :
Hi @nabsiddiqui –
Thanks so much for your thoughtful revisions, which I think work really well. In particular, I think adding in your own important function works well; cleaning up some of the prose based on the peer review requests, and I think after working through the lesson again it works without the data frames section. I think the dplyr focus works well. You did an excellent job of talking about tidy data, and I think the focus on R is natural.
I've gone through the lesson and here are some quick comments inline:
introductory_state_example.csv
, right? Maybe make that explicit.The rest all worked for me. The one addition I might suggest would be.. is there any simply graph you could use based on the early colleges data? In your introductory import examples we make some neat population graphs. Any quick way to maybe see the number of colleges over time? Or highlight the founding dates? That way we'd come around to the opening example in the early colleges one as well.
I think once you attend to this we'll be ready to begin the final stages. Let me know if you have any questions, comments, or think any of my suggestions are off base.
Again, thanks for your thoughtful revisions and looking forward to working with you on this!
@ianmilligan1
Great. I think it should be a bit more explicit in the code, and reword to be more explicit – we often get feedback from readers when something doesn't work, even if they've been told to change the path. So as clear as possible is always best.
And this all sounds great. Let me know when it's ready, and feel free to push up to the repo or e-mail for me and I can do it. Whatever is easiest on your end!
OK @nabsiddiqui ,I've updated the code.
A few more things:
Please ping me here!
@ianmilligan1
It looks good to me. I really like the soap image actually so lets go ahead and go with that. As for the bio:
"Nabeel Siddiqui is a doctoral candidate in American Studies at the College of William and Mary, where he is also an inaugural fellow of the Digital Equality Lab. His research focuses broadly on the relationship between the humanities and computation, with a particular expertise on the cultural history of media, new media studies, digital humanities, information studies, and the history of science and technology. Currently, he is completing his dissertation entitled "Byting Out the Public," which looks at the historical relationship between the personal computer and the private sphere."
OK great, thanks @nabsiddiqui – I'll start migrating it, probably over the weekend or early next week – we'll do one final check-through once it's on the production site, and then we'll be off to the races. Talk soon.
Turns out our infant's nap time is especially productive this afternoon. Here's a preview of it over on the live site!
https://programminghistorian.org/lessons/data_wrangling_and_management_in_R
Let me know thoughts @nabsiddiqui, and then we'll start publicizing it on Monday. 🎉
Looks awesome @ianmilligan1 ! Let me know if you need me to do anything further. Thanks again for all your help.
Actually @nabsiddiqui – a quick thing. We caught this on the build on the main site, a broken link for https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html. Any thoughts on a sustainable replacement?
(excuse my chiming in) This was the replacement vignette in a recent major version change for dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html
PS I am super excited that we now have a proper dplyr lesson on PH 💯
Awesome, thanks @mdlincoln – I'll update the link (let me know if that's not ok @nabsiddiqui!). And yes, I'm super excited too. 😄
Sounds good to me. Thanks @mdlincoln for the kind words and the updated link.
And hazzah, we're now published.
Thanks @nabsiddiqui for the lesson, and thanks to @nolauren and @greebie for their wonderful peer review comments. I'll begin the publicity, @nabsiddiqui, and any signal boosts much appreciated.
FYI there's an issue with one of the CSVs for this lesson (they should all be hosted within our repo anyway)
Yes. I am working on this.
The Programming Historian has received the following tutorial on 'Data Wrangling and Management in R' by Nabeel Siddiqui (@nabsiddiqui). This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/lessons/data_wrangling_and_management_in_R
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.