programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions
135 stars 112 forks source link

Review Ticket for 'Basic Text Processing in R' #34

Closed jerielizabeth closed 7 years ago

jerielizabeth commented 7 years ago

The Programming Historian has received the following tutorial on 'Basic Text Processing in R' by @statsmaths and @nolauren. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/basic-text-processing-in-r

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

jerielizabeth commented 7 years ago

Thank you, @nolauren and @statsmaths for this lesson! I think the lesson is ready as is for external reviewers.

I am soliciting for reviewers currently, and hope to have the reviews submitted by the end of January (leaving a little extra time for the holidays).

acrymble commented 7 years ago

I'm really pleased to see another R lesson to compliment our current offering.

Just one very minor point from me: You don't mean 'digitised' text. You mean 'machine readable' text. Digitised could merely mean it's been scanned and is in image format.

walshbr commented 7 years ago

Am loving the lesson so far! I ran into a snag in paragraph 37 when using R to read in the text files from where they're hosted on the Programming Historian staging server. This bit:

files <- sprintf("%s/sotu_text/%03d.txt", base_url, 1:236)
text <- c()
for (f in files) {
  text <- c(text, paste(readLines(f), collapse = "\n"))
}

Ran through the lessons a couple times, copy and pasting all the code snippets, etc. @jerielizabeth said she wasn't having this problem, so we thought throwing it up here would be helpful. The output of 'files' looks something like this:

> files
  [1] "http://programminghistorian.github.io/ph-submissions/assets/basic-text-processing-in-r/sotu_text/001.txt"
  [2] "http://programminghistorian.github.io/ph-submissions/assets/basic-text-processing-in-r/sotu_text/002.txt"
  [3] "http://programminghistorian.github.io/ph-submissions/assets/basic-text-processing-in-r/sotu_text/003.txt"
  [4] "http://programminghistorian.github.io/ph-submissions/assets/basic-text-processing-in-r/sotu_text/004.txt"

And each of those links resolves as expected. My suspicion after experimenting more is that github is throttling my connection thinking that I'm a bot. I've been running a counter over the loop to see how far it is getting before it fails, and it's variable each time. Here are the points in the loop at which it has failed the last few times:

106, 47, 49, 100, 62, 51, 70

Adding a random sleep timer between scrapes didn't really seem to help that much, though I'm giving it a few spins before giving up on that entirely. Any thoughts on what might be going on?

jerielizabeth commented 7 years ago

Status update: we have two reviewers in line for this piece with the goal of having the reviews completed by the end of January (allowing for the holidays).

Once both reviews are in, I will summarize and provide any additional commentary for the authors to work from in revising the lesson.

shawngraham commented 7 years ago

Hi folks - just wanted to say that I found this lesson really useful. If I might make a suggestion - while I know that this is meant more for an intermediate user, it'd be useful perhaps to make a gesture somewhere to how a user might ingest their own texts from a local location. It might also be handy in the further directions to point towards some of the vignettes within the tidyverse, eg tidytext https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

Thanks again,

Shawn

jerielizabeth commented 7 years ago

@bmw9t I am still struggling to duplicate that particular error. Maybe a little more information would help: are you getting any error messages in R that you could share here?

walshbr commented 7 years ago

@jerielizabeth Ah sorry - thought I had pasted the error message in here. The only error message R was giving me was this:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  unable to connect to 'programminghistorian.github.io' on port 80.

Might have been something with the particular settings on our firewall or something? I suggest that because now I'm on a different network in a different state, and those lines were able to read everything in from GitHub without any problems. I'll check when I'm not traveling and back on the home network. Can also try when I'm back on W&L's network.

walshbr commented 7 years ago

Thanks, @jerielizabeth for facilitating and to @statsmaths and @nolauren for writing this lesson.

I really enjoyed the tutorial! I think it's a great piece, and it's a lovely introduction to basic text analysis with R. I have several point-by-point things below that address particular line numbers as relevant. And then at the end of this comment are just a few typos that I noticed going through. I'm looking forward to being able to point students and faculty to this work in the future.

The bulk of the comments below pertain to comment number two and are just asking for glosses on pieces that appear to be assumed knowledge in the text. Nothing especially substantive. I tended to err on the side of glossing a bit more than the tutorial was originally, though obviously it's your call how much you would like to explain things if you're expecting an intermediate user. For me, I suppose it's a question of "intermediate in what skills relevant to the lesson?" since we're talking about more than just R. I can imagine people with varying skill levels in R, programming more generally, and text analysis being interested in the lesson for different reasons. I tended to assume that you wouldn't need to explain the particular syntax of R snippets, but that it could still be a good idea to give really short glosses of other terms or concepts as you go for people with other backgrounds. I made notes as such below, though take or leave them as you see fit.

Notes

1) I think having a few notes or a link about where to go to install R would be good. You don't need to explain the R syntax, since you make it clear they can get that elsewhere. But giving the installation link would be helpful for an ambitious person trying the lesson out as a first attempt at R.

2) I know you’re expecting a bit more advanced reader, someone familiar with R. But I think you could more consistently gloss some of the statistical methods/terms that you mention. I wouldn’t worry about getting lost in the weeds, but I think it might help a novice reader just to give the one sentence, hand-wavey definition of things. I wouldn’t worry about glossing R syntax, and I also think it’s fine to leave out the glosses during the introduction, since you’ll get them later.

3) In line 4 I would give the command to start R on your computer. That way even if people with no background can follow along and then learn the syntax somewhere else. Mostly thinking about myself here since I am more familiar with other programming languages and was learning R through the lesson.

4) In line 5, I got the prompt:

Do you want to install from sources the package which needs compilation?
y/n:

I think the response to that is implied by the text, but I might explicitly say that you’ll be asked to confirm the installations.

5) During line 5 I got this warning while installing tidyverses:

The downloaded source packages are in
    ‘/private/var/folders/jl/l0cs5mls1l1587v859x370r00000gq/T/RtmpRJfKau/downloaded_packages’
Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found

Looks like it’s not a big deal, since I was able to run through the lesson no problem. But might be worth mentioning it or (more likely) pointing to a place (like stack overflow) where people can troubleshoot installation errors.

Similarly I got a warning when loading tidyverse:

Warning message:
package ‘ggplot2’ was built under R version 3.3.2

Might be good to tell people to ignore warnings, or distinguish between them and errors.

6) line 8 might be a good moment to just say something like "…will print out the paragraph of text verbatim because 'text' now stores the document inside it." or something like that.

7) line 9 - I would say something about the output of words: [[1]] [1] "now" "i" "understand" "that" "because" [6] "it's" "an" "election" "season" "expectations" [11] "for" "what" "we" "will" "achieve" [16] "this" "year" "are" "low" "but"

Etc. That's different than how other languages would format the output of something like this, so it might be useful just to say something quick like "you'll get the tokenized output of the text along with the positions of the tokens."

And in general, the tutorial is a bit inconsistent about dealing with console output. For the most part, it doesn't actually display the output beyond the graphs later in the lesson. But I might think about adding in the expected output more often/more consistently. After all, the text of the lesson does reference the console output a fair bit. Not especially necessary, but it could be helpful hand-holding.

7) I think line 12 is a good model for the balance of hand-waving and hand-holding that I've been suggesting elsewhere.

8) data frame in line 14 might be helpfully glossed as "a type of data structure" or something like that. I don't think you need that for 'list' because that's pretty intuitive.

9) Line 16 - what is a tibble?

10) I think I would give the rearranged output as well. Particularly because (slightly out of the scope perhaps), it seems like an opportunity to namecheck the concept of stopwords.

11) line 19 - If you are a little more verbose on the output for number 7 above, I think you don't need to worry about parsing the output again here. So it's fine as is.

12) I would gloss character vector.

13) Probably an obvious point, but maybe want to change the URL for the base_url at line 24 to the production resting place of the text when you switch this off of Programming Historian's staging site. I would also just give a comment-level gloss of what the second and third sentences are doing for those following lines of code.

14) line 27 could also be an opportunity to namecheck stopwords again.

15) I think lines 27-8 could use a few more sentences of explication - I got a little lost when using the Web Trillion Words Corpus. I think a lot of it might be helped by expanding the phrase "these frequencies". These frequencies are the most common frequencies in sites indexed in google, right? So you're using that corpus to generate a list of high-frequency words, then seeing the frequency hits for the tokens of our own corpus? I would just expand on how you're using that WTWC and how it relates to the own example text analysis. Since it seems that you're definitely heading in the direction of filtering stopwords (after reading a little further), I think introducing the concept earlier as well as why you would want to filter them out will help clarify what the section is doing.

16) After "along with the option n set to 15 is used" I would add something like… "to print out more than the default 10 values." For the beginners in the room like me.

17) line 37 - I sometimes get an error saying R can't open the connection:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  unable to connect to 'programminghistorian.github.io' on port 80.

But when I am on a different network the files are able to load fine, so might not be something to worry about.

18) line 46 - "complex dimensionality reduction algorithms" - might be worth giving another phrase to gloss briefly what this means, though you get to it a bit later.

19) line 50 - This paragraph was a bit hard for me to follow. I think it might help to explain what you mean by "summarize all of the lengths within a document." What are different ways to summarize lengths?

20) In line 60 - "applying dimensionality reduction techniques to plot stylistic tendencies over time or across multiple authors." This is the sort of gloss that I was suggesting in point 18 above, so it might be better deployed the first time you mention the concept.

Typos:

That's it for me, but I'm happy to discuss any of these further! Really great work.

jerielizabeth commented 7 years ago

Thank you @bmw9t and @shawngraham!

@statsmaths and @nolauren, once we have the second review, I'll provide some summary comments and give guidance on how to proceed with revisions. You are welcome to respond to the feedback here on GitHub as it arrives as well, though please save updating the tutorial until after the final review is in place.

histlib commented 7 years ago

In general, I found this tutorial to be useful (and helped me a bit since I am new to tidyverse). I did want to see more explanation of all of the code, especially since that would help users of the tutorial have a better sense of how they might apply this in different contexts.

1: "raw, digitized text." "involves only using a tokenizer" - implies that all the tutorial will cover is tokenization 3: thinking about appending clause to "covers all of the assumed knowledge" so that what's assumed is clear (= installing/starting R, packages, importing data, executing programs ...). 4: cite the two packages so that there are links to CRAN perhaps add screenshot and a bit more handholding:

jerielizabeth commented 7 years ago

Thank you, @histlib, for your comments and feedback on the lesson! There is a lot here between the two reviews, so I will take a little bit to go through it and offer my own recommendations to the authors. I will post my comments here within a week.

Thank you again to both @bmw9t and @histlib for the reviews! While additional comments are always welcome, the review phase for this lesson is now closed.

jerielizabeth commented 7 years ago

Thank you again, @histlib and @bmw9t for these excellent reviews! and thank you @shawngraham for the suggestion of additional information on how to use a new corpus for analysis.

@nolauren and @statsmaths, it seems that the overall consensus in the reviews is that some more explanatory text, both with regards to R syntax and with regards to statistical concepts, would help people using this lesson better understand not only how to use these libraries, but how those libraries are working. We should probably decide what difficulty to set this lesson at. I think it makes a good intermediate lesson, but since we have very little R-focused materials on the site, I agree with the reviewers that additional explanation would be useful. Screenshots of outputs so that users know that they are on the right track would be a useful addition throughout the lesson. I think it would also be useful to state that this lesson assumes that the reader is working in the R Console (and include a version range that this lesson should work with). I also think the opening sections would benefit from some explanation about the packages — what they do and what they include — especially to help people think critically about these functions and to avoid a reliance on “magic”.

To Shawn’s point that it would be helpful to illustrate how one could use a different dataset, would it be possible to provide the text files for download as part of the lesson and discuss any formatting requirements that might need to be done to do the same process on a different set of texts? I am in favor of having people pull in from the server in the lesson to avoid inevitable problems of people renaming or saving the files to an unexpected location. Perhaps as an optional “if you are interested in examining the files, you can download them here” and then provide a code example of how to load in from a local directory down in paragraph 24. That would also address Brandon’s trouble accessing the GitHub servers while on a campus network. And yes, I’ll make a note that we need to update the urls when we move the lesson over to the main site.

As you work through the rest of the feedback from the reviewers, I will leave this space open for you to ask for clarification from myself or from them about suggested edits or particular areas for additional explanatory text.

Our recommended timeframe for revisions is 4 week, which would put us at March 1. Let me know if you need more time than that.

Thank you again for this lesson, I think it will make an excellent addition to the Programming Historian. And thank you again to the reviewers for the thorough and helpful feedback!

statsmaths commented 7 years ago

Thank you @bmw9t, @histlib, @shawngraham, and @jerielizabeth for all of the very helpful comments and feedback. We have pushed several commits that address the various points made by both reviewers. Details are below:

@bmw9t:

@histlib:

@shawngraham:

@jerielizabeth:

jerielizabeth commented 7 years ago

Hi all!

Thank you -- the edits look great and I think address the reviewers' concerns within the constraints of keeping the tutorial narrowly focused and accessible.

I've done some copy editing in preparation for publication and have submitted those to you as a pull request. Let me know if you run into any trouble with that format, and I can send them another way. There were two additional places where I thought the console log would be useful, and one place (para 43) where I think a little more information about how the metadata lines and the text files are related would be useful for the beginning user.

Once you are happy with the edits, I'll give it one more read through to catch any straggling typos and then I think we'll be ready to publish!

statsmaths commented 7 years ago

@jerielizabeth Reviewed and accepted your pull request. Looks great on our end to go forward. Please flag if anything else in needed on our end.

jerielizabeth commented 7 years ago

Excellent! Thank you @statsmaths!

I'm getting started on the final edits and prepping to move this lesson over at the beginning of next week (for maximum eyes.) Any thoughts on this for a cover image: https://flic.kr/p/odcdqu

jerielizabeth commented 7 years ago

Also, @statsmaths and @nolauren -- can you send me a short bio for each of you, either via email or in here. One sentence if possible, two tops. Thanks!

jerielizabeth commented 7 years ago

Hi all!

The lesson is now live on the site at http://programminghistorian.org/lessons/basic-text-processing-in-r. If anyone sees any errors, let me know and I'll fix them!

Once we're ready to promote the lesson, I'll close this issue. That will be the signal to tweet widely!

Thank you all again for your work on this! Thanks to you, we have another excellent addition to the offerings on the Programming Historian.

ianmilligan1 commented 7 years ago

🎉

walshbr commented 7 years ago

All looks good to me! Looking forward to promoting and congratulations!

jerielizabeth commented 7 years ago

I concur! Basic Text Processing in R is official published!