Closed nabsiddiqui closed 3 months ago
This is a very interesting tutorial that I think our audience will enjoy. There are two macro issues that need to be addressed before moving forward, plus some small typo suggestions:
Paragraph 1 in Introduction to YouTube Scraping and Analysis: Remove the second sentence
Paragraph 2 in Introduction to YouTube Scraping and Analysis: Remove the third word ("also")
Paragraph 3 in Introduction to YouTube Scraping and Analysis: Remove "both" in the first sentence
Paragraph 3 in Introduction to YouTube Scraping and Analysis: "with the formation of organizations such as the Association of Internet Researchers"
Paragraph 5 in Introduction to YouTube Scraping and Analysis: "Through this tutorial, you will learn how to access the YouTube API, process and clean the video metadata, and analyze the comment threads for ideological scaling."
Paragraphs 8 and 9 in Introduction to YouTube Scraping and Analysis: These two paragraphs are not needed
Paragraph 2 in Scraping the YouTube API: "beware" should read "be aware"
After Paragraph 3 in Configuring Your Code: A screenshot is needed here
Final Paragraph in Configuring Your Code: "on the Github repository" should read "in the GitHub repository"
Last two paragraphs in Configuring Your Code: These should be combined
Please let me know what your timeline is on this and any questions you may have @hawc2, @jantsen, and @nlgarlic.
Thanks @nabsiddiqui. I made the minor edits you mentioned, except Paragraph 8 and 9 seem worth keeping, perhaps as a footnote?
I also wasn't sure what screenshot we should insert for Paragraph 3 in Configuring your Code?
We will update the code to include more comments as you asked, and once we get the knitting to work correctly with the .rmd file, we will update the GitHub repo with the proper formatting for the markdown file.
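For reference, the knitting step itself boils down to a one-liner. A minimal sketch, with `lesson.Rmd` standing in as a placeholder filename:

```r
# A minimal sketch: knit the R Markdown draft to GitHub-flavored markdown
# ("lesson.Rmd" is a placeholder, not the actual filename in the repo)
library(rmarkdown)
render("lesson.Rmd", output_format = "github_document")
```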
We aim to be done with our edits next week. I'll update you when the file is ready for review.
Sounds good @hawc2. Let me know if you need anything else on my end. And yes, paragraph 8 and 9 can be kept as footnotes.
Hi @nabsiddiqui and @hawc2! Just checking in to see if you needed any help moving this lesson forward.
Hi - thanks for reaching out. We actually just met to go over the changes this week. We expect to have them ready in the next couple of weeks at the latest.
Nikki
Hey @svmelton. @hawc2 had requested some additional time to work on this in an email he sent to me. I should have probably communicated that in the issue tracker. But no, I think we are all good on moving the lesson forward as planned.
Hello all,
Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals
A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Please let me know if you encounter any difficulties or have any questions.
Very best, Anisa
@nabsiddiqui we now have a new topic for R for the menu; please include it as "r" in the topics in the lesson metadata. Thanks
Thank you, @jenniferisasi! I've added this to the YAML for you @nabsiddiqui.
@nabsiddiqui we’re excited to report we finished updating our YouTube scraping tutorial and it should now be ready for review: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Apologies for delays - after we submitted this lesson earlier in the pandemic, we discovered that there were a few sustainability issues, especially involving some of the libraries we were using for wrangling text data into WordFish. We've switched to quanteda, and in the process, we condensed the lesson and hopefully simplified/clarified a few sections. We've made some other updates as well in order to streamline the code, including removing specific directions for setting up access to the YouTube API, since those directions seem to be changing regularly, and we can link to the Google page with the directions.
We also removed a few options for granular scraping, including a way to search for videos through the API. We intend to provide some of these alternatives on our GitHub page for those who'd like to explore further. Near the end, we've reduced the number and complexity of visualizations, but if reviewers think it needs more, we could build that section out. Probably some steps in the newer version of the code need to be explicated further.
We look forward to getting feedback on this lesson!
Dr. Heather Lang @hlang264 and Dr. Janna Joceli Omena @jannajoceli have graciously agreed to serve as reviewers. We are shooting for a late August response for now @hawc2.
Thanks @nabsiddiqui.
One update - YouTube has created a streamlined path for researchers looking to access their API: https://research.youtube/
This might make some aspects of our tutorial easier, and may require some updating. We'll investigate how this process works and update our tutorial accordingly in the fall after we receive reviewer feedback.
Tutorial review by Janna Joceli Omena
Title: "Text Mining YouTube Comment Data with Wordfish in R"
GitHub link: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Editor: Nabeel Siddiqui
Overall evaluation
This tutorial uses a natural language processing algorithm (Wordfish) to conduct textual analysis of YouTube comment data. It presents a good overview of YouTube as a platform. However, it requires some improvement in clarifying the data collection method for the reader. Moreover, the tutorial could benefit from some reorganising of the order of sections and, in some cases, renaming of the headings and subheadings. Finally, bullet-point lists, annotated screenshots, gifs and short videos are recommended as pedagogic tools to consider for this tutorial.
Review statement
The review follows a bullet-point format, proposing suggestions, providing feedback and raising questions for the authors.
Part I: Introduction to YouTube Scraping and Analysis
Define web scraping and crawling, as the academic community still does not fully understand these data collection methods.
This section would benefit from one or two paragraphs presenting how YT has been studied and reviewing existing YouTube-related tutorials. This would help the authors to situate the proposed tutorial, explain its relevance to the reader, and show how it differs from others.
Is there a reference (i.e. a paper, GitHub repository, white paper) for the Wordfish algorithm? If so, I recommend that the author include it in the text. Moreover, all tools or scripts in use or mentioned in the text also deserve a proper reference ;)
Data collection methods refer to different technicalities; for example, web scraping, crawling and API calling have different features and functions to help scholars with the task of building a dataset. The authors mention that they have used the YouTube Data API to retrieve data from the platform. However, this section's title and subtitle use "scraping" as the method. So, what does this tutorial propose as the data collection method? Scraping the front-end interface of YT, or making calls to its API? With the former, one extracts data, while with the latter, one requests and retrieves data from an API. To help the reader understand and follow the tutorial and data collection method, I recommend the authors make this clear. If it helps, I'm happy to share the pdf files of two-part guides offering an overview of the knowledge needed to collect data using APIs (see: https://dx.doi.org/10.4135/9781529611441 and https://dx.doi.org/10.4135/9781529611458).
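To make the distinction concrete, an API call in R looks something like the sketch below (using the `tuber` package as one possible client; the credentials and video ID are placeholders), whereas a scraper would instead parse the rendered HTML of the watch page:

```r
library(tuber)  # one possible R client for the YouTube Data API

# Authenticate with OAuth credentials created in the Google Cloud console
# (both values below are placeholders)
yt_oauth(app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")

# An API call *requests and retrieves* structured data from the platform,
# rather than *extracting* it from the front-end interface
comments <- get_comment_threads(filter = c(video_id = "dQw4w9WgXcQ"),
                                max_results = 100)
head(comments)
```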
As for research ethics, AoIR provides good guidelines that the authors should consider including in the tutorial, thereby providing more concrete perspectives to the reader. For example: Markham AN, Buchanan E (2012). Ethical decision-making and internet research: recommendations from the AoIR ethics working committee (version 2.0). Retrieved from: http://aoir.org/reports/ethics2.pdf; and Markham A (2017). Impact model for ethics: notes from a talk. Retrieved from: https://annettemarkham.com/2017/07/impact-model-ethics/.
The subsection "Introducing the Wordfish Text Mining Algorithm" could have appeared sooner in the text, as it explains the main objectives of the tutorial and what is necessary to do for those interested in it.
Part II: Scraping the YouTube API
Please see my comments and suggestions on the data collection method, and reconsider the method named in the title. Keep in mind that one makes API calls to request and retrieve data (not to scrape data; for that, a web scraper would do the job) ;)
Maybe provide a table showing the API quota limits for retrieving comments via the YT Data API? This could be a valuable resource for the reader.
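For example, something along these lines (figures drawn from Google's quota documentation at the time of writing, so worth re-checking before publication; the default daily allowance is 10,000 units):

| Operation | Quota cost (units) |
| --- | --- |
| `commentThreads.list` | 1 |
| `comments.list` | 1 |
| `videos.list` | 1 |
| `search.list` | 100 |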
Bullet points and a short title can help when explaining step-by-step procedures, for example:
How to create YT credentials? 1. First, xxxxxxxx
Screenshots, gifs or short videos are often super helpful for these tasks.
Suggestion: maybe rephrasing this subtitle "Making a list of videos" to something like "How to create YouTube comments dataset?" Also, it would help to provide a visual protocol summarizing all possibilities for this type of dataset building (i.e. video or channel ids and keywords as entry points for retrieving comments), while using the video comments as a practical example. This visual protocol should also include the requirements for using predictive modelling.
The code chunk that combines video metadata with the comment text and comment metadata, while renaming some columns for clarity, is a nice proposal :)
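For readers skimming this thread, the kind of join being praised might look roughly like this (a sketch with `dplyr`; the data frame and column names are placeholders, not the lesson's actual code):

```r
library(dplyr)

# `comments` and `video_metadata` are placeholder data frames sharing
# a `videoId` key; the .x/.y suffixes come from the join's name clashes
comments_full <- comments %>%
  left_join(video_metadata, by = "videoId") %>%
  rename(comment_text      = textOriginal,
         comment_published = publishedAt.x,
         video_published   = publishedAt.y)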
Part III: Optimizing YouTube Comment Data For Wordfish
Recommendation: Use "Now that the comments are retrieved" rather than "scraped".
Beyond explaining how Wordfish models work (great job here!), I recommend the author provide a concrete and short example, so the reader can also perceive how it works in practical terms. This would help one to envision the methodological potentials of Wordfish.
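For instance, a toy run along these lines would let readers see the model's outputs at a glance (a minimal sketch using `quanteda` and `quanteda.textmodels`; the four mini-documents are invented):

```r
library(quanteda)
library(quanteda.textmodels)

# Four invented mini-documents standing in for comment threads
txts <- c(doc1 = "cut taxes shrink government spending now",
          doc2 = "lower taxes and less government regulation",
          doc3 = "expand healthcare and protect public services",
          doc4 = "fund healthcare schools and public investment")

dfmat <- dfm(tokens(txts))

# Fit Wordfish; dir = c(1, 4) anchors doc1 to the left of doc4,
# fixing the direction of the estimated latent dimension
wf <- textmodel_wordfish(dfmat, dir = c(1, 4))
summary(wf)  # theta: document positions; beta: word weights
```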
I wonder if Wordfish models read mentions (@name), hashtags, links, and emojis. These relevant and valuable objects and actions would bring more context and richness to the textual analysis. Instead, the analysis automatically ignores YouTube usage practices in comments by removing mentions, hashtags, links and emojis. I want to invite the authors to reflect on this matter.
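For context, the removal the reviewer describes typically happens at tokenisation, and keeping these features is a small change. A sketch with `quanteda` (`comment_corpus` is a hypothetical corpus object; argument behaviour is worth double-checking against the quanteda version used in the lesson):

```r
library(quanteda)

toks <- tokens(comment_corpus,
               remove_punct   = TRUE,
               remove_url     = TRUE,    # drops links
               remove_symbols = TRUE)    # drops most emoji

# quanteda's tokenizer keeps #hashtags and @mentions intact as single
# tokens, so dropping them is a separate, explicit (and reversible) choice:
toks_stripped <- tokens_remove(toks, pattern = c("@*", "#*"))
# Skipping that last line retains mentions and hashtags for analysis
```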
Part IV: Modeling YouTube Comments in R with WordFish
The opening section brings a detailed explanation of WordFish, comparing this model with topic modelling. It is an excellent subsection because it provides further information about the model. However, it should have appeared earlier, when the author introduces WordFish. At this point, the reader is expecting to run the model and check the outputs of WordFish for YT comments, learning how to read and interpret them.
I wonder if the authors could suggest an alternative visualisation to analyse the top words. Perhaps one that would avoid overlapping the words and facilitate interpretation.
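One possibility (a sketch, not the lesson's actual code): label only the strongest-scaling words and use `ggrepel` to nudge labels apart, assuming a fitted Wordfish object `wf` as in the toy example above:

```r
library(ggplot2)
library(ggrepel)

# Pull word-level estimates out of a fitted Wordfish object `wf`
word_df <- data.frame(word = wf$features,
                      beta = wf$beta,  # position on the latent dimension
                      psi  = wf$psi)   # word fixed effect (frequency)

# Plot all words as faint points, but label only the strongest scalers,
# letting ggrepel place labels so that none overlap
ggplot(word_df, aes(x = beta, y = psi)) +
  geom_point(alpha = 0.3, size = 0.5) +
  geom_text_repel(data = subset(word_df, abs(beta) > 2),
                  aes(label = word),
                  size = 3, max.overlaps = 20)
```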
Overall, I think the tutorial is very useful and well put together. I agree with many of Dr. Omena’s suggestions regarding clarity and usability. If following the tutorial one piece at a time, a user is likely to successfully gather and visualize this data. However, I think an alternate organization that begins with a clear justification for this method, overviews the functionality of each tool, and then ends in a more clear-cut set of directions would be useful. I also think more or different examples (see comments below) would add to the clarity of the tutorial and help researchers know when and how to employ these methods. I think most of my suggestions or comments come back to that idea: a tutorial is most useful when a user knows when and why to deploy it, not just how. I think more clarity in that regard would be useful.
Opening: It would be useful if there was a justification for why a person would need to know how to use R or how to scrape/crawl data. I think there is less need to justify YouTube as a site of research and more need to demonstrate to the audience what kinds of questions/problems this method of scraping and visualizing data might address. If the audience is scholars who don't know how to use these kinds of methods, they'll need to be able to make the connection between this method and their overarching goals.
P35-36: This section is a bit confusing. It's unclear how many words/comments would be needed to build a successful model; an example would be useful here. Maybe a link to a video whose comments work well, or a linked CSV that demonstrates suitable comments.
P38: There is guidance here about what is “better”, but what is “better” is determined by a researcher’s purpose and dataset. The use of “better” here is somewhat confusing to me because I am not sure what the goal is. If the goal is to model differential data points, then this makes more sense, but I think I’m confused about the purpose of the model.
P54: If a person is doing the tutorial as they are reading it, I think they’ll do ok with keeping up with the steps. However, it would be challenging to retrace steps or review steps quickly because steps are so embedded. It might be nice for there to be a clearer set of step-by-step instructions to revisit so that a person isn’t looking for signal phrases like “Now that you’ve finalized your list of videos, gathered the metadata and comments for each video, and optimized your data for Wordfish through filtering, you are ready to clean and wrangle the data!”
P62-63: It would be nice to get a clearer sense of which “approaches” are described here and when/why a person might choose one over another.
P87: While I understand that a user can manipulate the visualizations created in WordFish, these models don't feel especially useful to me because they are illegible. I think demonstrating some simplified visualizations might be useful and/or providing alternate formats where the visualization is more usable.
Thanks @jannajoceli and @hlang264 for your helpful feedback! @nabsiddiqui is there anything you were going to add that we should take into consideration before starting to revise?
@hawc2 I was just going to summarize what the other reviewers said, but I think you can go ahead and start revising based on these suggestions. If there is anything you think isn't particularly relevant, we can discuss that later if you don't want to put it into the revision.
hey all, just to say we've almost finished revising this lesson. I'm going to do some last tweaks and tests of the code next week, and then I'll ping you all when it should be ready for a final review.
A note to say that the lesson markdown file has been renamed, and the lesson slug adjusted accordingly.
I have also:
Thank you @anisa-hawes! @nabsiddiqui, I think this draft should be ready for final review by the peer reviewers.
I'm curious to hear how it goes when you test the code. It seems to be working on our end, but things can get especially wonky with the first part about accessing the API.
@nlgarlic and I are available to make further revisions and updates, and are happy to chat about any of our decisions here. We did include information on setting up an account to access the YouTube API, but we are still leaning a bunch on the Google documentation because it keeps changing.
Thanks everyone for your patience with the time it took us to update this lesson, and we look forward to next steps!
Hey @hawc2. It will likely be about another two weeks until I get to this, but I have placed a reminder in my calendar to come back to it soon.
@nabsiddiqui any updates on this lesson's timeline from here?
Hi there,
Thanks for your patience with this. I've just had a chance to review the changes, and I think the final draft is a great contribution. Thanks for your work on this--I'm excited to see how folks use it. I think this is ready to move on to the next stage.
Best wishes,
Heather
Hello @nabsiddiqui @hawc2 and everyone. Thanks for this lesson, it would be great to see it published - especially since people who, like me, can no longer easily collect Twitter data are looking for ways to collect other social media data. Thus, if I may, I would like to ask how the code could be adapted to collect data from the live chat comments sometimes available on the right side of the video (and not from the comments that appear under a video). I do not find a lot of resources out there explaining how to do this. Thanks a lot!
@nabsiddiqui I'll meet with my co-authors this week to discuss final revisions we will make to the lesson. We are aiming to complete revisions for this lesson within the next few weeks. Please let us know if there's anything additional we should take into consideration at this time. Partly what we aim to do is make sure nothing has changed in the YouTube API that could affect the lesson now.
@spapastamkou thanks for your encouraging thoughts! Agreed that it's valuable to offer tutorials on other social media platforms less restrictive than X/Twitter. I'll chat with my co-authors about your question regarding live chat comments. I remember us looking into this at one point and thinking it seemed doable, but outside the scope of this lesson. We could at least add some brief info about that in the lesson to point people in the right direction.
thanks @hawc2 !
I looked through the wording, etc. and I think we can now close out this portion of the review process. So, I think we are ready to start working on next steps @anisa-hawes and @hawc2
@nabsiddiqui @anisa-hawes just a heads up, we are making final edits on the lesson now. We've done some thinking and made a few changes to the part of the project requiring access to the Google API that I think will make the lesson much more sustainable.
Once this round of edits is done, I'll hand it over to @anisa-hawes for a final look over. @anisa-hawes I think it will be necessary for you to take on the ME role in this case, doing one last read through of the lesson for quality control, and giving us a last round of edits before it is sent to copyeditor for preparation to be published. Let's aim to publish this lesson in early 2024?
Hey Charlotte, I thought Anisa was going to do a read through and give us feedback first? It's ok if she wants to do it after copyedits, but I just want to flag that since I am the author on this, I can't do the final review as ME, so I was hoping Anisa could provide us feedback for one last round of review.

I also have a few edits I need to make, which I can make as soon as today if you are about to go into copyedits. Let me know if I can still do that or if I should wait. My recommendation in the future would be to double check with authors that they are done editing before bringing things into copyedits, as my last communication with Anisa via Slack had not suggested this was ready for copyedits quite yet.

Best, Alex
On Fri, Feb 16, 2024, 5:42 AM charlottejmc wrote:

Hello @hawc2, @jantsen, and @nlgarlic,

This lesson is now with me for copyediting. I aim to complete the work by ~Friday 08 March.

Please note that you won't have direct access to make further edits to your files during this Phase.

Any further revisions can be discussed with your editor @nabsiddiqui, after copyedits are complete.

Thank you for your understanding.
Hi @hawc2, yes, Anisa will be adding her feedback shortly!
I was a little hasty and initially posted a comment + opened a branch to start copyediting, but I quickly deleted both as I realised I wanted to wait for Anisa's comment before I began my work. I imagine the email updates from GitHub might not have reflected that on your side.
There's no rush for you to make edits just now.
My apologies for the confusion!
Hello @hawc2,
Thank you again for the opportunity to read this co-authored lesson, text-mining-youtube-comments. I think it is excellent. I can see that you've made some substantial revisions (https://github.com/programminghistorian/ph-submissions/commit/c4d044d0103461c93497ce066fadb2e10a779d42 and https://github.com/programminghistorian/ph-submissions/commit/bea9140a7fc2bb014a4555f6e22fdd2dd5c89461) since I shared my feedback with you by email.
Mapping my initial feedback against the current draft here on GitHub, I thought it might be useful to set out my remaining suggestions as optional tasks / ideas for you to think through and accept or reject. I'll check off the suggestions/points which I think you have already resolved.
Overall, I suggested that the lesson structure might be revised so that the table of contents is simplified. At the moment, the sub-sections are very granular and, from my point of view as a reader, this made navigation quite confusing. I think the following revisions could enable you to provide readers with a broader overview of the lesson as a whole, as well as clear signposts into specific parts. It seems to me that the high-level chapter headings could be:
- Introduction (comprising your overview of the method + ethical considerations)
- Data Collection (comprising set up + install + getting started with the tools to download metadata and comments)
- Cleaning and Modeling (comprising removing stopwords, applying filters to clean your data, modeling the columns)
- Analysis
- Visualisation
- Conclusions
- Endnotes
Anyway, these suggestions are grouped by section headings as they are (rather than paragraph/line number, because everything has shifted since I worked through this).
- Add in-text links between sections (`[mention of section](#section-title)`) and add definitions of technical terms
- "`stringr` package [...]" – reword as per suggestion
- `quanteda` ("at a later stage")
- "`write_csv` function below:"

> The Wordfish model lends itself to two different kinds of visualizations, one which depicts the scaling of documents, and the other the scaling of words. The below code will create 'word level' visualizations which display how terminology is dispersed across the corpus object.
>
> Our project uses custom visualizations, drawing from Wordfish's underlying statistics and utilizing `ggplot2`. To produce the first type of visualization, run the following code and produce a plot of all unique comment words within the corpus:
Thanks @anisa-hawes. Just to say we will try to get the rest of these changes done in the next couple weeks!
@anisa-hawes I've taken a shot at addressing a lot of these remaining edits, and I think we're pretty close to being done. I've checked off everything we've addressed so far. There's still a few content-related revisions we need to make, but we probably won't have time until later next week.
I'm wondering if this week you and @charlottejmc can start on the copyedits for this lesson, or at least help us with the remaining checkboxes here that you can address as easily as we can? I think at least 20 of the remaining items here are essentially copyedits.
In response to the edit suggestion for the #Interpretation section, I don't think it makes sense to bring commentary from the Visualization section up to the Interpretation section, as we are speaking about a different level of interpretation in that section. Let us know if you still think there's an issue with repetition.
We're mostly working now on your recommendations for the Ethical Considerations section. We might not do all of the edits you've suggested, but we'll try to make it work. Overall, the Introduction section with the two subsequent sections seems to work well to outline the main lesson argument and content. Or I might not be totally understanding what issues some of these remaining edits are trying to address.
The additions to the Downloading Comments and Metadata section make sense and we'll do that by next week.
Thank you, @hawc2. That makes sense.
Absolutely happy to assign this to Charlotte for copyediting this week. She will be pleased to support with resolving any remaining copyedit-like clarifications from the checklist too.
(We can re-read the Ethical Considerations sections when you've made any additional edits that you feel are suitable).
Hi @hawc2, @jantsen and @nlgarlic,
As Anisa mentioned, I'm very happy to start copyediting the lesson in its current state. I'll open a branch and work on my copyedits this week, including the points from Anisa's checklist which I can take on myself.
I think it would help keep things clear if you held off adding your own edits until I've prepared the Pull Request with my final copyedited version (I anticipate this should be ready by the end of this week, or Wednesday next week).
Once I've tagged you there, we could work together on integrating your outstanding changes to the clean copy in the copyedit branch, before merging everything at once. That should help avoid any conflicts with the main branch!
Hello @hawc2, @jantsen, @nlgarlic, and @nabsiddiqui,
I've prepared a PR with the copyedits for your review.
There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find instructions for how to reply to my in-line comments, and how to make the outstanding changes which I know you have still been working on.
Hello @hawc2, @jantsen and @nlgarlic,
Thank you very much for addressing my comments in the copyedit PR. I've now merged this in to update the lesson preview, which you can refer to in order to keep making the changes you are still working on.
I was not able to resolve all of my comments directly in the PR, so I'll list the outstanding points below to help guide your work:
alt="[Chart type] of [data type] where [reason for including chart]"
Thanks @charlottejmc. I've made a series of edits resolving some of the outstanding issues. My co-authors are going to reread the lesson in full one last time over the next week, and we'll try to finalize everything by mid-May.
In the meantime, can I send you an additional screenshot image and links to add? I can work with you on resolving some of these citation and formatting details.
Hello @hawc2,
Thank you for the opportunity to re-read this lesson. I’ve suggested a couple of very small line copyedits (https://github.com/programminghistorian/ph-submissions/commit/dd57637f60c57cd61f05a6cebe1ba465786f8a71), and have a final few comments to share below (paragraph numbers refer to the preview):
contents:
As discussed yesterday, I think that clarifying the three main sections of the lesson using the headings you suggested (Data Collection, Data Preparation, and Data Analysis) sounds very sensible, and I think it would benefit the reader. I tend to find granular sub-sections less useful, unless they delineate key actions within a section.
A suggestion could be:
## Introduction
## Data Collection
? ### Set up [Setting up your coding environment]
## Data Preparation
## Data Analysis
## Conclusion
## Endnotes
para 1:
I note this is an advanced (`difficulty: 3`) lesson. I would like to suggest that you include a sentence in paragraph one which establishes that. For example, something like: This lesson is aimed at those who have an established understanding of Natural Language Processing (NLP), and particularly those who have developed experience of textual analysis.
Reviewing the difficulty matrix, I think the key things are to be upfront about the prior knowledge and applied experience that are required. I think you can include specialist and technical terms confidently within that context, then explain concepts that could still be new.
Otherwise I think, for example, the final sentence of paragraph 1 marks a shift into a domain of knowledge that the initial sentences don't prepare me for. I'm not necessarily suggesting adding a link or integrating a definition of primary dimensions at this difficulty level (because this is above my learning-level, I don't know if it is 'new'), but that term immediately stands out to me as something which might need to be flagged as part of the prerequisite knowledge.
para 4-6:
Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.
para 9:
You refer to ‘recent scholarship’ and ‘a wide range of qualitative sociological studies’, which I think it would be useful to either cite, or collate as further readings.
para 11:
You describe the dataset as ‘expansive’. I just wanted to query this. Am I mis-remembering the size of the sample dataset? I recall that you’d previously mentioned a dataset of ~6 videos (provided that they have ~2000 comments combined).
para 1-16:
For the introductory paragraphs (from 1-16), I'd suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical 'This lesson will...'-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copyedits if you agree.
para 31:
I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.
para 70:
I note that the term `psi` values doesn't appear to have been mentioned previously, and isn't expanded/defined here.
para 103:
You give the example of adding the contraction didn't to your stopwords list, but in the previous step you mention that you have used the `tokens` function to remove punctuation. Would apostrophes in elided word forms be an exception?
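For what it's worth, a quick console check suggests contractions do survive `remove_punct` as single tokens (behaviour may vary across quanteda versions, so worth verifying):

```r
library(quanteda)

tokens("Well, I didn't like it!", remove_punct = TRUE)
## Expected (quanteda keeps infix apostrophes, so "didn't" stays one token):
## text1: "Well" "I" "didn't" "like" "it"
```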
figures 3 + 4:
I wonder if you might consider adding a fully zoomed-in view of each of the word clusters as additional figures? I found the interpretation of Figures 3 and 4 difficult to follow because the words are close to being indecipherable.
para 123:
I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.
figure 5:
I note here that the terms used as axis labels on the graph (theta and psi) are new to me, and don't appear to be explained in the lesson.
paras 133-135:
I wonder if you might take the opportunity to reflect on your finding (para 131) that the channel’s political affiliation does not seem to be a strong predictor of the commenters’ political positions. The questions in my mind: Are there learnings you could share about the specific things you might change (perhaps a different selection of videos, or a larger corpus of comment data)? Despite the 'polarized' political positions of the channels you selected from (according to the allsides ranking), were commenters equally engaged in discussion? Would comparison visualisations of comments responding to individual videos provide greater insight? What other kinds of research questions would be most usefully explored using this method?
Thank you @anisa-hawes! I think we're almost there. I've done another round of thorough edits, and we only have a few last issues to address (including some recent updates we identified to relevant libraries, which will affect the code ever so slightly). While we work on those last items, I'm handing edits back to you to address some of the outstanding issues you said you could handle.
You can go ahead and do these:

- para 31: I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.
- para 123: I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.
- para 4-6: Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.
One thing you said I wasn't quite sure about. You said: "para 1-16:For the introductory paragraphs (from 1-16), I’d suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical This lesson will-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copy edits if you agree." I think to some extent the current structure was based on your recommendation to put the key steps of the lesson clearly up front. I might be misunderstanding what reorg you intend to do here, but feel free to take a shot at it. There is some repetition between the intro and the Scholarly Research and Learning Outcomes sections in the Intro, but it doesn't seem to require a major reordering.
If you have ideas for how to further streamline the Table of Contents and headings, please go ahead and do those edits as well. We can review your changes and decide if we agree, as it's difficult to describe otherwise.
We also have a few more images to add, that I can send to you for uploading in the coming days/week. Hopefully we can be done with all edits on our end by June 6th at the latest!
Hi @hawc2,
I've added in the new images you sent over Slack. I've gone through and ironed out the filenames and numbers to get a little clarity: we now have 9 figures, numbered 01-09 (not 3a, 3b, etc.).
Just a couple notes:
The alt-text should go further than repeating the figure captions. We have found Amy Cesal's guide to Writing Alt Text for Data Visualization useful. This guide advises that alt-text for graphs and data visualisations should consist of the following:
alt="[Chart type] of [data type] where [reason for including chart]"
What Amy Cesal's guide achieves is prompting an author to reflect on their reasons for including the graph or visualisation. What idea does this support? What can a reader learn or understand from this visual?
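For instance, a first-draft alt-text for one of the Wordfish plots following that pattern might read (wording invented purely for illustration):

```
alt="Scatter plot of Wordfish word-level estimates where the horizontal spread of terms shows which vocabulary drives the ideological scaling of the comment corpus"
```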
The Graphs section of Diagram Center's guidance is also useful, with some key points (relevant to all graph types) that we can take away from it.
For general images, Harvard's guidance notes some useful ideas. A key point is to keep descriptions simple, and adapt them to the context and purpose for which the image is being included.
Would you feel comfortable making a first draft of the alt-text for each of the figures? This is certainly a bit time-consuming, but very worthwhile in terms of making the lesson accessible to the broadest possible audience.
Do let us know when you are ready to hand the lesson over for copyediting.
Thank you very much! ✨
@charlottejmc we have our final edits ready, I'll send them to you via email and you can go ahead with copyedits. Thanks!
Dear @hawc2,
Thank you for providing your final Phase 5 edits over email. I've now applied these to the lesson file. Just a couple points to go over together:
> The Wordfish model assigns two parameters to each word used in the corpus studied (beta and psi), and a similar two to each document (alpha and theta).

However, I still see 'psi' on the y-axis for Figure 9 itself. Would we need to update the image?

I also went back through the issue to look for any un-ticked checkboxes that may have been lost in the back-and-forth.
From Anisa's comment:
From my comment:
"[...] `comments.csv` files in the `ytdt_data` folder" – however, having downloaded `ytdt_data.zip`, I see that the `comments.csv` and `basicinfo.csv` files are nested within the folders for each videoID. Am I understanding correctly that the code will know to go through these videoID folders to find the `comments.csv` and `basicinfo.csv` files within?

From your comment:
I understand you may not want to make all of these changes (quite a few of them are simply suggestions), but it would be very helpful if you could give a final sign-off on what you think of them!
Thanks so much for your work on this. ✨
@charlottejmc we're working on a final round of edits of this draft, for you to take over for final copyedits by Wednesday, the 24th of July. I'm adding additional thoughts and comments here to address the remaining concerns and questions. There are a few details here that we are leaving for you and @anisa-hawes to decide whether they should be included during copyedits, such as specific references addressing gaps you mentioned.
[x] "Figure 9: At line 458, you say The Wordfish model assigns two parameters to each word used in the corpus studied (beta and psi), and a similar two to each document (alpha and theta). However, I still see 'psi' on the y-axis for Figure 9 itself. Would we need to update the image?"
[x] "On a similar note, would you be able to give a clearer explanation of what exactly alpha, beta, and psi are? You've also described them as 'model outputs' or 'parameters' interchangeably, which seems a little surprising."
[x] "Please can you confirm the asset link for the sample dataset you're providing readers, and confirm whether the code at line 199 refers to it correctly?"
This is correct as is. We performed a final check here, and the appropriate data was available within the GitHub zip file, with identical file structure. The list.files() function does search within directories and subdirectories.
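For anyone re-checking this later, the behaviour hinges on the `recursive` argument. A minimal sketch (folder and file names follow the YouTube Data Tools layout described above, not necessarily the lesson's exact code):

```r
library(readr)
library(purrr)

# Find every comments.csv nested inside the per-videoID subfolders
comment_files <- list.files("ytdt_data",
                            pattern = "comments\\.csv$",
                            recursive = TRUE,   # descend into videoID folders
                            full.names = TRUE)

# Read and stack them into a single data frame
all_comments <- map_dfr(comment_files, read_csv)
```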
[x] Below are 3 links to info on stemming and lemmatization, with a quick note on why we might or might not want to include each. We are fine with Wikipedia as an alternative. A lot of the example code out there uses Python; the tutorials in R were much less clear as to the actual substantive discussion of the terminology.

- https://ayselaydin.medium.com/2-stemming-lemmatization-in-nlp-text-preprocessing-techniques-adfe4d84ceee – a very short read that clearly distinguishes between the two approaches, following up with a bit of Python code.
- https://www.datacamp.com/tutorial/stemming-lemmatization-python – a longer read, but gives specific pros and cons to each approach, which I really like.
- https://keremkargin.medium.com/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768 – more of a step-by-step guide to NLP in Python, though the specific examples of stemming vs lemmatization come early on and are very clear and to the point.
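For an R-flavoured illustration of the difference (a sketch: `tokens_wordstem()` is quanteda's Snowball stemmer, while the lemma lookup table below is invented for demonstration):

```r
library(quanteda)

toks <- tokens("running runs ran better studies")

# Stemming: rule-based suffix stripping (fast, sometimes yields non-words)
tokens_wordstem(toks)
## e.g. "run" "run" "ran" "better" "studi"

# Lemmatisation maps words to dictionary forms instead, e.g. via a lookup:
lemmas <- c(running = "run", runs = "run", ran = "run",
            better = "good", studies = "study")
tokens_replace(toks, pattern = names(lemmas), replacement = lemmas)
## "run" "run" "run" "good" "study"
```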
[x] "IMPORT DATA - Share a template displaying the required column names + order (in raw Markdown) for readers who “choose to use a YouTube comment dataset downloaded with a tool other than YouTube Data Tools”. (If you let us know the column names + order, @charlottejmc and I could create this in raw Markdown for readers)."
The variables listed below are needed for the initial input – note that these are all character vector inputs, and that the additional columns utilized in the lesson are added by the lesson code itself.

Columns: 7
$ videoId
[...]

EDIT: considering the recent edits which removed the quoted sentence, we think it makes sense without the template columns.
Two small issues we noticed, which I'm flagging here; I'll add additional notes later if our edits bring up other questions:
Sincere thanks, @hawc2, for all the energy you, Nicole @nlgarlic, and Jeff @jantsen have dedicated to finalising your revisions.
This lesson is now with @charlottejmc for copyediting; she aims to complete the work by Friday 2nd August.
@hawc2, @nlgarlic and @jantsen please don't make further edits to your files during this Phase.
Any further revisions can be discussed after copyedits are complete.
Thank you for your understanding.
Hello @hawc2, @nlgarlic, @jantsen and @nabsiddiqui. I've prepared a PR with the copyedits for your review.
There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find brief instructions for how to reply to any questions or comments which came up during the copyedit.
When you're both happy, we can merge in the PR.
Thank you!
Hello @hawc2,
This lesson's sustainability + accessibility checks are in progress.
Publisher's sustainability + accessibility actions:
Authorial / editorial input to YAML:
- `difficulty:`, based on the criteria set out here
- `activity:` this lesson supports (acquiring, transforming, analysing, presenting, or sustaining). Choose one
- `topics:` (api, python, data-management, data-manipulation, distant-reading, get-ready, lod ["Linked Open Data"], mapping, network-analysis, web-scraping, website ["Digital Publishing"], r, machine-learning, creative-coding, web-archiving or data-visualization). Choose one or more. Let us know if you'd like us to add a new topic. Topics are defined in /_data/topics.yml.
- `alt-text` for all figures
- `abstract:` for the lesson
- an image to represent the lesson. The image must be:
  - copyright-free
  - non-offensive
  - an illustration (not a photograph)
  - at least 200 pixels in width and height
  Image collections of the British Library, Internet Archive Book Images, Library of Congress Maps as well as their Photos/Prints/Drawings or the Virtual Manuscript Library of Switzerland are useful places to search
- `avatar_alt:` (visual description of that thumbnail image)
- author bio(s) to be added to `ph_authors.yml` using this template:

- name: Forename Surname
orcid: 0000-0000-0000-0000
team: false
bio:
en: |
Forename Surname is an Assistant Professor in the Department of Subject at the University of City.
es: |
Forename Surname es profesor adjunto en el Departamento de Asignaturas de la Universidad de Ciudad.
fr: |
Forename Surname est professeur assistant au département des matières à l'université de City.
pt: |
Forename Surname é professor assistente no Departamento de Assunto da Universidade da Cidade.
Files we are preparing for transfer to Jekyll:

EN:

Promotion:

Please add suggested promotional posts to the `ph-evergreens-twitter-x` spreadsheet that has been shared with you, or email them to me at publishing.assistant[@]programminghistorian.org.

--

Hello Alex @hawc2, Nicole @nlgarlic, Jeff @jantsen, and @nabsiddiqui,
We're very close to publication!
A few final elements to confirm from Charlotte's comment above:
- name: Nicole 'Nikki' Lemire-Garlic
orcid: 0000-0002-8988-5188
team: false
bio:
en: |
Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.
- name: Jeff Antsen
orcid: 0000-0002-9787-7583
team: false
bio:
en: |
Jeff Antsen is the owner of Half-Moon Research, an independent consultancy specializing in market research and mixed-methods approaches.
If we have everything together, we could publish this mid-week next week.
Thank you, Anisa
Thanks, Anisa! I've attached a signed copy of the declaration form and here is my info:

Name: Nicole "Nikki" Lemire-Garlic
Orcid: https://orcid.org/0000-0002-8988-5188
Bio: Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.
Thank you, @nlgarlic!
Please could you email the copyright form to Charlotte at publishing.assistant[@]programminghistorian.org? (We don't receive attachments within replies to GitHub comments.)
--
As Alex @hawc2 is needing to play the role of both ME + co-author, I've done a final read-through of this lesson, and have made a few line edits: https://github.com/programminghistorian/ph-submissions/commit/25191046f4d5848ad5d45e51a8cdd2cf55d7dae9 (+ a small correction of my own mistake! https://github.com/programminghistorian/ph-submissions/commit/e9609b4ab64526c2c65941e9d9a92b064ecacd5a)
The Programming Historian has received the following tutorial “Text Mining YouTube Comment Data with Wordfish in R” by @hawc2, @jantsen, and @nlgarlic. This lesson is now under review and is available here:
http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.
Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.
Anti-Harassment Policy
This is a statement of the Programming Historian’s principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.