Closed nabsiddiqui closed 3 months ago
This is a very interesting tutorial that I think our audience will enjoy. There are two macro issues that need to be addressed before moving forward, plus some small typo suggestions:
Paragraph 1 in Introduction to YouTube Scraping and Analysis: Remove the second sentence
Paragraph 2 in Introduction to YouTube Scraping and Analysis: Remove the third word ("also")
Paragraph 3 in Introduction to YouTube Scraping and Analysis: Remove "both" in the first sentence
Paragraph 3 in Introduction to YouTube Scraping and Analysis: "with the formation of organizations such as the Association of Internet Researchers"
Paragraph 5 in Introduction to YouTube Scraping and Analysis: "Through this tutorial, you will learn how to access the YouTube API, process and clean the video metadata, and analyze the comment threads for ideological scaling."
Paragraphs 8 and 9 in Introduction to YouTube Scraping and Analysis: These two paragraphs are not needed
Paragraph 2 in Scraping the YouTube API: "beware" should read "be aware"
After Paragraph 3 in Configuring Your Code: A screenshot is needed here
Final Paragraph in Configuring Your Code: "on the Github repository" should read "in the GitHub repository"
Last two paragraphs in Configuring Your Code: These should be combined
Please let me know what your timeline is on this and any questions you may have @hawc2, @jantsen, and @nlgarlic.
Thanks @nabsiddiqui. I made the minor edits you mentioned, except Paragraph 8 and 9 seem worth keeping, perhaps as a footnote?
I also wasn't sure what screenshot we should insert for Paragraph 3 in Configuring your Code?
We will update the code to include more comments as you asked, and once we get the knitting to work correctly with the .rmd file, we will update the GitHub repo with the proper formatting for the markdown file.
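For reference, the knitting step itself boils down to a one-liner. A minimal sketch, with `lesson.Rmd` standing in as a placeholder filename:

```r
# A minimal sketch: knit the R Markdown draft to GitHub-flavored markdown
# ("lesson.Rmd" is a placeholder, not the actual filename in the repo)
library(rmarkdown)
render("lesson.Rmd", output_format = "github_document")
```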
We aim to be done with our edits next week. I'll update you when the file is ready for review.
Sounds good @hawc2. Let me know if you need anything else on my end. And yes, paragraph 8 and 9 can be kept as footnotes.
Hi @nabsiddiqui and @hawc2! Just checking in to see if you needed any help moving this lesson forward.
Hi - thanks for reaching out. We actually just met to go over the changes this week. We expect to have them ready in the next couple of weeks at the latest.
Nikki
Hey @svmelton. @hawc2 had requested some additional time to work on this in an email he sent to me. I should have probably communicated that in the issue tracker. But no, I think we are all good on moving the lesson forward as planned.
Hello all,
Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals
A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Please let me know if you encounter any difficulties or have any questions.
Very best, Anisa
@nabsiddiqui we now have a new topic for R for the menu; please include it as "r" in the topics in the lesson metadata. Thanks
Thank you, @jenniferisasi! I've added this to the YAML for you @nabsiddiqui.
@nabsiddiqui we’re excited to report we finished updating our YouTube scraping tutorial and it should now be ready for review: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Apologies for delays - after we submitted this lesson earlier in the pandemic, we discovered that there were a few sustainability issues, especially involving some of the libraries we were using for wrangling text data into WordFish. We've switched to quanteda, and in the process, we condensed the lesson and hopefully simplified/clarified a few sections. We've made some other updates as well in order to streamline the code, including removing specific directions for setting up access to the YouTube API, since those directions seem to be changing regularly, and we can link to the Google page with the directions.
We also removed a few options for granular scraping, including a way to search for videos through the API. We intend to provide some of these alternatives on our GitHub page for those who'd like to explore further. Near the end, we've reduced the number and complexity of visualizations, but if reviewers think it needs more, we could build that section out. Probably some steps in the newer version of the code need to be explicated further.
We look forward to getting feedback on this lesson!
Dr. Heather Lang @hlang264 and Dr. Janna Joceli Omena @jannajoceli have graciously agreed to serve as reviewers. We are shooting for a late August response for now @hawc2.
Thanks @nabsiddiqui.
One update - YouTube has created a streamlined path for researchers looking to access their API: https://research.youtube/
This might make some aspects of our tutorial easier, and may require some updating. We'll investigate how this process works and update our tutorial accordingly in the fall after we receive reviewer feedback.
Tutorial review by Janna Joceli Omena
Title: "Text Mining YouTube Comment Data with Wordfish in R"
GitHub link: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Editor: Nabeel Siddiqui
Overall evaluation
This tutorial uses a natural language processing algorithm (Wordfish) to conduct textual analysis of YouTube comment data. It presents a good overview of YouTube as a platform. However, it requires some improvement in clarifying the data collection method for the reader. Moreover, the tutorial could benefit from some reorganising of the order of sections and, in some cases, renaming of the headings and subheadings. Finally, bullet-point lists, annotated screenshots, gifs and short videos are recommended as pedagogic tools to consider for this tutorial.
Review statement
The review follows a bullet-point format, proposing suggestions, providing feedback and raising questions for the authors.
Part I: Introduction to YouTube Scraping and Analysis
Define web scraping and crawling, as the academic community still does not fully understand these data collection methods.
This section would benefit from one or two paragraphs presenting how YT has been studied and reviewing existing YouTube-related tutorials. This would help the authors to situate the proposed tutorial, explain its relevance to the reader, and show how it differs from others.
Is there a reference (i.e. a paper, GitHub repository, white paper) for the Wordfish algorithm? If so, I recommend that the author include it in the text. Moreover, all tools or scripts in use or mentioned in the text also deserve a proper reference ;)
Data collection methods refer to different technicalities; for example, web scraping, crawling and API calling have different features and functions to help scholars with the task of building a dataset. The authors mention that they have used the YouTube Data API to retrieve data from the platform. However, this section's title and subtitle use "scraping" as the method. So, what does this tutorial propose as the data collection method? Scraping the front-end interface of YT, or making calls to its API? With the former, one extracts data, while with the latter, one requests and retrieves data from an API. To help the reader understand and follow the tutorial and data collection method, I recommend the authors make this clear. If it helps, I'm happy to share the pdf files of two-part guides offering an overview of the knowledge needed to collect data using APIs (see: https://dx.doi.org/10.4135/9781529611441 and https://dx.doi.org/10.4135/9781529611458).
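To make the distinction concrete, an API call in R looks something like the sketch below (using the `tuber` package as one possible client; the credentials and video ID are placeholders), whereas a scraper would instead parse the rendered HTML of the watch page:

```r
library(tuber)  # one possible R client for the YouTube Data API

# Authenticate with OAuth credentials created in the Google Cloud console
# (both values below are placeholders)
yt_oauth(app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")

# An API call *requests and retrieves* structured data from the platform,
# rather than *extracting* it from the front-end interface
comments <- get_comment_threads(filter = c(video_id = "dQw4w9WgXcQ"),
                                max_results = 100)
head(comments)
```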
As for research ethics, AoIR provides good guidelines that the authors should consider including in the tutorial, thereby providing more concrete perspectives to the reader. For example: Markham AN, Buchanan E (2012). Ethical decision-making and internet research: recommendations from the AoIR ethics working committee (version 2.0). Retrieved from: http://aoir.org/reports/ethics2.pdf; and Markham A (2017). Impact model for ethics: notes from a talk. Retrieved from: https://annettemarkham.com/2017/07/impact-model-ethics/.
The subsection "Introducing the Wordfish Text Mining Algorithm" could have appeared sooner in the text, as it explains the main objectives of the tutorial and what is necessary to do for those interested in it.
Part II: Scraping the YouTube API
Please see my comments and suggestions on the data collection method, and reconsider the method named in the title. Keep in mind that one makes API calls to request and retrieve data (not to scrape data; for that, a web scraper would do the job) ;)
Maybe provide a table showing the API quota limits for retrieving comments via the YT Data API? This could be a valuable resource for the reader.
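For example, something along these lines (figures drawn from Google's quota documentation at the time of writing, so worth re-checking before publication; the default daily allowance is 10,000 units):

| Operation | Quota cost (units) |
| --- | --- |
| `commentThreads.list` | 1 |
| `comments.list` | 1 |
| `videos.list` | 1 |
| `search.list` | 100 |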
Bullet points and a short title can help when explaining step-by-step procedures, for example:
How to create YT credentials? 1. First, xxxxxxxx
Screenshots, gifs or short videos are often super helpful for these tasks.
Suggestion: maybe rephrasing this subtitle "Making a list of videos" to something like "How to create YouTube comments dataset?" Also, it would help to provide a visual protocol summarizing all possibilities for this type of dataset building (i.e. video or channel ids and keywords as entry points for retrieving comments), while using the video comments as a practical example. This visual protocol should also include the requirements for using predictive modelling.
The code chunk that combines video metadata with the comment text and comment metadata, while renaming some columns for clarity, is a nice proposal :)
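For readers skimming this thread, the kind of join being praised might look roughly like this (a sketch with `dplyr`; the data frame and column names are placeholders, not the lesson's actual code):

```r
library(dplyr)

# `comments` and `video_metadata` are placeholder data frames sharing
# a `videoId` key; the .x/.y suffixes come from the join's name clashes
comments_full <- comments %>%
  left_join(video_metadata, by = "videoId") %>%
  rename(comment_text      = textOriginal,
         comment_published = publishedAt.x,
         video_published   = publishedAt.y)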
Part III: Optimizing YouTube Comment Data For Wordfish
Recommendation: Use "Now that the comments are retrieved" rather than "scraped".
Beyond explaining how Wordfish models work (great job here!), I recommend the author provide a concrete and short example, so the reader can also perceive how it works in practical terms. This would help one to envision the methodological potentials of Wordfish.
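For instance, a toy run along these lines would let readers see the model's outputs at a glance (a minimal sketch using `quanteda` and `quanteda.textmodels`; the four mini-documents are invented):

```r
library(quanteda)
library(quanteda.textmodels)

# Four invented mini-documents standing in for comment threads
txts <- c(doc1 = "cut taxes shrink government spending now",
          doc2 = "lower taxes and less government regulation",
          doc3 = "expand healthcare and protect public services",
          doc4 = "fund healthcare schools and public investment")

dfmat <- dfm(tokens(txts))

# Fit Wordfish; dir = c(1, 4) anchors doc1 to the left of doc4,
# fixing the direction of the estimated latent dimension
wf <- textmodel_wordfish(dfmat, dir = c(1, 4))
summary(wf)  # theta: document positions; beta: word weights
```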
I wonder if Wordfish models read mentions (@name), hashtags, links, and emojis. These relevant and valuable objects and actions would bring more context and richness to the textual analysis. Instead, the analysis automatically ignores YouTube usage practices in comments by removing mentions, hashtags, links and emojis. I want to invite the authors to reflect on this matter.
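For context, the removal the reviewer describes typically happens at tokenisation, and keeping these features is a small change. A sketch with `quanteda` (`comment_corpus` is a hypothetical corpus object; argument behaviour is worth double-checking against the quanteda version used in the lesson):

```r
library(quanteda)

toks <- tokens(comment_corpus,
               remove_punct   = TRUE,
               remove_url     = TRUE,    # drops links
               remove_symbols = TRUE)    # drops most emoji

# quanteda's tokenizer keeps #hashtags and @mentions intact as single
# tokens, so dropping them is a separate, explicit (and reversible) choice:
toks_stripped <- tokens_remove(toks, pattern = c("@*", "#*"))
# Skipping that last line retains mentions and hashtags for analysis
```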
Part IV: Modeling YouTube Comments in R with WordFish
The opening section brings a detailed explanation of WordFish, comparing this model with topic modelling. It is an excellent subsection because it provides further information about the model. However, it should have appeared earlier, when the author introduces WordFish. At this point, the reader is expecting to run the model and check the outputs of WordFish for YT comments, learning how to read and interpret them.
I wonder if the authors could suggest an alternative visualisation to analyse the top words. Perhaps one that would avoid overlapping the words and facilitate interpretation.
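One possibility (a sketch, not the lesson's actual code): label only the strongest-scaling words and use `ggrepel` to nudge labels apart, assuming a fitted Wordfish object `wf` as in the toy example above:

```r
library(ggplot2)
library(ggrepel)

# Pull word-level estimates out of a fitted Wordfish object `wf`
word_df <- data.frame(word = wf$features,
                      beta = wf$beta,  # position on the latent dimension
                      psi  = wf$psi)   # word fixed effect (frequency)

# Plot all words as faint points, but label only the strongest scalers,
# letting ggrepel place labels so that none overlap
ggplot(word_df, aes(x = beta, y = psi)) +
  geom_point(alpha = 0.3, size = 0.5) +
  geom_text_repel(data = subset(word_df, abs(beta) > 2),
                  aes(label = word),
                  size = 3, max.overlaps = 20)
```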
Overall, I think the tutorial is very useful and well put together. I agree with many of Dr. Omena’s suggestions regarding clarity and usability. If following the tutorial one piece at a time, a user is likely to successfully gather and visualize this data. However, I think an alternate organization that begins with a clear justification for this method, overviews the functionality of each tool, and then ends in a more clear-cut set of directions would be useful. I also think more or different examples (see comments below) would add to the clarity of the tutorial and help researchers know when and how to employ these methods. I think most of my suggestions or comments come back to that idea: a tutorial is most useful when a user knows when and why to deploy it, not just how. I think more clarity in that regard would be useful.
Opening: It would be useful if there was a justification for why a person would need to know how to use R or how to scrape/crawl data. I think there is less need to justify YouTube as a site of research and more need to demonstrate to the audience what kinds of questions/problems this method of scraping and visualizing data might address. If the audience is scholars who don't know how to use these kinds of methods, they'll need to be able to make the connection between this method and their overarching goals.
P35-36: This section is a bit confusing. It's unclear how many words/comments would be needed to build a successful model; an example would be useful here. Maybe a link to a video whose comments work well, or a linked CSV that demonstrates suitable comments.
P38: There is guidance here about what is “better”, but what is “better” is determined by a researcher’s purpose and dataset. The use of “better” here is somewhat confusing to me because I am not sure what the goal is. If the goal is to model differential data points, then this makes more sense, but I think I’m confused about the purpose of the model.
P54: If a person is doing the tutorial as they are reading it, I think they’ll do ok with keeping up with the steps. However, it would be challenging to retrace steps or review steps quickly because steps are so embedded. It might be nice for there to be a clearer set of step-by-step instructions to revisit so that a person isn’t looking for signal phrases like “Now that you’ve finalized your list of videos, gathered the metadata and comments for each video, and optimized your data for Wordfish through filtering, you are ready to clean and wrangle the data!”
P62-63: It would be nice to get a clearer sense of which “approaches” are described here and when/why a person might choose one over another.
P87: While I understand that a user can manipulate the visualizations created in WordFish, these models don't feel especially useful to me because they are illegible. I think demonstrating some simplified visualizations might be useful and/or providing alternate formats where the visualization is more usable.
Thanks @jannajoceli and @hlang264 for your helpful feedback! @nabsiddiqui is there anything you were going to add that we should take into consideration before starting to revise?
@hawc2 I was just going to summarize what the other reviewers said, but I think you can go ahead and start revising based on these suggestions. If there is anything you think isn't particularly relevant, we can discuss that later if you don't want to put it into the revision.
hey all, just to say we've almost finished revising this lesson. I'm going to do some last tweaks and tests of the code next week, and then I'll ping you all when it should be ready for a final review.
A note to say that the lesson markdown file has been renamed, and the lesson slug adjusted accordingly.
I have also:
Thank you @anisa-hawes! @nabsiddiqui, I think this draft should be ready for final review by the peer reviewers.
I'm curious to hear how it goes when you test the code. It seems to be working on our end, but things can get especially wonky with the first part about accessing the API.
@nlgarlic and I are available to make further revisions and updates, and are happy to chat about any of our decisions here. We did include information on setting up an account to access the YouTube API, but we are still leaning a bunch on the Google documentation because it keeps changing.
Thanks everyone for your patience with the time it took us to update this lesson, and we look forward to next steps!
Hey @hawc2. It will likely be about another two weeks until I get to this, but I have placed a reminder in my calendar to come back to it soon.
@nabsiddiqui any updates on this lesson's timeline from here?
Hi there,
Thanks for your patience with this. I've just had a chance to review the changes, and I think the final draft is a great contribution. Thanks for your work on this--I'm excited to see how folks use it. I think this is ready to move on to the next stage.
Best wishes,
Heather
Hello @nabsiddiqui @hawc2 and everyone. Thanks for this lesson, it would be great to see it published - especially since people who, like me, can no longer easily collect Twitter data are looking for ways to collect other social media data. Thus, if I may, I would like to ask how the code could be adapted to collect data from the live chat comments sometimes available on the right side of the video (and not from the comments that appear under a video). I do not find a lot of resources out there explaining how to do this. Thanks a lot!
@nabsiddiqui I'll meet with my co-authors this week to discuss final revisions we will make to the lesson. We are aiming to complete revisions for this lesson within the next few weeks. Please let us know if there's anything additional we should take into consideration at this time. Partly what we aim to do is make sure nothing has changed in the YouTube API that could affect the lesson now.
@spapastamkou thanks for your encouraging thoughts! Agreed that it's valuable to offer tutorials on other social media platforms less restrictive than X/Twitter. I'll chat with my co-authors about your question regarding live chat comments. I remember us looking into this at one point and thinking it seemed doable, but outside the scope of this lesson. We could at least add some brief info about that in the lesson to point people in the right direction.
thanks @hawc2 !
I looked through the wording, etc. and I think we can now close out this portion of the review process. So, I think we are ready to start working on next steps @anisa-hawes and @hawc2
@nabsiddiqui @anisa-hawes just a heads up, we are making final edits on the lesson now. We've done some thinking and made a few changes to the part of the project requiring access to the Google API that I think will make the lesson much more sustainable.
Once this round of edits is done, I'll hand it over to @anisa-hawes for a final look over. @anisa-hawes I think it will be necessary for you to take on the ME role in this case, doing one last read through of the lesson for quality control, and giving us a last round of edits before it is sent to copyeditor for preparation to be published. Let's aim to publish this lesson in early 2024?
Hey Charlotte, I thought Anisa was going to do a read through and give us feedback first? It's ok if she wants to do it after copyedits, but I just want to flag that since I am the author on this, I can't do the final review as ME, so I was hoping Anisa could provide us feedback for one last round of review.

I also have a few edits I need to make, which I can make as soon as today if you are about to go into copyedits. Let me know if I can still do that or if I should wait. My recommendation in the future would be to double check with authors that they are done editing before bringing things into copyedits, as my last communication with Anisa via Slack had not suggested this was ready for copyedits quite yet.

Best, Alex
On Fri, Feb 16, 2024, 5:42 AM charlottejmc wrote:

Hello @hawc2, @jantsen, and @nlgarlic,

This lesson is now with me for copyediting. I aim to complete the work by ~Friday 08 March.

Please note that you won't have direct access to make further edits to your files during this Phase.

Any further revisions can be discussed with your editor @nabsiddiqui, after copyedits are complete.

Thank you for your understanding.
Hi @hawc2, yes, Anisa will be adding her feedback shortly!
I was a little hasty and initially posted a comment + opened a branch to start copyediting, but I quickly deleted both as I realised I wanted to wait for Anisa's comment before I began my work. I imagine the email updates from GitHub might not have reflected that on your side.
There's no rush for you to make edits just now.
My apologies for the confusion!
Hello @hawc2,
Thank you again for the opportunity to read this co-authored lesson, text-mining-youtube-comments. I think it is excellent. I can see that you've made some substantial revisions (https://github.com/programminghistorian/ph-submissions/commit/c4d044d0103461c93497ce066fadb2e10a779d42 and https://github.com/programminghistorian/ph-submissions/commit/bea9140a7fc2bb014a4555f6e22fdd2dd5c89461) since I shared my feedback with you by email.
Mapping my initial feedback against the current draft here on GitHub, I thought it might be useful to set out my remaining suggestions as optional tasks / ideas for you to think through and accept or reject. I'll check off the suggestions/points which I think you have already resolved.
Overall, I suggested that the lesson structure might be revised so that the table of contents is simplified. At the moment, the sub-sections are very granular and, from my point of view as a reader, this made navigation quite confusing. I think the following revisions could enable you to provide readers with a broader overview of the lesson as a whole, as well as clear signposts into specific parts. It seems to me that the high-level chapter headings could be:
- Introduction (comprising your overview of the method + ethical considerations)
- Data Collection (comprising set up + install + getting started with the tools to download metadata and comments)
- Cleaning and Modeling (comprising removing stopwords, applying filters to clean your data, modeling the columns)
- Analysis
- Visualisation
- Conclusions
- Endnotes
Anyway, these suggestions are grouped by section headings as they are (rather than paragraph/line number, because everything has shifted since I worked through this).
- Add in-text links between sections (`[mention of section](#section-title)`) and add definitions of technical terms
- "`stringr` package [...]" – reword as per suggestion
- `quanteda` ("at a later stage")
- "`write_csv` function below:"

> The Wordfish model lends itself to two different kinds of visualizations, one which depicts the scaling of documents, and the other the scaling of words. The below code will create 'word level' visualizations which display how terminology is dispersed across the corpus object.
>
> Our project uses custom visualizations, drawing from Wordfish's underlying statistics and utilizing `ggplot2`. To produce the first type of visualization, run the following code and produce a plot of all unique comment words within the corpus:
Thanks @anisa-hawes. Just to say we will try to get the rest of these changes done in the next couple weeks!
@anisa-hawes I've taken a shot at addressing a lot of these remaining edits, and I think we're pretty close to being done. I've checked off everything we've addressed so far. There's still a few content-related revisions we need to make, but we probably won't have time until later next week.
I'm wondering if this week you and @charlottejmc can start on the copyedits for this lesson, or at least help us with the remaining checkboxes here that you can address as easily as we can? I think at least 20 of the remaining items here are essentially copyedits.
In response to the edit suggestion for the #Interpretation section, I don't think it makes sense to bring commentary from the Visualization section up to the Interpretation section, as we are speaking about a different level of interpretation in that section. Let us know if you still think there's an issue with repetition.
We're mostly working now on your recommendations for the Ethical Considerations section. We might not do all of the edits you've suggested, but we'll try to make it work. Overall, the Introduction section with the two subsequent sections seems to work well to outline the main lesson argument and content. Or I might not be totally understanding what issues some of these remaining edits are trying to address.
The additions to the Downloading Comments and Metadata section make sense and we'll do that by next week.
Thank you, @hawc2. That makes sense.
Absolutely happy to assign this to Charlotte for copyediting this week. She will be pleased to support with resolving any remaining copyedit-like clarifications from the checklist too.
(We can re-read the Ethical Considerations sections when you've made any additional edits that you feel are suitable).
Hi @hawc2, @jantsen and @nlgarlic,
As Anisa mentioned, I'm very happy to start copyediting the lesson in its current state. I'll open a branch and work on my copyedits this week, including the points from Anisa's checklist which I can take on myself.
I think it would help keep things clear if you held off adding your own edits until I've prepared the Pull Request with my final copyedited version (I anticipate this should be ready by the end of this week, or Wednesday next week).
Once I've tagged you there, we could work together on integrating your outstanding changes to the clean copy in the copyedit branch, before merging everything at once. That should help avoid any conflicts with the main branch!
Hello @hawc2, @jantsen, @nlgarlic, and @nabsiddiqui,
I've prepared a PR with the copyedits for your review.
There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find instructions for how to reply to my in-line comments, and how to make the outstanding changes which I know you have still been working on.
Hello @hawc2, @jantsen and @nlgarlic,
Thank you very much for addressing my comments in the copyedit PR. I've now merged this in to update the lesson preview, which you can refer to in order to keep making the changes you are still working on.
I was not able to resolve all of my comments directly in the PR, so I'll list the outstanding points below to help guide your work:
alt="[Chart type] of [data type] where [reason for including chart]"
Thanks @charlottejmc. I've made a series of edits resolving some of the outstanding issues. My co-authors are going to reread the lesson in full one last time over the next week, and we'll try to finalize everything by mid-May.
In the meantime, can I send you an additional screenshot image and links to add? I can work with you on resolving some of these citation and formatting details.
Hello @hawc2,
Thank you for the opportunity to re-read this lesson. I’ve suggested a couple of very small line copyedits (https://github.com/programminghistorian/ph-submissions/commit/dd57637f60c57cd61f05a6cebe1ba465786f8a71), and have a final few comments to share below (paragraph numbers refer to the preview):
contents:
As discussed yesterday, I think that clarifying the three main sections of the lesson using the headings you suggested (Data Collection, Data Preparation, and Data Analysis) sounds very sensible, and I think it would benefit the reader. I tend to find granular sub-sections less useful, unless they delineate key actions within a section.
A suggestion could be:
## Introduction
## Data Collection
? ### Set up [Setting up your coding environment]
## Data Preparation
## Data Analysis
## Conclusion
## Endnotes
para 1:
I note this is an advanced (`difficulty: 3`) lesson. I would like to suggest that you include a sentence in paragraph one which establishes that. For example, something like: This lesson is aimed at those who have an established understanding of Natural Language Processing (NLP), and particularly those who have developed experience of textual analysis.
Reviewing the difficulty matrix, I think the key things are to be upfront about the prior knowledge and applied experience that are required. I think you can include specialist and technical terms confidently within that context, then explain concepts that could still be new.
Otherwise I think, for example, the final sentence of paragraph 1 marks a shift into a domain of knowledge that the initial sentences don't prepare me for. I'm not necessarily suggesting adding a link or integrating a definition of primary dimensions at this difficulty level (because this is above my learning-level, I don't know if it is 'new'), but that term immediately stands out to me as something which might need to be flagged as part of the prerequisite knowledge.
para 4-6:
Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.
para 9:
You refer to ‘recent scholarship’ and ‘a wide range of qualitative sociological studies’, which I think it would be useful to either cite, or collate as further readings.
para 11:
You describe the dataset as ‘expansive’. I just wanted to query this. Am I mis-remembering the size of the sample dataset? I recall that you’d previously mentioned a dataset of ~6 videos (provided that they have ~2000 comments combined).
para 1-16:
For the introductory paragraphs (from 1-16), I'd suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical 'This lesson will...'-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copyedits if you agree.
para 31:
I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.
para 70:
I note that the term `psi` values doesn't appear to have been mentioned previously, and isn't expanded/defined here.
para 103:
You give the example of adding the contraction didn't to your stopwords list, but in the previous step you mention that you have used the `tokens` function to remove punctuation. Would apostrophes in elided word forms be an exception?
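For what it's worth, a quick console check suggests contractions do survive `remove_punct` as single tokens (behaviour may vary across quanteda versions, so worth verifying):

```r
library(quanteda)

tokens("Well, I didn't like it!", remove_punct = TRUE)
## Expected (quanteda keeps infix apostrophes, so "didn't" stays one token):
## text1: "Well" "I" "didn't" "like" "it"
```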
figures 3 + 4:
I wonder if you might consider adding a fully zoomed-in view of each of the word clusters as additional figures? I found the interpretation of Figures 3 and 4 difficult to follow because the words are close to being indecipherable.
para 123:
I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.
figure 5:
I note here that the terms used as axis labels on the graph (theta and psi) are new to me, and don't appear to be explained in the lesson.
paras 133-135:
I wonder if you might take the opportunity to reflect on your finding (para 131) that the channel’s political affiliation does not seem to be a strong predictor of the commenters’ political positions. The questions in my mind: Are there learnings you could share about the specific things you might change (perhaps a different selection of videos, or a larger corpus of comment data)? Despite the 'polarized' political positions of the channels you selected from (according to the allsides ranking), were commenters equally engaged in discussion? Would comparison visualisations of comments responding to individual videos provide greater insight? What other kinds of research questions would be most usefully explored using this method?
Thank you @anisa-hawes! I think we're almost there. I've done another round of thorough edits, and we only have a few last issues to address (including some recent updates we identified to relevant libraries, which will affect the code ever so slightly). While we work on those last items, I'm handing edits back to you to address some of the outstanding issues you said you could handle.
You can go ahead and do these:

- para 31: I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.
- para 123: I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.
- para 4-6: Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.
One thing you said I wasn't quite sure about. You said: "para 1-16:For the introductory paragraphs (from 1-16), I’d suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical This lesson will-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copy edits if you agree." I think to some extent the current structure was based on your recommendation to put the key steps of the lesson clearly up front. I might be misunderstanding what reorg you intend to do here, but feel free to take a shot at it. There is some repetition between the intro and the Scholarly Research and Learning Outcomes sections in the Intro, but it doesn't seem to require a major reordering.
If you have ideas for how to further streamline the Table of Contents and headings, please go ahead and do those edits as well. We can review your changes and decide if we agree, as it's difficult to describe otherwise.
We also have a few more images to add, that I can send to you for uploading in the coming days/week. Hopefully we can be done with all edits on our end by June 6th at the latest!
Hi @hawc2,
I've added in the new images you sent over Slack. I've gone through and ironed out the filenames and numbers to get a little clarity: we now have 9 figures, numbered 01-09 (not 3a, 3b, etc.).
Just a couple notes:
The alt-text should go further than repeating the figure captions. We have found Amy Cesal's guide to Writing Alt Text for Data Visualization useful. This guide advises that alt-text for graphs and data visualisations should consist of the following:
alt="[Chart type] of [data type] where [reason for including chart]"
What Amy Cesal's guide achieves is prompting an author to reflect on their reasons for including the graph or visualisation. What idea does this support? What can a reader learn or understand from this visual?
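For instance, a first-draft alt-text for one of the Wordfish plots following that pattern might read (wording invented purely for illustration):

```
alt="Scatter plot of Wordfish word-level estimates where the horizontal spread of terms shows which vocabulary drives the ideological scaling of the comment corpus"
```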
The Graphs section of Diagram Center's guidance is also useful, with some key points (relevant to all graph types) that we can take away from it.
For general images, Harvard's guidance notes some useful ideas. A key point is to keep descriptions simple, and adapt them to the context and purpose for which the image is being included.
Would you feel comfortable making a first draft of the alt-text for each of the figures? This is certainly a bit time-consuming, but very worthwhile in terms of making the lesson accessible to the broadest possible audience.
Do let us know when you are ready to hand the lesson over for copyediting.
Thank you very much! ✨
@charlottejmc we have our final edits ready, I'll send them to you via email and you can go ahead with copyedits. Thanks!
Dear @hawc2,
Thank you for providing your final Phase 5 edits over email. I've now applied these to the lesson file. Just a couple points to go over together:
> The Wordfish model assigns two parameters to each word used in the corpus studied (beta and psi), and a similar two to each document (alpha and theta).

However, I still see 'psi' on the y-axis for Figure 9 itself. Would we need to update the image?

I also went back through the issue to look for any un-ticked checkboxes that may have been lost in the back-and-forth.
From Anisa's comment:
From my comment:
"[...] `comments.csv` files in the `ytdt_data` folder" – however, having downloaded `ytdt_data.zip`, I see that the `comments.csv` and `basicinfo.csv` files are nested within the folders for each videoID. Am I understanding correctly that the code will know to go through these videoID folders to find the `comments.csv` and `basicinfo.csv` files within?

From your comment:
I understand you may not want to make all of these changes (quite a few of them are simply suggestions), but it would be very helpful if you could give a final sign-off on what you think of them!
Thanks so much for your work on this. ✨
@charlottejmc we're working on a final round of edits of this draft, for you to take over for final copyedits by Wednesday, the 24th of July. I'm adding additional thoughts and comments here to address the remaining concerns and questions. There are a few details here that we are leaving for you and @anisa-hawes to decide whether they should be included during copyedits, such as specific references addressing gaps you mentioned.
[x] "Figure 9: At line 458, you say The Wordfish model assigns two parameters to each word used in the corpus studied (beta and psi), and a similar two to each document (alpha and theta). However, I still see 'psi' on the y-axis for Figure 9 itself. Would we need to update the image?"
[x] "On a similar note, would you be able to give a clearer explanation of what exactly alpha, beta, and psi are? You've also described them as 'model outputs' or 'parameters' interchangeably, which seems a little surprising."
[x] "Please can you confirm the asset link for the sample dataset you're providing readers, and confirm whether the code at line 199 refers to it correctly?"
This is correct as is. We performed a final check here, and the appropriate data was available within the GitHub zip file, with identical file structure. The list.files() function does search within directories and subdirectories.
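For anyone re-checking this later, the behaviour hinges on the `recursive` argument. A minimal sketch (folder and file names follow the YouTube Data Tools layout described above, not necessarily the lesson's exact code):

```r
library(readr)
library(purrr)

# Find every comments.csv nested inside the per-videoID subfolders
comment_files <- list.files("ytdt_data",
                            pattern = "comments\\.csv$",
                            recursive = TRUE,   # descend into videoID folders
                            full.names = TRUE)

# Read and stack them into a single data frame
all_comments <- map_dfr(comment_files, read_csv)
```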
[x] Below are 3 links to info on stemming and lemmatization, with a quick note on why we might or might not want to include each. We are fine with Wikipedia as an alternative. A lot of the example code out there uses Python; the tutorials in R were much less clear as to the actual substantive discussion of the terminology.

- https://ayselaydin.medium.com/2-stemming-lemmatization-in-nlp-text-preprocessing-techniques-adfe4d84ceee – a very short read that clearly distinguishes between the two approaches, following up with a bit of Python code.
- https://www.datacamp.com/tutorial/stemming-lemmatization-python – a longer read, but gives specific pros and cons to each approach, which I really like.
- https://keremkargin.medium.com/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768 – more of a step-by-step guide to NLP in Python, though the specific examples of stemming vs lemmatization come early on and are very clear and to the point.
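For an R-flavoured illustration of the difference (a sketch: `tokens_wordstem()` is quanteda's Snowball stemmer, while the lemma lookup table below is invented for demonstration):

```r
library(quanteda)

toks <- tokens("running runs ran better studies")

# Stemming: rule-based suffix stripping (fast, sometimes yields non-words)
tokens_wordstem(toks)
## e.g. "run" "run" "ran" "better" "studi"

# Lemmatisation maps words to dictionary forms instead, e.g. via a lookup:
lemmas <- c(running = "run", runs = "run", ran = "run",
            better = "good", studies = "study")
tokens_replace(toks, pattern = names(lemmas), replacement = lemmas)
## "run" "run" "run" "good" "study"
```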
[x] "IMPORT DATA - Share a template displaying the required column names + order (in raw Markdown) for readers who “choose to use a YouTube comment dataset downloaded with a tool other than YouTube Data Tools”. (If you let us know the column names + order, @charlottejmc and I could create this in raw Markdown for readers)."
The variables listed below are needed for the initial input – note that these are all character vector inputs, and that the additional columns utilized in the lesson are added by the lesson code itself.

Columns: 7
$ videoId
[...]

EDIT: considering the recent edits which removed the quoted sentence, we think it makes sense without the template columns.
Two small issues we noticed, which I'm flagging here; I'll add additional notes later if our edits bring up other questions:
Sincere thanks, @hawc2, for all the energy you, Nicole @nlgarlic, and Jeff @jantsen have dedicated to finalising your revisions.
This lesson is now with @charlottejmc for copyediting; she aims to complete the work by Friday 2nd August.
@hawc2, @nlgarlic and @jantsen please don't make further edits to your files during this Phase.
Any further revisions can be discussed after copyedits are complete.
Thank you for your understanding.
Hello @hawc2, @nlgarlic, @jantsen and @nabsiddiqui. I've prepared a PR with the copyedits for your review.
There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find brief instructions for how to reply to any questions or comments which came up during the copyedit.
When you're both happy, we can merge in the PR.
Thank you!
Hello @hawc2,
This lesson's sustainability + accessibility checks are in progress.
Publisher's sustainability + accessibility actions:
Authorial / editorial input to YAML:
- `difficulty:`, based on the criteria set out here
- `activity:` this lesson supports (acquiring, transforming, analysing, presenting, or sustaining). Choose one
- `topics:` (api, python, data-management, data-manipulation, distant-reading, get-ready, lod ["Linked Open Data"], mapping, network-analysis, web-scraping, website ["Digital Publishing"], r, machine-learning, creative-coding, web-archiving or data-visualization). Choose one or more. Let us know if you'd like us to add a new topic. Topics are defined in /_data/topics.yml.
- `alt-text` for all figures
- `abstract:` for the lesson
- an image to represent the lesson. The image must be:
  - copyright-free
  - non-offensive
  - an illustration (not a photograph)
  - at least 200 pixels in width and height
  Image collections of the British Library, Internet Archive Book Images, Library of Congress Maps as well as their Photos/Prints/Drawings or the Virtual Manuscript Library of Switzerland are useful places to search
- `avatar_alt:` (visual description of that thumbnail image)
- author bio(s) to be added to `ph_authors.yml` using this template:

- name: Forename Surname
orcid: 0000-0000-0000-0000
team: false
bio:
en: |
Forename Surname is an Assistant Professor in the Department of Subject at the University of City.
es: |
Forename Surname es profesor adjunto en el Departamento de Asignaturas de la Universidad de Ciudad.
fr: |
Forename Surname est professeur assistant au département des matières à l'université de City.
pt: |
Forename Surname é professor assistente no Departamento de Assunto da Universidade da Cidade.
Files we are preparing for transfer to Jekyll:

EN:

Promotion:

Please add suggested promotional posts to the `ph-evergreens-twitter-x` spreadsheet that has been shared with you, or email them to me at publishing.assistant[@]programminghistorian.org.

--

Hello Alex @hawc2, Nicole @nlgarlic, Jeff @jantsen, and @nabsiddiqui,
We're very close to publication!
A few final elements to confirm from Charlotte's comment above:
- name: Nicole 'Nikki' Lemire-Garlic
orcid: 0000-0002-8988-5188
team: false
bio:
en: |
Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.
- name: Jeff Antsen
orcid: 0000-0002-9787-7583
team: false
bio:
en: |
Jeff Antsen is the owner of Half-Moon Research, an independent consultancy specializing in market research and mixed-methods approaches.
If we have everything together, we could publish this mid-week next week.
Thank you, Anisa
Thanks, Anisa! I've attached a signed copy of the declaration form and here is my info:

Name: Nicole "Nikki" Lemire-Garlic
Orcid: https://orcid.org/0000-0002-8988-5188
Bio: Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.
Thank you, @nlgarlic!
Please could you email the copyright form to Charlotte at publishing.assistant[@]programminghistorian.org? (We don't receive attachments within replies to GitHub comments.)
--
As Alex @hawc2 is needing to play the role of both ME + co-author, I've done a final read-through of this lesson, and have made a few line edits: https://github.com/programminghistorian/ph-submissions/commit/25191046f4d5848ad5d45e51a8cdd2cf55d7dae9 (+ a small correction of my own mistake! https://github.com/programminghistorian/ph-submissions/commit/e9609b4ab64526c2c65941e9d9a92b064ecacd5a)
The Programming Historian has received the following tutorial “Text Mining YouTube Comment Data with Wordfish in R” by @hawc2, @jantsen, and @nlgarlic. This lesson is now under review and is available here:
http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.
Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.
Anti-Harassment Policy
This is a statement of the Programming Historian’s principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.