programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Text Mining YouTube Comment Data with Wordfish in R #374

Closed nabsiddiqui closed 3 months ago

nabsiddiqui commented 3 years ago

The Programming Historian has received the following tutorial “Text Mining YouTube Comment Data with Wordfish in R” by @hawc2, @jantsen, and @nlgarlic. This lesson is now under review and is available here:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian’s principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

nabsiddiqui commented 3 years ago

This is a very interesting tutorial that I think our audience will enjoy. There are two macro issues that need to be addressed before moving forward, and some suggestions for fixing small typos:

Macro Issues

  1. The code blocks throughout this tutorial need to contain comments. Right now, they are difficult to follow.
  2. When submitting the tutorial, it would be better if it was turned into a Markdown file first so that those that are reviewing it can follow it more easily. The code should still be available to create the visualizations and the visualization should be below the code. There is information on how to do that here: https://bookdown.org/yihui/rmarkdown/markdown-document.html.

Micro Issues

Please let me know what your timeline is on this and any questions you may have @hawc2, @jantsen, and @nlgarlic.

hawc2 commented 3 years ago

Thanks @nabsiddiqui. I made the minor edits you mentioned, except that Paragraphs 8 and 9 seem worth keeping, perhaps as a footnote?

I also wasn't sure what screenshot we should insert for Paragraph 3 in Configuring your Code?

We will update the code to include more comments as you ask, and once we get the knitting to work correctly with the .rmd file, we will update the Github repo with the proper formatting for the markdown file.

We aim to be done with our edits next week. I'll update you when the file is ready for review.

nabsiddiqui commented 3 years ago

Sounds good @hawc2. Let me know if you need anything else on my end. And yes, paragraphs 8 and 9 can be kept as footnotes.

svmelton commented 2 years ago

Hi @nabsiddiqui and @hawc2! Just checking in to see if you needed any help moving this lesson forward.

nlgarlic commented 2 years ago

Hi - thanks for reaching out. We actually just met to go over the changes this week. We expect to have them ready in the next couple of weeks at the latest.

Nikki

On Dec 9, 2021, at 7:21 PM, Sarah Melton wrote:

Hi @nabsiddiqui and @hawc2! Just checking in to see if you needed any help moving this lesson forward.

nabsiddiqui commented 2 years ago

Hey @svmelton. @hawc2 had requested some additional time to work on this in an email he sent to me. I should probably have communicated that in the issue tracker. But no, I think we are all good on moving the lesson forward as planned.

anisa-hawes commented 2 years ago

Hello all,

Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r

Please let me know if you encounter any difficulties or have any questions.

Very best, Anisa

jenniferisasi commented 2 years ago

@nabsiddiqui we now have a new topic for R for the menu, please include it as "r" in the topics in the lesson metadata. Thanks

anisa-hawes commented 2 years ago

Thank you, @jenniferisasi! I've added this to the YAML for you @nabsiddiqui.

hawc2 commented 2 years ago

@nabsiddiqui we’re excited to report we finished updating our YouTube scraping tutorial and it should now be ready for review: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r

Apologies for the delays. After we submitted this lesson earlier in the pandemic, we discovered that there were a few sustainability issues, especially involving some of the libraries we were using to wrangle text data into Wordfish. We’ve switched to quanteda, and in the process, we condensed the lesson and hopefully simplified/clarified a few sections. We’ve made some other updates as well in order to streamline the code, including removing specific directions for setting up access to the YouTube API, since those directions seem to be changing regularly, and we can link to the Google page with the directions.

We also removed a few options for granular scraping, including a way to search for videos through the API. We intend to provide some of these alternatives on our Github page for those who’d like to explore further. Near the end, we’ve reduced the number and complexity of visualizations, but if reviewers think it needs more, we could build that section out more. Probably some steps in the newer version of the code need to be explicated further.

We look forward to getting feedback on this lesson!

nabsiddiqui commented 2 years ago

Dr. Heather Lang @hlang264 and Dr. Janna Joceli Omena @jannajoceli have graciously agreed to serve as reviewers. We are shooting for a late August response for now @hawc2.

hawc2 commented 2 years ago

Thanks @nabsiddiqui.

One update - YouTube has created a streamlined path for Researchers looking to access their API: https://research.youtube/

This might make some aspects of our tutorial easier, and may require some updating. We'll investigate how this process works and update our tutorial accordingly in the fall after we receive reviewer feedback.

jannajoceli commented 2 years ago

Tutorial review by Janna Joceli Omena

Title: "Text Mining YouTube Comment Data with Wordfish in R"
GitHub link: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Editor: Nabeel Siddiqui

Overall evaluation

This tutorial uses a natural language processing algorithm (Wordfish) to conduct textual analysis of data from YouTube. It presents a good overview of YouTube as a platform. However, it requires some improvement in clarifying the data collection method to the reader. Moreover, the tutorial could benefit from some work on reorganising the order of sections and, in some cases, renaming the headings and subheadings. Finally, bullet point lists, annotated screenshots, gifs and short videos are recommended as pedagogic tools to be considered in this tutorial.

Review statement

The tutorial review will follow a bullet-point format, proposing suggestions, providing feedback and raising questions to the authors.

Part I: Introduction to YouTube Scraping and Analysis

Part II: Scraping the YouTube API

How to create YT credentials?

  1. First, xxxxxxxx
  2. Then, xxxxxxxxx
  3. Finally, xxxxxxxx

Screenshots, gifs or short videos are often super helpful for these tasks.

Part III: Optimizing YouTube Comment Data For Wordfish

Part IV: Modeling YouTube Comments in R with WordFish

hlang264 commented 2 years ago

Overall, I think the tutorial is very useful and well put together. I agree with many of Dr. Omena’s suggestions regarding clarity and usability. If following the tutorial one piece at a time, a user is likely to successfully gather and visualize this data. However, I think an alternate organization that begins with a clear justification for this method, overviews the functionality of each tool, and then ends in a more clear-cut set of directions would be useful. I also think more or different examples (see comments below) would add to the clarity of the tutorial and help researchers know when and how to employ these methods. I think most of my suggestions or comments come back to that idea: a tutorial is most useful when a user knows when and why to deploy it, not just how. I think more clarity in that regard would be useful.

Opening: It would be useful if there was a justification for why a person would need to know how to use R or how to scrape/crawl data. I think there is less need to justify YouTube as a site of research and more need to demonstrate to the audience what kinds of questions/problems this method of scraping and visualizing data might address. If the audience is scholars who don’t know how to use these kinds of methods, they’ll need to be able to make the connection between this method and their overarching goals.

P35-36: This section is a bit confusing. It’s unclear how many words/comments would be needed to build a successful model; an example would be useful here. Maybe a link to a video whose comments model successfully, or a linked CSV that demonstrates suitable comments.

P38: There is guidance here about what is “better”, but what is “better” is determined by a researcher’s purpose and dataset. The use of “better” here is somewhat confusing to me because I am not sure what the goal is. If the goal is to model differential data points, then this makes more sense, but I think I’m confused about the purpose of the model.

P54: If a person is doing the tutorial as they are reading it, I think they’ll do ok with keeping up with the steps. However, it would be challenging to retrace steps or review steps quickly because steps are so embedded. It might be nice for there to be a clearer set of step-by-step instructions to revisit so that a person isn’t looking for signal phrases like “Now that you’ve finalized your list of videos, gathered the metadata and comments for each video, and optimized your data for Wordfish through filtering, you are ready to clean and wrangle the data!”

P62-63: It would be nice to get a clearer sense of which “approaches” are described here and when/why a person might choose one over another.

P87: While I understand that a user can manipulate the visualizations created in WordFish, these models don’t feel especially useful to me because they are illegible. I think demonstrating some simplified visualizations might be useful and/or providing alternate formats where the visualization is more usable.

hawc2 commented 2 years ago

Thanks @jannajoceli and @hlang264 for your helpful feedback! @nabsiddiqui is there anything you were going to add that we should take into consideration before starting to revise?

nabsiddiqui commented 2 years ago

@hawc2 I was just going to summarize what the other reviewers said, but I think you can go ahead and start revising based on these suggestions. If there is anything you think isn't particularly relevant, we can discuss that later if you don't want to put it into the revision.

hawc2 commented 1 year ago

Hey all, just to say we've almost finished revising this lesson. I'm going to do some last tweaks and tests of the code next week, and then I'll ping you all when it should be ready for a final review.

anisa-hawes commented 1 year ago

A note to say that the lesson markdown file has been renamed, and the lesson slug: adjusted accordingly.

I have also:

hawc2 commented 1 year ago

Thank you @anisa-hawes! @nabsiddiqui, I think this draft should be ready for final review by the peer reviewers.

I'm curious to hear how it goes when you test the code. It seems to be working on our end, but things can get especially wonky with the first part about accessing the API.

@nlgarlic and I are available to make further revisions and updates, and are happy to chat about any of our decisions here. We did include information on setting up an account to access the YouTube API, but we are still leaning a bunch on the Google documentation because it keeps changing.

Thanks everyone for your patience with the time it took us to update this lesson, and we look forward to next steps!

nabsiddiqui commented 1 year ago

Hey @hawc2. It will likely be about another two weeks until I get to this, but I have placed a reminder in my calendar to come back to it soon.

hawc2 commented 1 year ago

@nabsiddiqui any updates on this lesson's timeline from here?

hlang264 commented 1 year ago

Hi there,

Thanks for your patience with this. I've just had a chance to review the changes, and I think the final draft is a great contribution. Thanks for your work on this--I'm excited to see how folks use it. I think this is ready to move on to the next stage.

Best wishes,

Heather

spapastamkou commented 1 year ago

Hello @nabsiddiqui @hawc2 and everyone. Thanks for this lesson. It would be great to see it published, especially since people who, like me, can no longer easily collect Twitter data are looking for ways to collect other social media data. Thus, if I may, I would like to ask how the code could be adapted to collect data from the live chat comments sometimes available on the right side of the video (and not from the comments that appear under a video). I do not find a lot of resources out there explaining how to do this. Thanks a lot!

hawc2 commented 1 year ago

@nabsiddiqui I'll meet with my co-authors this week to discuss final revisions we will make to the lesson. We are aiming to complete revisions for this lesson within the next few weeks. Please let us know if there's anything additional we should take into consideration at this time. Partly what we aim to do is make sure nothing has changed in the YouTube API that could affect the lesson now.

@spapastamkou thanks for your encouraging thoughts! Agreed that it's valuable to offer tutorials on other social media platforms less restrictive than X/Twitter. I'll chat with my co-authors about your question regarding live chat comments. I remember us looking into this at one point and thinking it seemed doable but outside the scope of this lesson. We could at least add some brief info about that in the lesson to point people in the right direction.

spapastamkou commented 1 year ago

thanks @hawc2 !

nabsiddiqui commented 1 year ago

I looked through the wording, etc. and I think we can now close out this portion of the review process. So, I think we are ready to start working on next steps @anisa-hawes and @hawc2

hawc2 commented 11 months ago

@nabsiddiqui @anisa-hawes just a heads up, we are making final edits on the lesson now. We've done some thinking and made a few changes to the part of the project requiring access to the Google API that I think will make the lesson much more sustainable.

Once this round of edits is done, I'll hand it over to @anisa-hawes for a final look over. @anisa-hawes I think it will be necessary for you to take on the ME role in this case, doing one last read-through of the lesson for quality control, and giving us a last round of edits before it is sent to the copyeditor for preparation to be published. Let's aim to publish this lesson in early 2024?

hawc2 commented 9 months ago

Hey Charlotte,

I thought Anisa was going to do a read-through and give us feedback first? It's OK if she wants to do it after copyedits, but I just want to flag that since I am the author on this, I can't do the final review as ME, so I was hoping Anisa could provide us feedback for one last round of review. I also have a few edits I need to make, which I can make as soon as today if you are about to go into copyedits. Let me know if I can still do that or if I should wait.

My recommendation in the future would be to double-check with authors that they are done editing before bringing things into copyedits, as my last communication with Anisa via Slack had not suggested this was ready for copyedits quite yet.

Best,
Alex

On Fri, Feb 16, 2024, 5:42 AM charlottejmc wrote:

Hello @hawc2, @jantsen, and @nlgarlic,

This lesson is now with me for copyediting. I aim to complete the work by ~Friday 08 March.

Please note that you won't have direct access to make further edits to your files during this Phase.

Any further revisions can be discussed with your editor @nabsiddiqui, after copyedits are complete.

Thank you for your understanding.


charlottejmc commented 9 months ago

Hi @hawc2, yes, Anisa will be adding her feedback shortly!

I was a little hasty and initially posted a comment + opened a branch to start copyediting, but I quickly deleted both as I realised I wanted to wait for Anisa's comment before I began my work. I imagine the email updates from GitHub might not have reflected that on your side.

There's no rush for you to make edits just now.

My apologies for the confusion!

anisa-hawes commented 7 months ago

Hello @hawc2,

Thank you again for the opportunity to read this co-authored lesson text-mining-youtube-comments. I think it is excellent. I can see that you've made some substantial revisions (https://github.com/programminghistorian/ph-submissions/commit/c4d044d0103461c93497ce066fadb2e10a779d42 and https://github.com/programminghistorian/ph-submissions/commit/bea9140a7fc2bb014a4555f6e22fdd2dd5c89461) since I shared my feedback with you by email. Mapping my initial feedback against the current draft here on GitHub, I thought it might be useful to set out my remaining suggestions as optional tasks, or as ideas for you to think through and reject if you prefer. I'll check off the suggestions/points which I think you have already resolved.

Overall, I suggested that the lesson structure might be revised so that the table of contents is simplified. At the moment, the sub-sections are very granular and, from my point of view as a reader, this made navigation quite confusing. I think the following revisions could enable you to provide readers with a broader overview of the lesson as a whole, as well as clear signposts into specific parts. It seems to me that the high-level chapter headings could be:

Introduction #comprising your overview of the method + ethical considerations
Data Collection #comprising set up + install + getting started with the tools to download metadata and comments
Cleaning and Modeling #comprising removing stopwords, applying filters to clean your data, modeling the columns
Analysis 
Visualisation
Conclusions
Endnotes

Anyway, these suggestions are grouped by section headings as they are (rather than paragraph/line number, because everything has shifted since I worked through this).


Introduction

YouTube and Discourse Analysis

Learning Outcomes

Data Collection

Ethical Considerations for Social Media Analysis

Accessing YouTube Data

Video Selection

Downloading Comments and Metadata

Set Up your Coding Environment

Install R and RStudio

Install R Libraries

Import Data

Data Labeling

Pre-processing and Cleaning

Remove Stopwords and Punctuation

Wordfish

Interpretation

Latent Meaning

Document Feature Matrices (DFM)

Create a Corpus in R

Select Comments

Build Corpus Object

Data Transformation

Data Optimization

Verification

Build Wordfish Model

Unique Words

Removing Outliers

hawc2 commented 7 months ago

Thanks @anisa-hawes. Just to say we will try to get the rest of these changes done in the next couple weeks!

hawc2 commented 7 months ago

@anisa-hawes I've taken a shot at addressing a lot of these remaining edits, and I think we're pretty close to being done. I've checked off everything we've addressed so far. There are still a few content-related revisions we need to make, but we probably won't have time until later next week.

I'm wondering if this week you and @charlottejmc can start on the copyedits for this lesson, or at least help us with the remaining checkboxes here that you can address as easily as we can? I think at least 20 of the remaining items here are essentially copyedits.

hawc2 commented 7 months ago

In response to the edit suggestion for the #Interpretation section, I don't think it makes sense to bring commentary from the Visualization section up to the Interpretation section, as we are speaking about a different level of interpretation in that section. Let us know if you still think there's an issue with repetition.

We're mostly working now on your recommendations for the Ethical Considerations section. We might not do all of the edits you've suggested, but we'll try to make it work. Overall, the Introduction section with the two subsequent sections seems to work well to outline the main lesson argument and content. Or I might not be totally understanding what issues some of these remaining edits are trying to address.

The additions to the Downloading Comments and Metadata section make sense and we'll do that by next week.

anisa-hawes commented 7 months ago

Thank you, @hawc2. That makes sense.

Absolutely happy to assign this to Charlotte for copyediting this week. She will be pleased to support with resolving any remaining copyedit-like clarifications from the checklist too.

(We can re-read the Ethical Considerations sections when you've made any additional edits that you feel are suitable).

charlottejmc commented 7 months ago

Hi @hawc2, @jantsen and @nlgarlic,

As Anisa mentioned, I'm very happy to start copyediting the lesson in its current state. I'll open a branch and work on my copyedits this week, including the points from Anisa's checklist which I can take on myself.

I think it would help keep things clear if you held off adding your own edits until I've prepared the Pull Request with my final copyedited version (I anticipate this should be ready by the end of this week, or Wednesday next week).

Once I've tagged you there, we could work together on integrating your outstanding changes to the clean copy in the copyedit branch, before merging everything at once. That should help avoid any conflicts with the main branch!

charlottejmc commented 6 months ago

Hello @hawc2, @jantsen, @nlgarlic, and @nabsiddiqui,

I've prepared a PR with the copyedits for your review.

There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find instructions for how to reply to my in-line comments, and how to make the outstanding changes which I know you have still been working on.

charlottejmc commented 6 months ago

Hello @hawc2, @jantsen and @nlgarlic,

Thank you very much for addressing my comments in the copyedit PR. I've now merged this in to update the lesson preview, which you can refer to in order to keep making the changes you are still working on.

I was not able to resolve all of my comments directly in the PR, so I'll list the outstanding points below to help guide your work:

alt="[Chart type] of [data type] where [reason for including chart]"

hawc2 commented 6 months ago

Thanks @charlottejmc. I've made a series of edits resolving some of the outstanding issues. My co-authors are going to reread the lesson in full one last time over the next week, and we'll try to finalize everything by mid-May.

In the meantime, can I send you an additional screenshot image and links to add? I can work with you on resolving some of these citation and formatting details.

anisa-hawes commented 6 months ago

Hello @hawc2,

Thank you for the opportunity to re-read this lesson. I’ve suggested a couple of very small line copyedits (https://github.com/programminghistorian/ph-submissions/commit/dd57637f60c57cd61f05a6cebe1ba465786f8a71), and have a final few comments to share below (paragraph numbers refer to the preview):

contents:

As discussed yesterday, I think that clarifying the three main sections of the lessons using the headings you suggested (Data Collection, Data Preparation, and Data Analysis) sounds very sensible, and I think it would benefit the reader. I tend to find granular sub-sections less useful, unless they delineate key actions within a section.

A suggestion could be:

## Introduction
## Data Collection
? ### Set up [Setting up your coding environment]
## Data Preparation
## Data Analysis
## Conclusion 
## Endnotes

para 1:

I note this is an advanced (difficulty: 3) lesson. I would like to suggest that you include a sentence in paragraph one which establishes that. For example, something like: "This lesson is aimed at those who have an established understanding of Natural Language Processing (NLP), and particularly those who have developed experience of textual analysis."

Reviewing the difficulty matrix, I think the key things are to be upfront about the prior knowledge and applied experience that are required. I think you can include specialist and technical terms confidently within that context, then explain concepts that could still be new.

Otherwise I think, for example, the final sentence of paragraph 1 marks a shift into a domain of knowledge that the initial sentences don’t prepare me for. I’m not necessarily suggesting adding a link or integrating a definition of "primary dimensions" at this difficulty level (because this is above my learning level, I don't know if it is 'new'), but that term immediately stands out to me as something which might need to be indicated as part of the prerequisite knowledge.

para 4-6:

Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.

para 9:

You refer to ‘recent scholarship’ and ‘a wide range of qualitative sociological studies’, which I think it would be useful to either cite, or collate as further readings.

para 11:

You describe the dataset as ‘expansive’. I just wanted to query this. Am I mis-remembering the size of the sample dataset? I recall that you’d previously mentioned a dataset of ~6 videos (provided that they have ~2000 comments combined).

para 1-16:

For the introductory paragraphs (from 1-16), I’d suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical "This lesson will"-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copyedits if you agree.

para 31:

I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.

para 70:

I note that the term psi values doesn’t appear to have been mentioned previously, and isn’t expanded/defined here.

para 103:

You give the example of adding the contraction didn’t to your stopwords list, but in the previous step you mention that you have used the token function to remove punctuation. Would apostrophes in elided word forms be an exception?
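The apostrophe question can be illustrated with a small sketch. This is plain Python, not the lesson's R/quanteda code, and the two hypothetical helper functions below are assumptions made for illustration only. It contrasts a tokenizer that drops standalone punctuation but keeps apostrophes inside words with one that deletes every punctuation character. Only in the first case would a stopword entry such as "didn't" ever match a token.

```python
import re
import string


def tokenize_keep_contractions(text):
    """Return lowercase word tokens, keeping apostrophes that sit inside a word.

    Standalone punctuation (commas, question marks, ...) is never matched,
    so it is effectively removed.
    """
    return re.findall(r"\w+(?:'\w+)*", text.lower())


def tokenize_strip_all_punct(text):
    """Delete every ASCII punctuation character, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


# Hypothetical comment and stopword list, for illustration only.
comment = "They didn't vote, did they?!"
stopwords = {"they", "did", "didn't"}

# Contraction survives tokenization, so the stopword "didn't" matches it.
kept = [t for t in tokenize_keep_contractions(comment) if t not in stopwords]

# Apostrophe is deleted first, so "didnt" slips past the stopword "didn't".
stripped = [t for t in tokenize_strip_all_punct(comment) if t not in stopwords]

print(kept)      # ['vote']
print(stripped)  # ['didnt', 'vote']
```

Which of these two behaviours the lesson's tokenizer follows would need to be checked against the library's own documentation; the sketch only shows why the answer matters for the stopword list.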

figures 3 + 4

I wonder if you might consider adding in a fully zoomed-in view of each of the word clusters as additional figures? I found the interpretation of Figures 3 and 4 difficult to follow because the words are close to indecipherable.

para 123:

I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.

figure 5:

I note here that the terms used as axis labels on the graph, theta and psi, are new to me and don’t appear to be explained in the lesson.
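For reference, both symbols come from the standard Wordfish model, a Poisson scaling model for word counts. The statement below uses the usual notation for that model rather than anything quoted from the lesson draft:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
```

Here $y_{ij}$ is the count of word $j$ in document $i$, $\alpha_i$ is a document fixed effect (absorbing document length), $\psi_j$ is a word fixed effect (absorbing how frequent the word is overall), $\beta_j$ is the word's weight in discriminating between positions, and $\theta_i$ is the document's estimated position on the latent dimension.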

paras 133-135:

I wonder if you might take the opportunity to reflect on your finding (para 131) that the channel’s political affiliation does not seem to be a strong predictor of the commenters’ political positions. The questions in my mind: Are there learnings you could share about the specific things you might change (perhaps a different selection of videos, or a larger corpus of comment data)? Despite the 'polarized' political positions of the channels you selected from (according to the allsides ranking), were commenters equally engaged in discussion? Would comparison visualisations of comments responding to individual videos provide greater insight? What other kinds of research questions would be most usefully explored using this method?

hawc2 commented 5 months ago

Thank you @anisa-hawes! I think we're almost there. I've done another round of thorough edits, and we only have a few last issues to address (including some recent updates to relevant libraries that we identified, which will affect the code ever so slightly). While we work on those last items, I'm handing edits back to you to address some of the outstanding issues you said you could handle.

You can go ahead and do these:

- para 31: I think we should add a full citation for the YouTube Data Tools software using these guidelines which follow Software Heritage recommendations.
- para 123: I wonder if the note at paragraph 123 would be best as an information box? The text in the visualisations is so faint that I think the point about image quality and how to use the zoom functions to explore should be prominent.
- para 4-6: Just a note to say that if we were to revise and simplify the table of contents (as discussed yesterday), some adjustments will be required here.

One thing you said I wasn't quite sure about. You said: "para 1-16: For the introductory paragraphs (from 1-16), I’d suggest reorganisation to ensure that the contextual overviews are grouped ahead of the practical This lesson will-type paragraphs which help to signpost the reader. You indicate this intention with the two sub-section headings (YouTube and Scholarly Research, Learning Outcomes) but I think some further adjustment might be useful. This is something we can do in copy edits if you agree." I think to some extent the current structure was based on your recommendation to put the key steps of the lesson clearly up front. I might be misunderstanding what reorg you intend to do here, but feel free to take a shot at it. There is some repetition between the opening paragraphs and the Scholarly Research and Learning Outcomes sections of the Intro, but it doesn't seem to require a major reordering.

If you have ideas for how to further streamline the Table of Contents and headings, please go ahead and do those edits as well. We can review your changes and decide if we agree, as it's difficult to describe otherwise.

We also have a few more images to add, that I can send to you for uploading in the coming days/week. Hopefully we can be done with all edits on our end by June 6th at the latest!

charlottejmc commented 5 months ago

Hi @hawc2,

I've added in the new images you sent over Slack. I've gone through and ironed out the filenames and numbers to get a little clarity: we now have 9 figures, numbered 01-09 (not 3a, 3b, etc.).

Just a couple notes:

We have found Amy Cesal's guide to Writing Alt Text for Data Visualization useful. This guide advises that alt-text for graphs and data visualisations should consist of the following:

alt="[Chart type] of [data type] where [reason for including chart]"

What Amy Cesal's guide achieves is prompting an author to reflect on their reasons for including the graph or visualisation. What idea does this support? What can a reader learn or understand from this visual?

The Graphs section of Diagram Center's guidance is also useful. Some key points (relevant to all graph types) we can take away from it are:

For general images, Harvard's guidance notes some useful ideas. A key point is to keep descriptions simple, and adapt them to the context and purpose for which the image is being included.

Would you feel comfortable making a first draft of the alt-text for each of the figures? This is certainly a bit time-consuming, but very worthwhile in terms of making the lesson accessible to the broadest possible audience.

Do let us know when you are ready to hand the lesson over for copyediting.

Thank you very much! ✨

hawc2 commented 5 months ago

@charlottejmc we have our final edits ready, I'll send them to you via email and you can go ahead with copyedits. Thanks!

charlottejmc commented 4 months ago

Dear @hawc2,

Thank you for providing your final Phase 5 edits over email. I've now applied these to the lesson file. Just a couple points to go over together:

I also went back through the issue to look for any un-ticked checkboxes that may have been lost in the back-and-forth.

From Anisa's comment:

From my comment:

From your comment:


I understand you may not want to make all of these changes (quite a few of them are simply suggestions), but it would be very helpful if you could give a final sign-off on what you think of them!

Thanks so much for your work on this. ✨

hawc2 commented 4 months ago

@charlottejmc we're working on a final round of edits of this draft, for you to be able to take over for final copyedits by Wednesday, the 24th of July. I'm adding additional thoughts and comments here to address the remaining concerns and questions. There are a few details here that we are leaving for you and @anisa-hawes to decide whether they should be included during copyedits, such as specific references addressing gaps you mentioned.

EDIT: considering the recent edits which removed the quoted sentence, we think it makes sense without the template columns.

Two small issues we noticed I'm flagging here, will add additional notes later if our edits bring up other questions:

anisa-hawes commented 4 months ago

Sincere thanks, @hawc2, for all the energy you, Nicole @nlgarlic, and Jeff @jantsen have dedicated to finalising your revisions.

This lesson is now with @charlottejmc, who aims to complete the copyediting by Friday 2nd August.

@hawc2, @nlgarlic and @jantsen please don't make further edits to your files during this Phase.

Any further revisions can be discussed after copyedits are complete.

Thank you for your understanding.

charlottejmc commented 4 months ago

Hello @hawc2, @nlgarlic, @jantsen and @nabsiddiqui. I've prepared a PR with the copyedits for your review.

There, you'll be able to review the 'rich-diff' to see my edits in detail. You'll also find brief instructions for how to reply to any questions or comments which came up during the copyedit.

When you're both happy, we can merge in the PR.

Thank you!

charlottejmc commented 3 months ago

Hello @hawc2,

This lesson's sustainability + accessibility checks are in progress.

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments

Publisher's sustainability + accessibility actions:

Authorial / editorial input to YAML:

The image must be:

- name: Forename Surname
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Forename Surname is an Assistant Professor in the Department of Subject at the University of City.
    es: |
      Forename Surname es profesor adjunto en el Departamento de Asignaturas de la Universidad de Ciudad.
    fr: |
      Forename Surname est professeur assistant au département des matières à l'université de City.
    pt: |
      Forename Surname é professor assistente no Departamento de Assunto da Universidade da Cidade.

Files we are preparing for transfer to Jekyll:

EN:

Promotion:

anisa-hawes commented 3 months ago

Hello Alex @hawc2, Nicole @nlgarlic, Jeff @jantsen, and @nabsiddiqui,

We're very close to publication!

A few final elements to confirm from Charlotte's comment above:

- name: Nicole 'Nikki' Lemire-Garlic
  orcid: 0000-0002-8988-5188
  team: false
  bio:
    en: |
      Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.
- name: Jeff Antsen
  orcid: 0000-0002-9787-7583
  team: false
  bio:
    en: |
      Jeff Antsen is the owner of Half-Moon Research, an independent consultancy specializing in market research and mixed-methods approaches.

If we have everything together, we could publish this mid-week next week.

Thank you, Anisa

nlgarlic commented 3 months ago

Thanks, Anisa! I’ve attached a signed copy of the declaration form and here is my info:

Name: Nicole “Nikki” Lemire-Garlic

Orcid: https://orcid.org/0000-0002-8988-5188

Bio: Nikki Lemire-Garlic is faculty in the Department of Judicial Studies at the University of Nevada, Reno.

anisa-hawes commented 3 months ago

Thank you, @nlgarlic!

Please could you email the copyright form to Charlotte? > publishing.assistant[@]programminghistorian.org. (We don't receive attachments within replies to GitHub comments).

--

As Alex @hawc2 needs to play the role of both ME + co-author, I've done a final read-through of this lesson, and have made a few line edits: https://github.com/programminghistorian/ph-submissions/commit/25191046f4d5848ad5d45e51a8cdd2cf55d7dae9 (+ a small correction of my own mistake! https://github.com/programminghistorian/ph-submissions/commit/e9609b4ab64526c2c65941e9d9a92b064ecacd5a)