Corpus Analysis with SpaCy

jrladd commented 1 year ago

The Programming Historian has received the following tutorial on 'Corpus Analysis with SpaCy' by @mkane968. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/corpus-analysis-with-spacy

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

mkane968 commented 1 year ago

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

jrladd commented 1 year ago

Thanks @mkane968! I'll start reaching out to potential reviewers.

anisa-hawes commented 1 year ago

I note that Figures, 11, 13, 17 and 20 are not displaying in the Preview. I'm going to take a look and try to assess what needs adjustment.

[x] I've fixed a typing error in the file name of Figure 11 (png was included within the file name). I've also reduced the size of this image to 840 px on longest edge
[x] I've swapped the 'escaped' quotation marks in caption for Figure 13 for single, unescaped quotes
[x] I've fixed a typing error in the liquid syntax for Figure 17 (an erroneous extra letter preceded the filename)
[x] I've fixed a typing error in the liquid syntax for Figure 20 (an erroneous extra letter preceded the filename)
[x] I think it would be good to re-size the images for this lesson, to a maximum of 840px on the longest side. This is important for readers in countries with slower internet speeds.

All images are displaying in the preview now 🙂

jrladd commented 1 year ago

@maria-antoniak and @wjbmattingly have agreed to serve as reviewers, and they'll aim to get their reviews posted here around March 20th (with some flexibility). Thanks, everyone! Let me know if you have any questions - otherwise I'll check back in March.

jrladd commented 1 year ago

The only other thing, @mkane968, would be the image size issue that @anisa-hawes mentioned in her last post. If you don't mind resizing those images, I can help you re-upload them to the repo.

mkane968 commented 1 year ago

Sounds great, thanks for the update, @jrladd! I will aim to have the images resized and sent to you by the end of the week.

jrladd commented 1 year ago

Thanks for sending those files, @mkane968! @anisa-hawes All the resized files have been uploaded, and I checked to make sure everything loaded correctly. We should be all set now until the reviews come in.

maria-antoniak commented 1 year ago

Hi @mkane968! Thanks for the opportunity to read and comment on this tutorial. My review is below and is mostly focused on technical details of the tutorial. Please feel free to let me know if you have any questions!

This tutorial provides a nice introduction to data processing for text corpora using spaCy. The tutorial covers tokenization, lemmatization, POS tagging, and NER (with some references and examples of parsing as well). I think this could be a very useful tutorial for newcomers to NLP, especially those working with specific small text corpora.

Before publication, the tutorial needs to fix some typos, errors, and missing code blocks, which I’ve noted below. Importantly, lowercasing and removal of punctuation needs to happen after using spaCy, not before. spaCy’s parsing and other functions rely on these cues, and we can see the resulting mistakes it makes in FIgures 14-15.

Just my preference, but I might run the spaCy pipeline just once, saving the doc objects as I go. Then I’d use these doc objects later to extract tokens, lemmas, etc. This seems clearer to me than re-running in each section, and it will definitely be faster.

Paragraph 1: The use of the word “clean,” while common in reference to “data cleaning,” nevertheless connotes that the original data is “not clean.” Similar objections exist to terms like “tidy” and “raw data.” Personally, I try to avoid these terms in favor of terms like “process,” “analyze,” or “prepare.” Not hugely important, but something to consider.

Paragraph 1: Some contradiction between these paragraph and Paragraph 18. Is experience with spaCy required or recommended? And if someone has experience with spaCy, do they need help with installation, uploading data to a Python notebook, and running basic functions like lemmatization? I might reframe to more of a beginner framing, “no knowledge of spaCy required,” since the tutorial reads like an “Intro to NLP” tutorial rather than a “NLP 201” tutorial.

Paragraph 1: I think you could sell this tutorial more strongly. Who would benefit most? What kinds of questions might they have? I would set this up in the very first paragraph. “Do you have a big dataset of text, but you’re not sure what to do with it? Maybe you want to use it as input for some machine learning model, but you’re not sure of the first steps?” Or you could lead with some fun example. “Imagine you’re a historian with a dataset of French Revolution speeches. You want to use these speeches as input to a machine learning model, but first, you need to get your data into a format that the computer with recognize…”

Paragraph 4: Missing the final take away, which is that computational tools can do this more quickly than a human and at scale.

Paragraph 4: SpaCy is mostly a set of functions built around a parser, so probably parsing should also be mentioned here.

Paragraph 4: First bulleted item, “nlp” should be capitalized.

Paragraph 4: Second bulleted item, perhaps instead of “best” (given recent dramatic advances), this could be changed to “fast and accurate algorithms.”

Paragraph 7: Change dashes to em dashes.

Paragraph 7: “Given its value in retrieving linguistic annotations…” The use of the word “retrieving” is throwing me off a bit. I would probably use “discovering” or “predicting.” Just a nit pick, might just be me.

Paragraph 7: Could be nice to have some visual here of the dataset (screenshot of the website? histogram of essays per genre?), or some example snippets of the text.

Paragraph 8: “huge corpora” 800 documents is very small. I'd just say “larger corpora,” and it might be worth including a note somewhere about dataset scales and what size of data people can expect to process.

Paragraph 9: “clean” again, and here I object a bit more strongly since it’s in conjunction with stopword removal. Stopwords are often very useful :)

Paragraph 10: In the bullet point, “has been characterized informational” missing “as”

Paragraph 19: Might want to call out that normally, you’d install spaCy into your local Python environment using pip, conda, or similar environment management tools. I got a little lost here at first, since usually I only run these kinds of commands when working in Colab, not locally, and so I thought I was in a Colab section of the tutorial (and then realized that no, this section was intended for both Jupyter and Colab).

Paragraph 19: If they haven’t yet installed spaCy, and if they’re working locally, they’ll also need to download the English model. I’d point them again here to the spacY installation instructions.

Paragraph 20: Code block nit pick: PEP 8 style specifies one space after the # when starting a comment.

Paragraph 20: Not sure what the asterisk at the end of the paragraph is referencing, or perhaps it’s a typo.

Paragraph 21: I’d show these commands in a code block.

Paragraphs 23-25: This could be condensed, if you want. I think you could do something like the following:

texts = []
file_names = []
for _file_name in os.listdir(path_to_directory):
    if _file_name.endswith(‘.txt’):
        texts.append(open(path_to_directory + ‘/’ + _file_name, 'r').read()
        file_names.append(_file_name)

I would recommend specifying the path_to_directory explicitly rather than using os.chdir, just for clarity for novices who have a hard time locating files and using paths. Especially so since Paragraph 29 asks them to specify a different path.

Paragraph 27: If you call .head(), the output shouldn’t look like Figure 1. It should show the first N rows.

Paragraph 29: In the code block, is FILE_NAME actually the name of the file?

Paragraph 30: Show these steps in a code block.

Paragraph 33: “Run the following code…” The code isn’t shown?

Paragraphs 33-36: Possibly might be easier to just have them upload the files directly to Colab, rather than Google Drive and then Colab.

Paragraph 37: Don’t need to re-explain dataframes here, since we already encountered dataframes above.

Paragraph 42: “For rows should be present…” → “Four columns should be present…”

Paragraph 45: “ids” → “IDs”

Paragraph 47: The actual join is missing. Show the code block where you combine the dataframes.

Paragraph 50: These steps aren’t strictly necessary. In particular, there could be very good reasons for leaving the UTF-8 characters.

Paragraph 51: You shouldn't lowercase the text before running spaCy. SpaCY does not need this and indeed uses casing for NER. There’s no danger of “House” and “house” being counted as separate words; that’s partly what the lemmas are for. If you want to lowercase, this should be done after running spaCy.

Paragraph 52: You shouldn't remove punctuation before running spaCy. spaCY uses this for parsing and other operations. If you want to remove punctuation, this should be after running spaCy.

Paragraph 66: Stopwords can be very useful, e.g. for clustering and classification tasks. In fact, I’d expect some interesting stopword patterns in this essay data. I’d at least note this here.

Paragraph 68: Show the code.

Paragraph 71: Citation? I’m not sure this is true. In general, removing stopwords isn’t very helpful and can sometimes be harmful. One case where it is helpful is removing stopwords from topic model output to increase their legibility.

Paragraph 74: Stopword removal isn't useful for topic modeling, only for the aesthetic legibility at the end for human readers. See Alexandra Schofield’s work.

Paragraph 83: Stemming or lemmatization can be harmful for topic models. See Alexandra Schofield’s work.

Figures 14-15: These sentences and parses don’t make a lot of sense, likely because of the lowercasing and removal of punctuation.

Paragraph 114: This code block could be simpler. Why not take advantage of the POS tagging already done above? Iterate over the saved POS tags and just count those, instead of redoing everything here.

Figures 22-23: These look like counts, not averages.

Paragraph 119: I think the reference footnote isn’t rendering correctly?

Figure 24: These aren’t averages (unless those numbers are percents? though then they don’t seem to add to 100), which makes the bars hard to compare across genres.

Figure 26: Too big.

Paragraph 136: Might be good to point to newer tutorials, as there are better tools and methods available now.

jrladd commented 1 year ago

Thanks for this thoughtful review, @maria-antoniak!

You can read this over anytime, @mkane968, but no need to take any action until after we've received the second review. Then we'll make a plan for revision together. Let me know if you have any questions in the meantime!

wjbmattingly commented 1 year ago

Hi @mkane968 thanks for letting me review this new addition to the Programming Historian. Please feel free to let me know if you have any questions. I tried to keep my comments unique and not overlap with the other reviewer.

This tutorial introduces readers to the programming library spaCy. It is a welcome tutorial as it provides a concrete use-case for newcomers to the field of NLP with a particular aim at humanists. The author correctly adds the essential steps of data cleaning in this tutorial, something that is often overlooked in other tutorials. Data is messy. These steps of the tutorial will surely be a great help to many students. The author’s goals are to teach the audience 5 how to do five things: 1) upload a corpus to a cloud-based platform, 2) clean a corpus, 3) enrich a corpus, 4) perform frequency analysis, and 5) download the enriched dataset. The author succeeds in each area.

Overall, I think a bit more exposition in certain areas may be helpful for some readers. I make a note about this throughout my review, but I always feel that code should be written differently for tutorials vs. production. A lot of the code is beautiful production-ready code, but may be a bit terse for a tutorial. Tutorials are for those new to a library and maybe even new to Python. The use of lamda, map, list comprehension, while certainly best practice, may confuse some readers.

In P.3, the author could expand a bit more on “Though computational tools can’t “read” texts like humans do, they excel at identifying the lexico-grammatical patterns (e.g. key words, phrases, parts of speech) that corpus analysis researchers are looking for.” Many readers may not understand what the author means by ‘“read” texts like humans’. A deeper explanation here may be helpful or some. Just a sentence or two.

Footnote issue. I am having trouble seeing the footnote links. Perhaps this is a property of the text at this stage? Perhaps it is a browser issue?

P. 19. The author expects this to be run in a notebook, but perhaps clarify that !pip is used in a notebook cell, not in the CLI. Not all readers will know this.

P. 27, the image produced does not show the .head() command, rather it is the standard paper_df command that shows the entire df with a truncated middle section. Perhaps the image here should align with the command?

P. 33, perhaps a few other images here to help guide users through this process would be helpful. For those new to Google Collab, a lot can go wrong here and it may be helpful to have a visual of what to expect.

P. 36, You use the type function, but do not show the result. Perhaps an image here displaying the output?

P. 46. While I completely agree with the choice here to remove the .txt, those new to Python will be quite confused by the use of lamda here. Some may have never seen it. I would either explain briefly what is happening or opt for a different, more verbose and clearer approach. Perhaps you could do this at an earlier stage when you are creating your dictionary of filenames and text? Just a thought.

P. 49, perhaps explain a bit about UTF-8 and why this can raise issues. Also, did you mean remove any ‘non’-utf-8 characters? I wouldn’t presume the audience knows about utf-8, especially those who do not work with older or non-English texts.

P. 65 the author uses the disable feature of spaCy. This is the correct approach, but explain to the reader in more detail about why disabling pipes is good practice in this scenario. You make a note of it inside the code, but perhaps some exposition outside the code would be helpful here.

P. 70. Here we see list comprehension and the use of join and map and the creation of a new dataframe column all in the same line. While certainly great code, this could be a bit difficult for readers for some readers of an introduction tutorial to follow. I would recommend either explaining this a bit more or taking a more verbose approach. This could be just my style, but I always write code differently for tutorials vs. production. This feels more like production code, rather than the code one would see in a tutorial for new users of a library. Again, this is just a stylistic recommendation.

P. 80 Same note as above.

P. 85 Same note as above.

P. 88 for each of these visualizers, we are seeing images. I like the way you are panning across the image here, but perhaps you can leverage the raw HTML outputted by displaCy for both POS and NER? This could be embedded into the tutorial and allow the reader to scroll for themselves across the output as well.

P. 96 perhaps a typo here: ‘certain parts of speech, like nouns,, are retained’ with two commas.

P. 125 does a great job of providing some analysis about the data.

jrladd commented 1 year ago

Thanks so much for your review, too, @wjbmattingly!

I'm at a conference this week, but I'll comment early next week with some further thoughts and next steps. Excited to move to the next stage. 🎉

wjbmattingly commented 1 year ago

@jrladd no problem at all! Happy to help and happy to see another spaCy tutorial out there for the humanities!

jrladd commented 1 year ago

Thanks again to both our reviewers! As far as next steps, @mkane968, I think the reviews are basically in agreement about the path for revision:

I'd pay particular attention to the places where the reviewers asked for a little more context or framing. None of these are asking for a huge amount of extra writing, but the additional clarifications will help to drive home the central ideas in the tutorial. You could focus on the question of why readers will be interested in spaCy and what you expect them to get out of the tutorial.
There are lots of good notes here about how to make the code clearer for your reader, and sometimes that will involve including a little more code or making what you have a bit more verbose.
Lastly, you might consider your approach to data preparation (use of stopwords, lowercase, lemmatization), either to adjust things for better results or to make the reasons behind your current approach clearer.

I hope those items are helpful as you go through these reviews and consider your final round of revisions. Please do chime in here with any questions you have as you work, about specific things in the reviews or anything else that comes up. That's what this thread is for!

The typical timeline for this next stage of revisions is about a month. How about Monday 8 May as a deadline to aim for? Let me know if that works, and that can be adjusted as we go along, too. Thanks for all your hard work on this!

mkane968 commented 1 year ago

Thanks so much for your detailed reviews, @maria-antoniak and @wjbmattingly , and for your helpful synthesis of my next steps, @jrladd! I will get to work on these revisions. I will aim to have the revised draft submitted by Monday 8 May and will let you know of questions as they come up. Excited to continue working on this! Thanks again!!

mkane968 commented 1 year ago

Hi @jrladd, I hope you're well! I'm working through the revisions this week and have a logistical question: should I upload the revised files to the same Github page we've been using and send you the updated image files? Or do you want any/all of these files shared to this thread, or in a different format or location?

anisa-hawes commented 1 year ago

Hello @mkane968. Thank you for your question.

You can update the .md file directly if you would like to ? It is stored here /en/drafts/originals/corpus-analysis-with-spacy.md.

Alternatively, you can email me the revised version of your file and I can update it.

--

For any updated images and / or data assets, please email them to me and I'll upload these for you. My email address is: admin[@]programminghistorian.org.

Very best, Anisa

mkane968 commented 1 year ago

Thanks Anisa, I will update the markdown file directly and send the revised images and assets to you! Best, Megan

mkane968 commented 1 year ago

Hi @anisa-hawes, I've made changes in the markdown file for my tutorial and sent you an email with the revised assets. Let me know if there's anything else needed from me! Thanks, Megan

jrladd commented 1 year ago

Thanks @mkane968! 🎉

The next steps are for me to check everything and get all the metadata ready for publication. I'll get started on that, but it will probably take a couple weeks (we're currently in the midst of finals here).

In the meantime, can you send me your name as you'd like it to appear on PH, your ORCiD if you have one, and a one- or two-sentence bio? You can see a recent example bio at the bottom of this tutorial.

mkane968 commented 1 year ago

Sounds great, thanks! Here is that information:

Name: Megan S. Kane ORCiD: 0000-0003-1817-2751 Bio: Megan Kane is a PhD candidate in the English Department at Temple University.

On Tue, May 9, 2023 at 11:38 AM JR Ladd @.***> wrote:

Thanks @mkane968 https://github.com/mkane968! 🎉

The next steps are for me to check everything and get all the metadata ready for publication. I'll get started on that, but it will probably take a couple weeks (we're currently in the midst of finals here).

In the meantime, can you send me your name as you'd like it to appear on PH, your ORCiD if you have one, and a one- or two-sentence bio? You can see a recent example bio at the bottom of this tutorial https://programminghistorian.org/en/lessons/creating-guis-in-python-for-digital-humanities-projects .

— Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/546#issuecomment-1540417641, or unsubscribe https://github.com/notifications/unsubscribe-auth/APMP3ILBR6HOLVWW2DURJ4LXFJQGBANCNFSM6AAAAAAUJB7M4A . You are receiving this because you were mentioned.Message ID: @.***>

anisa-hawes commented 1 year ago

Hello @mkane968,

I saw that you already merged in the PR you created – as Alex said, we are happy for you to update the .md file directly in this repo. Thank you for sending me your updated images – I've uploaded them (I noticed that by resizing in a batch action, you had inadvertently introduced extra white space around the smaller figures, so I have cropped those).

You mentioned in your email that you'd like to share the raw .html output of figures 16, 17 and 20 as well as the images. I have uploaded these as /assets, which we can link to (so readers can download and review them). We'd need to add a sentence in each case which includes the link. For example: If you'd like to review this output as raw .html, you can download it here. Please let me know how you would like to phrase this? And also where you'd like to slot these sentences in.

--

Hello @jrladd and @hawc2,

This lesson is now moving into the Sustainability + Accessibility phase of our workflow.

I'll be coordinating the following sequence of tasks, and will keep you updated on the status of these actions using this checklist. Sustainability + accessibility actions status:

[x] Copyediting (I will co-ordinate this. We will be working with an external copyeditor)
[ ] Typesetting (I do this following the copyedit)
[ ] Addition of Perma.cc links (I do this following the copyedit)

--

Hello @jrladd and @mkane968 I'd like to ask for your help with the following:

[x] Addition of alt-text for all figures. I've added in some template text alt="Visual description of figure image", and this guidance may be useful to you.
[x] Define the lesson's difficulty: level, based on the criteria set out here
[x] Define the research activity: this lesson supports (acquiring, transforming, analysing, presenting, or sustaining)
[x] Define the lesson's topics: (apis, python, data-management, data-manipulation, distant-reading, set-up, linked-open-data, mapping, network-analysis, web-scraping, digital-publishing, r, or maching-learning)
[x] Provide a short abstract: for the lesson
[x] Receipt of authorial copyright agreement, which is available to download here. Megan, please could you fill this in and email it to me: admin [@] programminghistorian.org? I note that you posted the 'Permission to Publish' statement above, but we are phasing this out and have have introduced a more formal agreement.
[x] Thank you for preparing your author bio. (I have translated it, but only the EN is actually required until this lesson is translated in the future!)

- name: Megan S. Kane
  orcid: 0000-0003-1817-2751
  team: false
  bio:
    en: |
      Megan Kane is a PhD candidate in the English Department at Temple University.
    es: |
      Megan Kane es doctoranda del Departamento de Inglés de Temple University.
    fr: |
      Megan Kane est candidate au doctorat dans le département d'anglais de Temple University.
    pt: |
      Megan Kane é candidata a doutoramento no Departamento de Inglês da Temple University.

--

When everything is ready, I'll support Alex with the final steps towards publication:

[ ] Selecting + uploading an image for the lesson (incl. avatar_alt:)
[ ] Preparing x2 'evergreen' posts for our Twitter/Mastodon Bot + an announcement
[ ] Requesting a DOI

--

Key files can be found here:

.md file: /en/drafts/originals/corpus-analysis-with-spacy.md images: /images/corpus-analysis-with-spacy assets: /assets/corpus-analysis-with-spacy

The Lesson Preview is available here: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/corpus-analysis-with-spacy

mkane968 commented 1 year ago

Hi @anisa-hawes,

Thanks for adjusting the images; apologies for including the extra whitespace!

Regarding the html output, I think it could be phrased and positioned as follows:

[x] Asset 16: "If you'd like to review the output of this code as raw .html, you can download it here." - Put text in between the code block between Figure 16 and paragraph 93
[x] Asset 17: "If you'd like to review the output of this code as raw .html, you can download it here." - Put text in between the code block between Figure 17 and paragraph 96
[x] Asset 20: "If you'd like to review the output of this code as raw .html, you can download it here." - Put text in between the code block between Figure 20 and paragraph 107

I sent you the authorial copyright agreement and can start working on the alt text and other tasks.

Thanks,

Megan

anisa-hawes commented 1 year ago

Thank you, @mkane968. I've added those x3 sentences + links to the .html assets: https://github.com/programminghistorian/ph-submissions/commit/217067978ea5373813945abf42e784cb4d5cc7e0

Thank you also for sending the copyright form – safely received and uploaded.

I'd be happy to slot in the alt-text if you'd like to comment it here, or send it to me by email. Alternatively, you can edit the .md file directly: /en/drafts/originals/corpus-analysis-with-spacy.md. Whatever is simplest for you.

mkane968 commented 1 year ago

Great, thanks @anisa-hawes, I've added the alt-text tags into the markdown file!

mkane968 commented 1 year ago

Hi @jrladd, Just wanted to touch base about the next steps for my lesson. Based on the definitions Anisa shared, I think the difficulty, activity and topics can be defined as follows:

Difficulty: Beginner
Activity: Transforming and Analyzing
Topics: data manipulation, distant reading, python

Do these make sense, and/or are there any to add?

Also, here is the draft of a short abstract for my lesson:

This lesson demonstrates how to use the Python library spaCy to analyze large collections of texts. It details the process of using spaCy to enrich a corpus via lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. It also highlights how the linguistic annotations produced by spaCy can be analyzed to help researchers explore meaningful trends in language use across their texts, using the Michigan Corpus of Upper-Level Student Papers as an example corpus.

Thanks!

Megan

anisa-hawes commented 1 year ago

Hello @mkane968, Thank you for these notes! I've updated the YAML accordingly.

The next step is copyediting, which has now been assigned to one of our outside collaborators. I will be in touch in ~14 days that the copyedit is complete, and to ask you/John any questions/clarifications.

After that, I'll typeset the lesson and generate perma.cc links for any external web sources.

Then our Managing Editor (Alex) will undertake final pre-publication checks.

jrladd commented 1 year ago

Thanks @mkane968 for these updates, and thanks @anisa-hawes for all your help as I regrouped over the last couple weeks!

The only suggestion I might make is to give this lesson an Intermediate difficulty. You've made this topic very accessible to folks new to it, but I think the amount and type of coding here makes this an intermediate lesson by PH standards.

Let me know if there's anything more I can do as we head to the copyedit stage, and thanks both for all your hard work!

anisa-hawes commented 1 year ago

Thank you, @jrladd. I've made that adjustment: https://github.com/programminghistorian/ph-submissions/commit/7ed0ea61978083f2b19222b15aac940ce2f317e7

hawc2 commented 1 year ago

@anisa-hawes will this lesson be ready for publication in August?

hawc2 commented 1 year ago

@mkane968 were you going to share a Colab notebook with your lesson? If you have a Jupyter Notebook that you prepared for this lesson, we can include that in the /assets folder and host it through a ProgHist Google Colab account, linking it in the lesson for readers who want to run the code that way. We ask that if you do include a Colab Notebook with the code, it only retains the basic headings and code comments, but not the extended commentary published in the ProgHist lesson.

mkane968 commented 1 year ago

Hi @hawc2, Yes, here is the link to the Colab notebook with basic headings and code comments: https://colab.research.google.com/drive/1OiH-zdvWDciC8vNI6P_PyBN9faXOzZNY?usp=sharing The Jupyter notebook file is in the assets folder for my lesson: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/corpus-analysis-with-spacy Let me know if anything else is needed. Thanks! Megan

On Thu, Jul 27, 2023 at 5:54 PM Alex Wermer-Colan @.***> wrote:

@mkane968 https://github.com/mkane968 were you going to share a Colab notebook with your lesson? If you have a Jupyter Notebook that you prepared for this lesson, we can include that in the /assets folder and host it through a ProgHist Google Colab account, linking it in the lesson for readers who want to run the code that way. We ask that if you do include a Colab Notebook with the code, it only retains the basic headings and code comments, but not the extended commentary published in the ProgHist lesson.

— Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/546#issuecomment-1654634883, or unsubscribe https://github.com/notifications/unsubscribe-auth/APMP3IP76GJLOZB5QSV45X3XSLPQ7ANCNFSM6AAAAAAUJB7M4A . You are receiving this because you were mentioned.Message ID: @.***>

mkane968 commented 1 year ago

Hi @anisa-hawes, @jrladd, and @hawc2, Just checking in about the status of the lesson. Is there anything else you need from me at this point? Thanks! Megan

hawc2 commented 1 year ago

@mkane968 copyediting hit a snag, but we have a new Publishing Assistant who should be able to complete copyedits on your lesson shortly. I hope we can publish by early October. Sorry for delays!

anisa-hawes commented 1 year ago

Dear @mkane968 and @jrladd,

The copyedits for this lesson have already been completed by our external copyeditor, and I've prepared them to commit in PR #590. I'd be grateful if you could review the adjustments and confirm that you are happy for me to merge these.

Please accept my apologies for the snag at this stage of the process. When I hear from you that you're satisfied with the edits, I will expedite this lesson onwards for typesetting and final preparation for publication.

In the meantime, I have prepared Megan's .ipynb in our organisational Google Colab repository according to the conventions we are currently developing.

anisa-hawes commented 1 year ago

Thank you for reviewing, approving + merging the copyedit PR @mkane968 and @jrladd!

My confusion about the link to the .ipynb in Megan's personal repo is that this is the only .ipynb file linked in the lesson.

Where are you directing readers to the Colab notebook you created for our repo that replicates the lesson's code as active cells? At the moment, there is no link to this notebook within the lesson.

At line 78 you write:

Two versions of code are provided for this lesson: one version to be run on Jupyter Notebook and one for Google Colab. The versions are the same except when it comes to the process of retrieving and downloading files. Because a Jupyter Notebook is hosted locally, the files stored on your computer can be accessed directly. Google Colab, on the other hand, is cloud-based, and requires files to be uploaded to the Colab environment. This lesson will note such divergences below and first explain the code for working with Jupyter Notebook, then for Google Colab.

I think we need to add links to the associated materials for both versions of the code here.

--

My team's next steps will be:

Typesetting + final checks of metadata Generating archival hyperlinks

Our Publishing Assistant Charlotte will prioritise these tasks for completion this week ~29th September.

Very best, Anisa

hawc2 commented 1 year ago

Maybe it would be best to keep it to just one Jupyter Notebook, with the code catered to run on Colab. You could include additional code in a subsection, commented out, that directs users how to run the code locally on their computers? It's hard to make code for every situation, so it's ok by me if you also just decide to remove the local code version.

mkane968 commented 1 year ago

Oh okay, I understand! Here is the finalized Google Colab file: https://colab.research.google.com/drive/1OiH-zdvWDciC8vNI6P_PyBN9faXOzZNY?usp=sharing I can add it to line 78 if needed.

Alternatively, per Alex's comment, I can email a Jupyter Notebook with sections for uploading/downloading files in Google Colab (I can't attach it to this thread). Happy to revise the code to reflect this change instead.

mkane968 commented 1 year ago

Hi @jrladd , @hawc2 , @anisa-hawes, I added the Jupyter Notebook with sections to upload/download files in Google Colab to the lesson's assets folder, since it was not letting me send it via email. It's currently called "corpus_analysis_withspacy(Jupyter+Colab).ipynb" to differentiate: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/corpus-analysis-with-spacy/corpus_analysis_with_spacy_(Jupyter%2BColab).ipynb I ran it this morning to confirm it worked on both platforms. If we're going with this version rather than the separate Colab notebook, let me know if I can help revising the .md file to reflect the change, or if anything else is needed from me. Thanks! Megan

anisa-hawes commented 1 year ago

Thank you, @mkane968. I'll take a look.

To be clear: you intend this notebook to replace the previous?

My next steps will be:

Delete this one
Upload this one to our organisational Colab repo and then attach it to GitHub
In the process, I am going to rename it corpus-analysis-with-spacy.ipynb because I want to links to follow new conventions and avoid all special characters.

There's no further action needed on your side, @mkane968. Thank you.

Best, A.

mkane968 commented 1 year ago

Sounds great, thanks @anisa-hawes! Yes, the notebook I just added to the repository can replace the previous one and includes code to run in both Colab and Jupyter Notebook.

anisa-hawes commented 1 year ago

Excellent. I've uploaded the new version of the notebook, and I've adjusted line 78 so that it reads:

The code provided for this lesson can be run on either Jupyter Notebook or on Google Colab. The practical steps are the same except when it comes to the process of retrieving and downloading files. Because a Jupyter Notebook is hosted locally, the files stored on your computer can be accessed directly. Google Colab, on the other hand, is cloud-based, and requires files to be uploaded to the Colab environment. This lesson will note such divergences below and first explain the steps for working with Jupyter Notebook, then for Google Colab.

charlottejmc commented 1 year ago

Hello @hawc2,

This lesson is ready for your final review + transfer to Jekyll.

Preview:

EN: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/corpus-analysis-with-spacy

Publisher's sustainability + accessibility actions:

[x] Copyediting
[x] Typesetting
[x] Addition of Perma.cc links
[x] Check/resize images
[x] Check/adjust image filenames
[x] Receipt of author(s) copyright agreement
[x] Addition doi

Authorial / editorial input to YAML:

[x] Define difficulty:
[x] Define the research activity:
[x] Define the lesson's topics:
[x] Provide alt-text for all figures
[x] Provide a short abstract: for the lesson
[x] Agree an avatar (thumbnail image) to accompany the lesson (what about this one?)

The image must be:

copyright-free

non-offensive

an illustration (not a photograph)

at least 200 pixels width and height Image collections of the British Library, Internet Archive Book Images, Library of Congress Maps or the Virtual Manuscript >Library of Switzerland are useful places to search

[ ] Provide avatar_alt: (visual description of that thumbnail image)
[x] Provide author(s) bio for ph_authors.yml:

- name: Megan S. Kane
  orcid: 0000-0003-1817-2751
  team: false
  bio:
    en: |
      Megan Kane is a PhD candidate in the English Department at Temple University.

Files to prepare for transfer to Jekyll:

EN:

.md file: /en/drafts/originals/corpus-analysis-with-spacy.md
images: /images/corpus-analysis-with-spacy
assets: /assets/corpus-analysis-with-spacy
original avatar: /gallery/originals/corpus-analysis-with-spacy-original.png
gallery avatar: /gallery/corpus-analysis-with-spacy.png

Promotion:

[x] Prepare announcement post (using template)
[ ] Prepare x2 posts for future promotion via our social media channels

hawc2 commented 1 year ago

Thanks @charlottejmc for preparing this lesson's copy edits. I've reviewed it and made additional line edits. @mkane968 this lesson is almost ready for publication. However, you'll see I've identified a few remaining issues that we'll need your input on. One key thing to note is that this lesson is over our word limitation of 8000 words, I have a feeling alot of that is alt-text for figures and bibliography, but still, if there are places to cut during these final edits, always good to make the lesson more concise.

[x] One thing I think could be further reduced is the commentary on Jupyter Notebooks and Google Colab. It's important to note that the lesson currently draws a distinction between these two things that isn't quite true. Really the distinction is between Locally-Hosted Jupyter Notebooks and Google Colab-hosted Jupyter Notebooks. Either way, you're working in a Jupyter Notebook, regardless of whether you're using Google Colab, Binder, Anaconda, etc. Jupyter Notebooks are the coding interface, the question is whether your coding environment is hosted locally or on the cloud. Google Colab, as a hosting platform, is cloud-hosted, and it also offers additional cool features. It's worth distinguishing in that way (I added a brief note about this in the lesson). Part of this confusion could be addressed by including some standard definitions of key terms like this that are drawn from and link to wikipedia. You'll see in the Upload Text Files section I've reorganized headings so they focus on how the code is teaching the reader about dataframes and wrangling text files, which is more important than different ways of uploading files to Jupyter Notebooks. Megan, please review these changes and make any last edits if you think something is now less clear than you meant it to be. It would also help if the code for the lesson we're linked more than once in the lesson - it's easy to miss it the first time it's mentioned, it may be worth linking again at the start of the first section where the reader actually starts coding. For a good model for how to minimize discussions of differences in coding platforms, look at how you do it near the end, for the Download Enriched Dataset section. It's arguably not even necessary here to show how to do it both with and without Colab, but at least here it is isolated to a brief section about downloading, and it doesn't disrupt another lesson step (like the uploading section got conflated with data wrangling steps). If you are ok with cutting some of this down, please feel free to make those edits to simplify the discussion of the coding platform and focus on the NLP code itself.
[x] On line 216, it says When the cell is run, choose files, navigate to where you have stored the metadata.csv file, and select this file to upload:
```
metadata = files.upload()
```
Was there supposed to be a link to a file here? This part was confusing for me, because it proceeds to tell the reader to do a series of difficult steps converting this new file to a dataframe, but the code isn't provided to do so:.
[x] On line 223, it says "Then convert the uploaded .csv file to a second DataFrame, dropping any empty columns and display the first five rows to check that the data is as expected. Four rows should be present: the paper IDs, their titles, their discipline, and their type." I'm concerned that the reader won't be able to figure out how to do that based on what they've been taught so far. Can you provide the code for this section in the lesson?
[x] Something that confused me is how spaCy distinguishes between 'pipelines' and 'models'. It might help on line 114 to add a sentence to explain what this model is like and explain how spaCy distinguishes the two. In my mind, a model is built by running an algorithm on a dataset, whereas a pipeline is a coding script that can be used to build a model or that we use to take a model to use when we analyze a new dataset. The 'en_core_web_sm' I always thought of as a model, but on line 255 it's referred to as a pipeline. I see spaCy uses that language too, I just don't really understand what they mean by it in this context. Make sure you're being consistent with this terminology, and ideally define what you mean by it in a sentence.
[x] In paragraph 345, there's a repeat sentence from paragraph 336: "Lemmatization can help reduce noise and refine results for researchers who are conducting keyword searches." In this second instance in paragraph 345, I could see deleting this sentence, but I'm not really sure how it relates to the following sentences citing Matthew Lavin. Could we cut this whole last part of this paragraph, or can you write more to explain how this citation is relevant to what you're showing the reader to do here? Note if we cut the rest of this paragraph, we should remove the citation from the bibliography as well.
[x] In paragraph 505, I had a really hard time understanding this sentence: "To associate the actual parts-of-speech associated with each index, a new dictionary can be created which replaces the index of each part-of-speech for its label." Can you edit this sentence to clarify your meaning here?
[x] The section on part-of-speech tagging does run on for a while. I broke it down into two sections with different headings, to distinguish a sub-section on Fine-Grained Part of Speech Analysis. Megan, if you can think of better headings, you can make those changes. It may also be worth considering shortening this section if at all possible. The Named Entities section could also afford to have an additional subheading, you'll see I added one for "Analyzing Dates in Named Entities".

I think that covers all the remaining changes I'd suggest before publication. Once we can address these outstanding issues, we can move forward to publish, later this week or next!

anisa-hawes commented 1 year ago

Thank you, @hawc2. These copyedits were done by an external colleague in our network. (But Charlotte has done the typesetting and applied perma.cc links).

Charlotte and I are very happy to review it one more time when Megan's edits (in response to your comments above) have been addressed.

After that, you and I can work together to transfer the files and prepare to publish ☺️

mkane968 commented 1 year ago

Hi @hawc2 and @anisa-hawes, Thanks for the feedback! I will aim to review and make the edits in the next few days. Best, Megan

anisa-hawes commented 1 year ago

Hello Megan @mkane968,

Thank you for your edits (https://github.com/programminghistorian/ph-submissions/commit/cfda54be89e92af960a30f182d9de0a71e8026d3). Are there further changes you'd still like to make?
Let us know when you're ready to hand this over to us. Thank you, Anisa

mkane968 commented 1 year ago

Hi @anisa-hawes , Yes, I need to make a couple more edits still, I'll work on them asap and let you know when complete! Megan

mkane968 commented 1 year ago

Hi @anisa-hawes, I have finished my edits in response to @hawc2's review!

At his recommendation, I have reframed the tutorial so it is focused on running code in Google Colab only. I still want to share alternate code for working in a local Jupyter Notebook in the .ipynb file, but do not discuss the differences throughout the tutorial.

I revised the .ipynb file and confirmed it works in Colab and as a local Jupyter Notebook. Here is the link: https://colab.research.google.com/drive/1VWUKc11xk-A7MHw0_5nEuE3FgBfApc5y?usp=sharing

With this change in mind, there are few places where the images need to be updated to reflect that Colab is being used:

Figure 5 should be replaced with figure that shows head of metadata dataframe in Google Colab (not Jupyter Notebook)
Figure 11 should be replaced with figure that shows proper noun output from Colab (not Jupyter Notebook code)
Figure 17 should be replaced with figure that shows part of speech indexing output from Colab (not Jupyter Notebook code)
Figure 18 should be replaced with figure that shows part of speech labeling from Colab (not Jupyter Notebook)

I'm attaching the new figures to this message, happy to assist with uploading the files if needed.

In giving the lesson a final read-through, I also noticed the following:

The image files for Figure 26 and Figure 27 should be switched. The captions are correct (as well as the text discussing the images), but the images are wrong. The list of words with more numerical dates corresponds to "Proposals," and the other to "Critique/Evaluations."
There is no longer a Figure 2 in the tutorial; it was a file related to the local Jupyter Notebook. So the rest of the figure names need to be updated. I can do this but wasn't sure if it would confuse anything with the names of the image files.
The raw html link to the dependency parsing output does not work (404 page not found error). Not sure if this is just because it's not yet published?

Hope this all makes sense, happy to clarify anything needed!

Megan

New Figure 5

anisa-hawes commented 1 year ago

Thank you, @mkane968! We're grateful for your further work. No need to do anything more – we'll process the images and update our internal copy of your .ipynb.

Charlotte and I will take this forward next week, and we'll be in touch if we have any questions.

Best, Anisa

mkane968 commented 1 year ago

Hi @anisa-hawes, A quick update: @hawc2 asked me to add a couple Wikipedia links when I first mention core coding concepts. I added these in lines 30 and 82 (Python, Jupyter Notebooks). Let me know if anything else is needed. Thanks! Megan

programminghistorian / ph-submissions

Corpus Analysis with SpaCy #546

Anti-Harassment Policy