Review Ticket: Working with batches of PDF files

amsichani commented 4 years ago

The Programming Historian has received the following proposal for a lesson on 'Working with batches of PDF files' by @maehr. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/working-with-batches-of-pdf-files

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

@amsichani will act as editor. Her role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our Ombudsperson (@amandavisconti). Thank you for helping us to create a safe space.

amsichani commented 4 years ago

Thanks @maehr for this submission 👋 👋 I will be reading the lesson and providing some feedback for you to respond to, and then I will solicit formal reviewers. I anticipate being able to get an initial read back to you by the end of next week. I'll let you know if anything changes, and in the meantime let me know if you have any questions about the process.

amsichani commented 4 years ago

Hi @maehr, before posting my initial feedback on the tutorial, may I ask about the lesson's layout? Did you use a different template to generate the lesson's preview? There are a number of formatting issues related to this, as you can see, and it might be worth correcting them before moving to the next phase of peer review, as it is really hard to read (especially the last sections). Cheers!

maehr commented 4 years ago

Hi @amsichani I started to work on this lesson before the new guidelines were released, so there might be a mixup. Can you please be more specific and point out, what parts need correction? Thanks and best regards Moritz

amsichani commented 4 years ago

Everything from 'Text recognition in PDF files' to the end is really messed up; it is very difficult to even read.

maehr commented 4 years ago

I removed alert divs with inline code / code blocks and YAML errors; hopefully it helps. I cannot build the Jekyll site locally because of parsing errors in other lessons. (The repo's GitHub Pages site is not being rebuilt at the moment, so the change is only visible here: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/lessons/working-with-batches-of-pdf-files.md )

mdlincoln commented 4 years ago

@amsichani please refer to the updated guidelines: https://programminghistorian.org/en/editor-guidelines#3-add-yaml-metadata-to-the-lesson-file

unfortunately what Adam had posted up briefly included a lot of square brackets [] which was a fatally incorrect thing to do. If you remove all those (and the orcid ID from the authors which is not at all in the specification???) then this builds fine.

mdlincoln commented 4 years ago

The formatting issue after 'Text Recognition in PDF files' is that you try to put markdown inside the HTML block of the <div class="alert alert-info">. Once you start an HTML block inside a markdown document, everything in there needs to be HTML, not markdown.

amsichani commented 4 years ago

Many thanks @mdlincoln for this! @maehr, could you amend this bit so we can have a clearly readable preview of the lesson?

amsichani commented 4 years ago

Also, we're going to use this editorial process to help familiarize a newer member of the editorial team with the process and workflow. So @fdlaramee will be shadowing along as I work with you.

maehr commented 4 years ago

To my knowledge, I changed everything accordingly. Jekyll builds locally without warnings. Please tell me, if any other problem pops up or if I forgot to fix a problem.

maehr commented 4 years ago

@mdlincoln In the lesson template the endnote formatting is invalid. It is like this:

#### An End Note:

This is some text.[^1]
This is some more text.[^2]

##### Endnotes
[^1] Properly formatted citation using Chicago Manual of Style
[^2] Properly formatted citation using Chicago Manual of Style

It should be like this, with a colon (:) after each label, according to Markdown:

#### An End Note:

This is some text.[^1]
This is some more text.[^2]

##### Endnotes
[^1]: Properly formatted citation using Chicago Manual of Style
[^2]: Properly formatted citation using Chicago Manual of Style

amsichani commented 4 years ago

Hi @maehr , The lesson looks great to me. It’s a useful addition to our lessons, and I'm glad that you've taken the time to put it together! What really fascinates me is that you are (re)using existing tools and platforms to execute certain procedures which is a sustainable practice. I think there are only minor things to address before we send the lesson out for peer review. I have a couple of structural / typo remarks and a few technical points.

[x] p. 1 The (retro-)digitisation : not sure the meaning of this term is clear here. Let's wait to see what the peer reviewers feel about it.
[x] p.2 . (Batch processing): (Batch processing). The sentence finishes with a period after the parenthesis
[x] p.2 Scope: after the first sentence put : instead of .
[x] p. 3 Objectives : after the first sentence put : instead of .
[x] p.15 you note that that : omit that
[x] p.16 omit the –
[x] we managed to get the lesson's rendered version to appear correctly http://programminghistorian.github.io/ph-submissions/lessons/working-with-batches-of-pdf-files - thanks @mdlincoln !
[x] great catch on the footnotes re the template - we'll amend it! Thanks!
[x] the two images are not rendered properly, although they are where and how they should be. I'm guessing this is because there is no redirection to the lesson's image folder. Could you have a look?

Let me know if there is anything unclear. Given that my remarks are minor, we could try for a quick turnaround. Once you have made these revisions, I could then contact reviewers and move things forward.

maehr commented 4 years ago

Hi @amsichani, thanks for your feedback. I fixed everything mentioned above, retro-digitisation (which is a quite literal translation from the German Retrodigitalisierung) and images included.

I found another little issue within the YAML frontmatter of the lesson template. The original field should only be included in translations because it messes with the image path.

original: LEAVE BLANK
review-ticket: LEAVE BLANK
difficulty: LEAVE BLANK
activity: LEAVE BLANK
topics: LEAVE BLANK
abstract: LEAVE BLANK

amsichani commented 4 years ago

Fantastic @maehr! I will now try to contact reviewers for your lesson and I'll get back to you here once I have some news. Stay tuned!

amsichani commented 4 years ago

@cderose and @jackpay have agreed to serve as our lesson reviewers 🎉 . They've agreed to a submission date for their reviews of 15 November 2019 (if not earlier). Do let me know if there is anything I can help with.

cderose commented 4 years ago

@maehr and @amsichani, thank you for the opportunity to review this lesson in advance. It will be a great addition to The Programming Historian. It provides a really nice walkthrough of the various steps a researcher might take when working with text files. I especially appreciated that it was structured around a case study since that informed what the driving goal was for each of the steps. The code snippets are also concise and accessible and will be terrific to have on hand.

For the notes that follow, I included my top 3 thoughts/suggestions first, after which I listed light edits or error messages I received. I would be happy to clarify or discuss any of them.

For reference, I was using a computer with MacOS Mojave.

1. For setting up the working directory, you could have users create a new directory within Downloads from the beginning like you do in P34. They could then download all of the PDF files for the first section there. This would ensure they're not touching other PDFs they might already have in Downloads. Alternatively, you might encourage users to empty the Downloads folder prior to the lesson and could have them wait to download DARIAH until after the initial PDF section is finished (otherwise, the GREP command in P24 will search through it and return not useful stuff).
2. For each code snippet (for example, P21), depending on the anticipated audience for this lesson, you might include a sentence that breaks down what the different pieces of the code are doing. If a more advanced shell user is assumed, you could include a note early on (maybe in P4) that encourages users to paste the code into something like https://explainshell.com/ to see how it's working if they have questions.
3. P36 & P37 - Would it be possible to include a pre-processed dataset for downloading? While it's very realistic that this work takes several hours, that asks a lot of users working through a tutorial since it essentially ties up their computer. If possible, it might be more effective to have users run the code on a subset of the documents to confirm they can OCR and extract text successfully. After that, if they could download the already fully processed dataset, they would be able to move on to the topic modeling portion without a significant delay.

Table of contents: Change "Evaluation of" to "Evaluate the Topic Models" to match the verb pattern in the rest of the sequence
P1 Motivation Might insert something like "often" into the first sentence to acknowledge that not all humanities work involves working with text-based sources: "Humanities scholars often work with..."
P1 Might rephrase the sentence that begins: "As a result, humanities scholars are increasingly being forced to..." Forced to makes it sounds like humanists are unwilling and uninterested. Given the audience for this lesson, it might be more compelling to frame the increase in data as an opportunity that requires/leverages/employs Distant Reading and other algorithmic tools to surface patterns
P3 Objectives Typo in the fifth bullet point: "Do all of the above..."
P3 Objectives Could make OCR software the hyperlinked text for Tesseract rather than languages
P5 Windows 10 Might include the full specification for absolute clarity: "Fortunately, since the Windows 10 Fall Creators Update"
P7 MacOS When I ran code, I received an error about accepting the Xcode license; if Xcode is indeed a dependency, it could be worth calling it out. Another error I received that seems specific to MacOS Mojave:
    Error: Xcode alone is not sufficient on Mojave.
    Install the Command Line Tools:
      xcode-select --install
After installing the command line tools, I still got the following error:
    Error: The brew link step did not complete successfully
    The formula built, but is not symlinked into /usr/local
    Could not symlink bin/2to3
    Target /usr/local/bin/2to3
    already exists. You may want to remove it:
      rm '/usr/local/bin/2to3'
This error didn't prevent me from running any of the code in the lesson, but it might be worth mentioning in the main text or in a footnote of possible errors that can be ignored.
P10 Topic Modeling Holding control and double clicking didn't work on my Mac; you could add something along the lines of: If that doesn't work, go to System Preferences, click on Security & Privacy, and then click Open Anyway
P19 My output looked slightly different - it was nothing significant (I had an additional info line that read: INFO - Start processing 8 pages concurrent), but you could add a parenthetical to the caption that says (output might look slightly different)
P21 This is extremely helpful code to have on hand. Since in this particular case, the other documents are already OCRed, you might include a sentence that explains: All of our PDFs now have text that can be extracted. For future reference, to process all PDF files in your working directory at once, run... If you like, you can run this line now to see the error message that appears when you try to OCR text that has already been OCRed (press Control + C to stop the code if you don't want to wait for it to go through all of the pages).
P22 Could add a sentence to encourage users to look in their working directory and confirm that a new file has indeed been added
P23 This processed all of the files, but it also output an error message that might be worth calling out: Syntax Warning: Invalid Font Weight
P29 This line also returned Syntax Warning: Invalid Font Weight. By "image extraction," are you referring to the images of the scanned pages themselves and not images (like illustrations or photographs) that might be present in some of the pages? Originally, I thought extracting images referred to the latter, but looking at the PNGs that this line of code returned, it looks like it took each PDF and turned each page into an image file. Since there are other Programming Historian lessons that talk about image extraction in the sense of extracting illustrations, you might add a note here that explains that in this context, extracting images from PDF files means turning each page into an image file.
P31 Might add a note that this step could take a few minutes
P36 Typo: "This will download all English..." You could include a parenthetical to say how many texts (340) are being downloaded as a way for users to make sure the code grabbed them all.
P37 Might specify that 340 text files will be generated, along with "list_of_files.txt," a document that includes the names of all of the extracted texts.
P38 For the last bullet point, you might add a sentence about how to interpret/evaluate/explain the results since each run could be different.
P39 In anticipation of things users might do accidentally, you could include a parenthetical note after "all 340 text files" to remind users not to include "list_of_files.txt." Also, is it possible to hyperlink "example Corpus" so that it goes to the stoplist you mention?
P40 Typo: "evaluate the Topic Model and its thirty topics."
P46 It would be helpful to have a sentence here that explains what's in those documents - why should someone read them, what do they say about PDFs (do they describe how they're created or do they focus on working with or archiving them)? Depending on the answers to those questions, this might be better as a footnote rather than as a concluding remark.

amsichani commented 4 years ago

Fantastic @cderose ! Many thanks for this. Waiting now for the review from @jackpay . Once received and in line with our editorial guidelines, I will aim to summarise the reviews as soon as possible and then @maehr could proceed with necessary revisions.

jackpay commented 4 years ago

As per the guideline I include a summary of my main observations, followed by any point specific observations, edits, points. Thank you very much for this opportunity.

Summary:

Navigating file structures: It might be worth spending a little time in intro paragraphs to talk about navigating file structures on the command line and getting people comfortable so they don't get lost later on.
e.g. cd ~/ takes you to your home directory, ./is your current directory, ~/Downloads takes you to Downloads. This can then be used as a lead in to creating a different working directory other than Downloads. i.e. Run this command mkdir ~/PDF2text - which will create a working directory in your home folder. Now navigate to this directory cd ~/PDF2text.
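The navigation steps suggested above could be sketched roughly as follows. This is only an illustration: `PDF2text` is the directory name proposed in this review, and the lesson itself may settle on a different one.

```shell
# Sketch only: PDF2text is the working-directory name suggested in the review.
cd ~/                  # go to your home directory
mkdir -p ~/PDF2text    # create the working directory (-p: no error if it already exists)
cd ~/PDF2text          # move into the new working directory
pwd                    # print the current directory to confirm where you are
ls ./                  # list the contents of the current directory (./)
```

Running `pwd` after each `cd` is a cheap way for readers to check they have not got lost.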

Topic Modelling: It might be worth mentioning a little on topic modelling early on. If anything, just to establish that we are moving away from the typical way in which we as humans understand topics and toward one in which documents are generated (i.e. a generative model for documents) from probabilistic distributions over words, which are defined as topics.

2) Possible edit: You don’t have access to commercial software, such as Adobe Acrobat Professional or Abbyy FineReader.

5) Perhaps establish a working directory that is not Downloads to establish a sensible process when they do this for themselves. For example, create one in their home directory.

8) Provide a link back to the 5 in case they skipped past or missed the links and advise regarding Linux on Windows.

11) Typo: Change: you will include one more file to our corpus To: you will include one more files to our corpus

12) This sentence could use a little clarification and/or a link: 'To separate the two operations - processing PDF files and Topic Modelling - and avoid confusion, do this later in the lesson.'

18) In the code snippet, change cd ./Downloads to cd ~/Downloads - this ensures that wherever they are on the file system they will navigate to home -> Downloads.
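A small sketch of the difference between the two forms, assuming a scratch directory (created here with `mktemp`, purely for illustration) stands in for "somewhere else on the file system":

```shell
# Sketch: a throwaway directory stands in for any location outside the home folder.
mkdir -p ~/Downloads               # make sure ~/Downloads exists for this demo
elsewhere="$(mktemp -d)"           # hypothetical starting point, not part of the lesson
cd "$elsewhere"

# Relative path: only works if a Downloads folder exists in the current directory.
cd ./Downloads 2>/dev/null || echo "relative path failed: no Downloads folder here"

# Path anchored at the home directory: works from anywhere.
cd ~/Downloads && pwd
```

The relative form depends on where the reader happens to be, while the `~/` form does not.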

21) Similar to point made in summary regarding navigating file structures and getting them used to moving around on the command line.

31) Maybe clarify that the wildcard operator is * and therefore *.png is saying all files with any file name that only have the suffix .png. May also want to specify a directory in the command ~/Downloads/*.png.
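To illustrate the wildcard behaviour described in this point, here is a minimal sketch that uses made-up file names in a scratch directory, so that no real files are touched:

```shell
# Sketch: hypothetical files in a temporary directory, not the lesson's data.
demo="$(mktemp -d)"
touch "$demo/page-1.png" "$demo/page-2.png" "$demo/notes.txt"

# *.png matches every file name that ends in .png, whatever comes before it.
ls "$demo"/*.png

# Count the matches to confirm that notes.txt was not included.
count=0
for f in "$demo"/*.png; do
  count=$((count + 1))
done
echo "$count PNG files matched"
```

Prefixing the pattern with a directory, as in `~/Downloads/*.png`, restricts the match to that directory regardless of where the command is run.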

38) 'Only the frequency of words in a document or corpus is measured.' This is not true for LDA. The frequency of words matters greatly but more importantly it is capturing topics through co-occurrence of words. i.e. words appearing together across documents increase the likelihood of that topic existing in a document and ultimately the corpus.

'Each word has a probability to belong to one or more topics. The algorithm finds the corresponding probabilities of the individual words.' Technically all words appear in every topic with some probability, but a word's probability is higher in some topics than in others, and the high-probability words are what define a topic.

40) Typo: ...explore and evaluate the Topic Model ant its thirty topics... Should be: ...explore and evaluate the Topic Model and its thirty topics...

amsichani commented 4 years ago

Thanks @jackpay for this! @maehr, given that the two reviews are written in detail, I think there is no point in my summarising and rewriting all the points / remarks. It'd be good if you try to address all of them, as they are to the point. Do you have an estimated date when you will be able to provide an updated version of the lesson?

maehr commented 4 years ago

Thank you all very much. I agree, a summary is not necessary. I will have uploaded a revised version by Monday 2.12.2019 at the latest.

amsichani commented 4 years ago

Fantastic @maehr! Looking forward to the edited version of the lesson. If you have any questions, or if there is anything you need to clarify with the reviewers, please don't hesitate to ask me here.

maehr commented 4 years ago

@cderose I have implemented most of the amendments. I am in contact with the ILO archive to see if we can offer a pre-processed corpus for download. @jackpay I have implemented most of the amendments.

maehr commented 4 years ago

@amsichani The ILO archive has not yet replied to me regarding a publicly distributable pre-processed dataset (see @cderose commentary below). Otherwise the article has been revised. How should I proceed?

3. P36 & P37 - Would it be possible to include a pre-processed dataset for downloading? While it's very realistic that this work takes several hours, that asks a lot of users working through a tutorial since it essentially ties up their computer. If possible, it might be more effective to have users run the code on a subset of the documents to confirm they can OCR and extract text successfully. After that, if they could download the already fully processed dataset, they would be able to move on to the topic modeling portion without a significant delay.

amsichani commented 4 years ago

Hi @maehr , sorry for the delay (I was on strike). Many thanks for the updated version of the lesson --looks great overall!

One thing I'd say would be great to add is a couple of the clarifying instructions that @cderose mentions, in order to guide the user through a complex procedure:

P37

Might specify that 340 text files will be generated, along with "list_of_files.txt," a document that includes the names of all of the extracted texts.

P38

For the last bullet point, you might add a sentence about how to interpret/evaluate/explain the results since each run could be different.

P39

In anticipation of things users might do accidentally, you could include a parenthetical note after "all 340 text files" to remind users not to include "list_of_files.txt." Also, is it possible to hyperlink "example Corpus" so that it goes to the stoplist you mention?

I would say let's wait a couple more days for the ILO to reply; could you send them a reminder? In the meantime, while you finalise the lesson, I'll need a one-sentence bio and ORCID from you. I am then preparing the files for @svmelton to PR over to the other repository. I am optimistic that the lesson could go live before xmas.

maehr commented 4 years ago

Hi @amsichani

In P12 I tried to help the user navigate the filesystem. Do you expect more specific explanations?

P37 and P39 The intermediate file "list_of_files.txt" is not generated anymore and I hyperlinked the stopwords. I mention the 340 files in P38.

P39 I included remarks regarding topic modeling and "stable" results in P40 and P41.

The ILO got back to me with more specific questions. Hopefully we can put the dataset on Zenodo. I should be able to get a definitive answer before Christmas.

My ORCID is 0000-0002-1367-1618. My bio: "Moritz Mähr investigates the history of computers and migration at ETH Zurich in Switzerland."

amsichani commented 4 years ago

Fantastic @maehr! Zenodo should work fine, and we are currently exploring this option for hosting large assets, so this might be an interesting case study for us. I will now continue with transferring the lesson to our managing editor; please do let us know how you progress with the ILO.

amsichani commented 4 years ago

Hi @svmelton ,

Here are the lesson files you'll need:

lesson file - /lessons/working-with-batches-of-pdf-files.md
images folder - /images/working-with-batches-of-pdf-files
gallery image - /gallery/working-with-batches-of-pdf-files.png
gallery original - /gallery/originals/working-with-batches-of-pdf-files.png

Please note that there isn't an asset folder for this lesson; instead we are still waiting for a dataset to be deposited on Zenodo. @maehr will let us know when it is up, and I guess he will also need to update the lesson accordingly.

also note:

- name: Moritz Mähr
  team: false
  orcid: 0000-0002-1367-1618
  bio:
      en: |
        Moritz Mähr investigates the history of computers and migration at ETH Zurich in Switzerland.

Let me know if I'm missing anything.

amsichani commented 4 years ago

Hi @svmelton, do we have a timeline for the publication of this lesson on your side (I know you have a lot on your plate right now in terms of pubs)? Also, @maehr, do we have an update on the ILO front? Thank you all for your hard work on this, and let's try to publish before xmas(?!)

svmelton commented 4 years ago

Hi @amsichani—I'm just waiting on the dataset, and then we can move forward with publication. Thanks!

amsichani commented 4 years ago

Fantastic @svmelton - we are now waiting for an update from @maehr on the ILO dataset so we can move forward.

maehr commented 4 years ago

> Fantastic @svmelton - we are now waiting for an update from @maehr on the ILO dataset so we can move forward.

I sent out a reminder one week ago (and again today). I hope I get an answer before xmas.

maehr commented 4 years ago

@svmelton The ILO got back to me and I was able to publish the dataset on Zenodo https://doi.org/10.5281/zenodo.3582736. I added the link to the dataset to the lesson. IMO we can move forward.

PS: Sorry, my commit message closed the issue automatically.

amsichani commented 4 years ago

Many thanks for this @maehr ! Great work! @svmelton do let me know if you need anything else from me at this point!

svmelton commented 4 years ago

Excellent! I'll work on it this weekend and ping y'all if I need anything.

svmelton commented 4 years ago

Hi all—I've just run through the lesson, and it looks good! @amsichani—we're just missing a bit of metadata (reviewers, editors, review ticket, difficulty, activity, topics, abstract, avatar_alt).

amsichani commented 4 years ago

Happy New Year everyone! Many thanks for the heads up @svmelton! I'm now working on these. @maehr, could you provide me with a short abstract / description of the lesson (have a look here: https://programminghistorian.org/en/lessons/ )?

maehr commented 4 years ago

> Happy New Year everyone! Many thanks for the heads up @svmelton! I'm now working on these. @maehr, could you provide me with a short abstract / description of the lesson (have a look here: https://programminghistorian.org/en/lessons/ )?

@amsichani I have tried to capture the essence. Feel free to enhance or correct my version.

Learn how to perform OCR and text extraction with free command line tools like Tesseract and Poppler and how to get an overview of large numbers of PDF documents using topic modeling.

amsichani commented 4 years ago

Many thanks @maehr . @svmelton I have now updated the lesson with the necessary metadata (I am not sure I get the avatar_alt) and I think we are ready to go - many thanks for your cooperation!

spapastamkou commented 4 years ago

avatar_alt = the title of the lesson's avatar image, once the image has been selected by the editor (as per this PR)

amsichani commented 4 years ago

Hi @svmelton , all metadata is now in place.

svmelton commented 4 years ago

Thanks so much, @amsichani! Exciting news: we'll be able to pilot our new external copyediting with this piece! We're getting it set up, but I'll let you know ASAP when we have a timeline. Thanks for everyone's patience; I'm excited to have this piece as our first professionally copyedited publication!

acrymble commented 4 years ago

Thanks for your patience everyone. The copyeditor has now had a chance to look through the text and make suggestions based on the styleguide. I've attached the PDF with comments to this ticket. Her instructions include:

I've underlined in red instances where changes are needed or suggested. I've used yellow notes to give the detail on each instance. One note is purple and that's because it includes a suggestion about adding a preferred written-out date format to the style guide (just to flag to your attention).

This is our first copyedited lesson, so I think the best thing is for @amsichani and @maehr to incorporate the suggestions and discuss between themselves anywhere they disagree or need further conversation. Once you're both happy you can proceed with the rest of the publication process.

Attachment: Working with batches of PDF files _ Programming Historian - copy edit 3.pdf

maehr commented 4 years ago

Hi @amsichani and @acrymble, I really love the change requests made by the copyeditor. As a non-native speaker, this is a blessing! I corrected everything according to the notes. The last section (Mueller Report) needs some more attention. Thanks a lot!

PS: I forgot that I can push my changes directly and opened (and closed) a pull request. Sorry for the inconvenience.

amsichani commented 4 years ago

Many thanks for your patience and cooperation @maehr and @svmelton & @acrymble for navigating us through the copyediting process -- this is exciting! I am happy with the changes that @maehr has incorporated and if there is no other comment, I think @svmelton we are ready to go live!

svmelton commented 4 years ago

Fantastic! I will work on this over the next couple of days and ping you if I have any questions. :)

svmelton commented 4 years ago

And we're published! Thanks to everyone for your work, I'm excited to see this live!

programminghistorian / ph-submissions

Review Ticket: Working with batches of PDF files #258