Extracting Illustrated Pages from Digital Libraries with Python

programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons

http://programminghistorian.github.io/ph-submissions


Extracting Illustrated Pages from Digital Libraries with Python #193

Closed alsalin closed 5 years ago

alsalin commented 6 years ago

The Programming Historian has received the following tutorial on 'Extracting Illustrated Pages from Digital Libraries with Python' by @StephenKrewson . This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/extracting-illustrated-pages

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

alsalin commented 6 years ago

@StephenKrewson we're up! i hope to get through the whole thing in the next week and type up some initial comments.

alsalin commented 6 years ago

@StephenKrewson What fun! I ran this on my mac with no issue though I did have some suggestions for ease-of-use and sustainability/accessibility as well as a few copyediting suggestions. I mostly stuck to ensuring the lesson would run well and the main instructions were clear. I have to say I really like how you've connected this lesson to previous ones for additional information and loads of links to additional resources for each of the tools.

P35: Add a line between commands "conda create..." and "source activate..."
P37: Here we install pip, but we are already referencing it earlier in P34, is that correct?
P45: Move the code to a code block rather than using an image to illustrate the code you want the user to run. I'd reserve screenshots for illustrating difficult to navigate interfaces or complex visual outputs of code, but not for code blocks themselves.
P47: I'd clarify just by adding a statement to navigate in the terminal to the folder the user just downloaded and then run the command, which btw, for me was "jupyter-notebook" for reasons I can't pretend to understand. Took me a minute to figure that one out but I had to go into the jupyter files and anyways, may be a useful note
P55: Remnants of my technical communications background, a suggestion on listing menu directions: Cell > Run all
P57: HT section; do you want the user to run the notebook?
P58-59: You give detailed instructions on the HT sign up but not here, I'd just add a line that folks have to sign up and verify their email to make the sections match.
P68: There aren't really instructions on how to make collections following the link you provide.
P77: First line "... than are shown"? I think you could connect the next sentence better to this one as well as it feels a little fragmented.
P91: Suggestion> "The idea is to get all the good data we can, not to clean up inconsistencies or gaps in the item metadata."
In general: consider using some alerts for your important notes to call attention to them.

Let me know if you have any questions about any of these suggestions! Once we complete this initial round of revisions, we'll invite some reviewers to take a look.

StephenKrewson commented 6 years ago

Awesome! Really grateful for this feedback @alsalin. I should have these done shortly.

StephenKrewson commented 6 years ago

@alsalin Edits done, a few alerts added. Thanks!

alsalin commented 6 years ago

Everything is rendering well now and looks like it's in great shape for us to start the open peer review. I'll update soon when reviewers are confirmed.

alsalin commented 6 years ago

@cderose has agreed to review this piece with a deadline of 11/1/18 and I'm still waiting on another reviewer to confirm their availability to review.

I've lifted this nice overview of the review process from Amanda Visconti as I think it summarizes the process quite nicely:

"Members of the wider community are also invited to offer constructive feedback here on this GitHub ticket, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (https://programminghistorian.org/editor-guidelines#anti-harassment-policy). Unlike other peer review processes, we're focused on maximizing the potential of each lesson through kind and generous discussion. (We do not solicit reviews to judge whether a tutorial is “good enough” to be published.) Our guidelines explains what to comment on as a reviewer, and our philosophy about the spirit in which you provide feedback.

Formal reviewers will be credited for their work at the top of the published lesson page, and informal reviewers may be credited elsewhere in the lesson.

We will ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred."

alsalin commented 5 years ago

@statsmaths has agreed to be our second reviewer for this lesson with a due date of 11/19/18.

Many thanks to our reviewers for assisting us with this lesson.

As a note - we should wait until all reviews are in before making any changes to the lesson.

statsmaths commented 5 years ago

Thanks to @StephenKrewson for putting this lesson together. I think it is very important to engage with visual materials. Showing how to grab image data from HT and IA is a great way to get more DHers involved with image corpora. Overall, the notebooks themselves are great resources and easy to use (though you should make sure to clear the output before publishing the notebook files; the versions online have some output and AssertionErrors in them, which could be confusing). I do, however, think the lesson itself needs some work to get it into a form that would be most useful for people trying to learn this material.

Most of my comments about the lesson center on the intended audience level. The lesson floats back and forth between talking to a complete beginner and addressing someone who is very familiar with Python. In P13, it says that:

You need to know the basics of how to use the command line and Python. You should understand the conventions for comments and commands in a command line tutorial.

If that is the intended audience, it would seem that you could cut out all of the information about setting up Miniconda, navigating the terminal, and how Jupyter notebooks work. Anyone who knows the basics of Python and the command line should be very familiar with these already (and a link to an external resource should serve anyone who needs a refresher).

On the other hand — if you want the resource to be accessible to a wider audience — it would be necessary to greatly simplify the introduction and avoid unneeded prerequisites and details. For example, you should describe what an 'API' is and avoid things such as needing to run code in the terminal. Python modules can be installed using the Anaconda GUI; otherwise, you could still instruct users to run commands in the terminal, but there is still no need to know anything about the terminal beyond how to open it. Anaconda can be started through the Anaconda Navigator and users can navigate to the downloaded files within the navigator. Also, there is no need here to make use of Python environments, which add a significant level of complexity.

Regardless of the audience in mind, I would also suggest streamlining out all of the off-handed comments about programming style in general. Examples of things that could be removed and make the piece easier to read: the comment (P10) about GitHub issues, syntax highlighting (P14), the Python REPL (P47), and the general usage of git (P64).

Finally, I found the "How Are Visual Features Obtained?" (P20-P33) very difficult to understand and slightly out of place. Even after several re-reads, I cannot figure out what is meant by the first two sentences in the second paragraph:

HathiTrust makes a field called htd:pfeat available for many of its public-domain texts. This field’s type is list and it exists within a Python object that is associated with each volume

Perhaps this would make more sense at the end of the piece once you've introduced the code and data? At the very least it should come after all of the set-up material.
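One way to picture the quoted description is the minimal sketch below. It assumes a simplified, made-up layout for a volume's page metadata: the `pages` and `seq` keys, the volume ID, and the `illustrated_pages` helper are all hypothetical, and the real HathiTrust object is nested differently. Only the `htd:pfeat` field name and the `IMAGE_ON_PAGE` tag come from the lesson itself.

```python
# Hypothetical, simplified stand-in for one volume's page-level metadata.
# The real HathiTrust response is structured differently; only the
# "htd:pfeat" key and the IMAGE_ON_PAGE tag are taken from the lesson.
volume = {
    "id": "example.volume001",  # made-up volume ID
    "pages": [
        {"seq": 1, "htd:pfeat": ["FRONT_COVER"]},
        {"seq": 2, "htd:pfeat": []},
        {"seq": 3, "htd:pfeat": ["IMAGE_ON_PAGE"]},
        {"seq": 4},  # assume some pages lack the field entirely
        {"seq": 5, "htd:pfeat": ["IMAGE_ON_PAGE", "CHAPTER_START"]},
    ],
}


def illustrated_pages(volume, tag="IMAGE_ON_PAGE"):
    """Return sequence numbers of pages whose feature list contains `tag`."""
    return [
        page["seq"]
        for page in volume["pages"]
        # .get() with a default keeps pages without the field from raising
        if tag in page.get("htd:pfeat", [])
    ]


print(illustrated_pages(volume))
```

Read this way, "a field whose type is list" simply means each page carries a list of feature tags, and finding illustrated pages is a membership test on that list.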

Suggested steps: In summary, I think the general purpose of the tutorial is great and the notebooks are excellent resources. The next step would be to settle on what audience the tutorial is aimed at and to then make significant cuts to the existing material. If @StephenKrewson wants to keep the audience level to experienced Python users, I think a good outline would be:

Introduction & Goals (current P1-P6)
Set-up (just a few paragraphs explaining how to download the notebooks, what Python libraries are needed, and how to get access keys for the two repositories)
A short added section that demos the functions before the deep-dive into the functions and their motivations
The current second half of the tutorial, P66-103

StephenKrewson commented 5 years ago

@statsmaths -- wow, thank you, Taylor! This is so thorough and clarifying. Great points. Hope to incorporate them soon.

I think I have a way to better explain HT/IA visual features by way of a comparison with Europeana/LOC newspaper databases (with detailed layout metadata for illustration regions in METS/ALTO format) and, at the other extreme, raw scan data that has not been OCR'd or OLR'd (with docWorks, etc.) at all.

This lesson is aimed at something in the middle: digital libraries with limited or non-localized visual metadata.

I'll also do a better job breaking down HathiTrust's metadata objects. Given the space that will be taken up by additions and clarifications such as these, I think the better strategy is to focus on the experienced Python user, as you suggest in your helpful closing outline.

alsalin commented 5 years ago

@statsmaths thanks very much for your thoughtful review!

Just a few reminders - we have one more review out. Once the second review is in, I'll summarize the reviews and provide some guidance on revisions at that stage. @stephenkrewson - before you make any adjustments to your lesson, let's wait for everything to come in and for me to have a chance to review the feedback as well.

cderose commented 5 years ago

I second the thanks for this lesson, @StephenKrewson. By tackling both HathiTrust and Internet Archive, the lesson opens up so many possibilities for creating image corpora — I'm excited to start mining illustrations! I appreciate that the introduction clearly sets out what the lesson will and won't help with, and the links throughout to other materials (PH lessons, documentation, etc.) are extremely helpful. The Jupyter notebooks ran smoothly on my macOS with High Sierra (I second @statsmaths note about clearing the output to avoid potential confusion). The notebooks are a terrific takeaway resource from the lesson.
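For anyone wondering what "clearing the output" amounts to, here is a rough sketch. It assumes the notebooks are ordinary version 4 `.ipynb` files, which are JSON documents with a top-level `cells` list; the `clear_outputs` helper and the tiny example notebook are made up for illustration, and Jupyter's own menu commands for clearing output are likely the easier route in practice.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy via JSON
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, errors, images
            cell["execution_count"] = None  # reset the In [n] counter
    return cleaned


# Tiny made-up notebook with one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
            "source": ["print('hi')"],
        },
    ],
}

cleaned = clear_outputs(example)
print(cleaned["cells"][1]["outputs"])
```

To apply this to a real file, one would load it with `json.load`, pass the result through `clear_outputs`, and write it back with `json.dump`.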

Since Taylor already posed questions around audience, I'll center mine around workflow. I wonder how the lesson would read if all of the HT steps were bundled together, followed by all of the IA steps, particularly in the middle section (around P49-77). This streamlining could help readers focus on working fully through the idiosyncrasies for one collection before turning attention to the other, and then finishing with the section where you pull everything together and highlight the similarities and differences. It could also help with skimmability, if someone turns to the lesson purely for HT or IA.

Separating the HT and IA sections might also give a different tempo to the lesson. I opened the Jupyter notebooks at P49 as instructed, but I didn't actually work through them until I reached Next Steps at P101 because I wasn't sure if there was more instruction I needed first. If all the HT materials are grouped together, you could then encourage readers to run through the HT notebook's cells before turning to IA (and then similarly, you might have a note to run the IA notebook before P77). This could help contextualize the deeper dive into the code that follows, and it would mean readers see the code in action and have that very rewarding "I just extracted over 100 images in minutes" feeling that much sooner.

And then here are a few light copy edits/suggestions:

P22 "In practice, there are quite a few"
P42 move this paragraph to before the code block so it's in mind before executing the code
P51 I know the Jupyter section may be condensed depending on anticipated audience, but if this screenshot stays in, you might swap it with one that only shows the json file and two notebooks since that's what readers will have in their directory at that time
P78 "in the subdirectories than shown"
P80 and 82 are great illustrations of what the lesson achieves; you might consider moving them up toward the beginning section as a preview of what's to come
P94 "keeps track of the num*eric page"
P94, 97, and 100 look like they're missing the code block styling

I'm happy to clarify or expand on any of the above and am super excited for this lesson, Stephen. Thanks to @alsalin for getting all of us together!

alsalin commented 5 years ago

many thanks to @cderose and @statsmaths for these timely and thoughtful reviews!

@StephenKrewson, responding to Taylor's point about audience, where are you landing now on whether the piece should be geared towards beginners or experienced users? Full disclosure - I was a python newbie several years ago but don't do anything with it now. I felt I was born with the command line knowledge in the womb though, so I felt comfortable navigating that. With previous python experience and command line experience, I felt the tone was quite ok as I'm familiar with the technical background but don't have enough strong python experience with all these new toys (kids these days! I'm aging myself and my python knowledge). I am thus grateful for the expertise of our reviewers in this regard and should you choose to go one way or another, the lesson may need significant edits.

That being said, I wonder if it is possible to strike a middle tone. I finished a summary of a community survey that we began some months ago and many readers wanted lessons or series of lessons that span technical levels if possible. Given that some readers look over a lesson quickly as a reference and other users are using it to learn something for the first time - I wonder if keeping the "beginners" material would be ok while streamlining a path for those more knowledgeable at the beginning with maybe quick access anchor links. Perhaps giving a path at the beginning to jump straight to the info that Taylor suggests. What do you think? I'm open to all approaches.

As someone who loves streamlining, I appreciate @cderose's comment about bundling HT and IA sections together and of course the line edits are most appreciated. I would recommend following Catherine's suggestion of bundling the sections as I agree that it changes the flow of the lesson to be a little clearer.

Other than the question of audience, I think that the comments here are straightforward. What are your thoughts @StephenKrewson ?

A Note - after we decide on a path forward, the author will have about a month to complete edits and then I'll do another pass of the piece for copyediting. Many many thanks to @cderose and @statsmaths for your reviews and to @StephenKrewson for creating such a great lesson!

StephenKrewson commented 5 years ago

Hi All! Thanks again for the excellent feedback @statsmaths and @cderose. As always, thanks to @alsalin for excellent coordination and synthesis. I've been lingering with your respective ideas for the last few days.

At a high level, I want to follow Anandi's suggestion to find a middle way (and divide into HT/IA sections as Cathy advised). The multiple use cases described in PH's survey fit with my own experience with tutorials. I want both the minimal working example as a reference AND a sense of how a more experienced researcher has set up their development environment and utilized efficient programming techniques. Tutorials are a key venue (for me) for picking up this knowledge, given that I don't work at a job where more senior developers can review my code, etc.

However, my current strategy of interspersing the text with asides about programming style is not the best, as Taylor rightly points out. I need to simplify instead of including asides based on, in most cases, missteps that I made in setting up the project!

I'm going to write a revised and streamlined version of the lesson, broken up by IA and HT. It will assume Python and command line knowledge. I will clean the notebooks and make the copy edits. Then, I will note steps that call for further elaboration or raise interesting computational issues. I'll work with @alsalin to either reference outside resources or pull this material into alerts or parallel tracks (indicated paratextually/visually). Perhaps one track for beginners and another for more advanced users.

alsalin commented 5 years ago

@StephenKrewson This is great! I support the work and would be happy to look at any ideas as you put them together. I'm curious what @statsmaths and @cderose think of the revised approach?

cderose commented 5 years ago

The revised approach sounds great to me, too! Having even more discrete section headers could be one way of striking that middle ground for audiences with different levels of familiarity. For example, something like "Setting up Jupyter notebooks" could give contextual information to people who haven't run them before or would like a refresher, whereas such a section title would also call out that someone very experienced with Jupyter notebooks might jump ahead.

I second @alsalin and would be glad to chat further, if that's ever helpful, @StephenKrewson.

statsmaths commented 5 years ago

I also think that breaking up the tutorial in HT/IA sections is useful (and if one or the other breaks at some point in the future, at least the remaining section still holds). I think the suggestions by @cderose to include more discrete/informative section headers would also be very helpful.

Apologies for being a pain about this, but I'm still confused about what the level of this tutorial is going to be. In the revised approach outlined (the most recent post by @StephenKrewson), it says that the revision will "assume Python and command line knowledge." Does this mean that the current instructions for installing Python and setting up the command line will be removed? If it is all removed, I'm not quite sure I understand how this is a middle road approach. If it's kept, I'm not sure why the revision says that this information is assumed.

In the end, I think you could go either way with the Python material (as intro or advanced; broken into as many different tracks as needed) and it would be a great tutorial. It is just still not clear to me what approach is being proposed. On the other hand, the command-line material seems tangential at best. I would either avoid it entirely or use the command line only as an easy way to install the additional Python libraries (everything else can easily be done with the simple Jupyter notebook GUI).

StephenKrewson commented 5 years ago

@statsmaths no apology needed--these are the right questions! Yes, much of the command-line material will be trimmed. Charlie Harper's lesson is a good model for how to expeditiously move through the instructions for Anaconda/environments. What do you think of it?

https://programminghistorian.org/en/lessons/visualizing-with-bokeh#getting-started

This lesson is rated as "medium" difficulty (fwiw, I think it's a fairly intense "medium"). On this scale, my lesson would also be "medium."

As I see it, there are two competing concerns. On the one hand, a minimal, platform-agnostic set of instructions for Python (focused on dependencies) has many advantages. It's more readable and won't easily fall out of date; advanced users can simply work within their dev environment of choice. It's also not the author's job to anticipate or troubleshoot issues with atypical setups. (or is it??)

On the other hand, I disagree with @statsmaths somewhat about the general ease of use of the Jupyter notebook GUI. Unless I'm missing something, I think you're recommending that beginners simply launch Jupyter (installing all the packages globally) and then do all their data downloading/exploration within notebooks. However, it can be confusing to figure out how to access the file system from within notebooks, and this kind of project necessarily involves lots of file manipulation and storage (best done from the CLI). The perceived friendliness of Jupyter may also stem from prior knowledge of Python. Finally, Windows/*nix adds a whole other layer to this.
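As a rough illustration of the file-system point, the sketch below shows the kind of bookkeeping a notebook cell would need for this. It is not code from the lesson: the `ensure_dir` helper, the `items` folder name, and the placeholder file are all hypothetical, and a temporary directory stands in for a real download folder so nothing is left behind.

```python
import tempfile
from pathlib import Path


def ensure_dir(base, name):
    """Create the folder `name` under `base` if needed and return its path."""
    target = Path(base) / name
    target.mkdir(parents=True, exist_ok=True)
    return target


# Relative paths in a notebook resolve against the kernel's working
# directory, which is typically the folder containing the notebook.
print("Working directory:", Path.cwd())

with tempfile.TemporaryDirectory() as scratch:
    # "items" is a hypothetical folder for downloaded page images.
    out_dir = ensure_dir(scratch, "items")
    (out_dir / "page_0001.txt").write_text("placeholder")
    listing = sorted(p.name for p in out_dir.iterdir())
    print(listing)
```

Using `pathlib` rather than hand-built path strings also sidesteps some of the Windows/*nix separator differences mentioned above.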

To sum up: once the decision is made to use Anaconda/Jupyter, a whole bunch of preliminary steps come into play. I'm tempted to just drop them, but I feel conflicted about not warning readers about some of the pitfalls that tripped me up.

tl;dr in the absence of One True Way to set up a Python environment but in recognition of the fact that Jupyter notebooks sit at the top of a fairly complex tool chain, I will pitch my lesson to medium to advanced Python users and use Charlie Harper's lesson as a model for how much time to devote to setup.

statsmaths commented 5 years ago

Thanks for the clarification. The reference to the bokeh lesson was particularly helpful in understanding what you're proposing. Basically, it will assume a fairly good knowledge of Python and general computing, but will also include an (easily skipped) section on installing Python and the required packages. I think that's appropriate and makes sense as a plan going forward.

alsalin commented 5 years ago

Thanks all for the discussions and clarifications! I've spoken with @stephenkrewson and we've set a deadline for revisions for mid-December. Again, many thanks to the reviewers for your time and helpful comments!

alsalin commented 5 years ago

Looking good @StephenKrewson! I have a recommendation as I think forward to sustainability. In the introduction, while you mention a subsequent lesson, it may be best to leave that section more vague - something like "this lesson does x, not y," not so much "this lesson does x and subsequent lessons will do y." Does that make sense? That way this lesson can stand on its own without confusing users about future lessons that we don't have timelines for. In the notes below, I point out a few more places where this generalized language would be good to have.

Some copyediting requests:

Check for line spacing - I noticed a few areas of lines without spaces between them and I couldn't tell if those were separate paragraphs or not.
Under "Goals": Save and iterate over a list of HT or IA volumes ids generated by a search - use IDs
In Software Requirements: (and this is a matter of personal style) you can remove the comma after "API endpoints will need to be updated"
In Software Requirements: "it's a good idea to check if the API or the wrapper library has updated or been disabled in some way" could be "it's a good idea to check if the API or the wrapper library has been updated or disabled in some way"
In Software Requirements: "Many projects practice "semantic versioning," about which you can read more [here]." Could be better rendered for accessibility concerns to be something like "Many projects practice "semantic versioning," about which you can read more [at semver.org]."
In Suggested Prior Experience: For accessibility concerns, you may want to write out the lesson title in the last line of this section and then link the lesson title
In Comparison to Similar PH lessons: First paragraph, "--" could be —
In How are Visual Features Obtained?: You can remove comma from first paragraph after "The most recent documentation for the Data API describes htd:pfeat on pages 9-10"
In this same section, consider removing direct mention of future lesson in the sixth paragraph of this section.
In this section, paragraph starting with "In all likelihood, the IMAGE_ON_PAGE": In the second sentence, "This is a good segue" could be "This feature is a good segue". In this same paragraph, you have a statement that reads "I am compiling data on the version of Abbyy FineReader used in OCR-ing nineteenth century medical texts held in IA." I'm not sure this is helpful here since it only mentions your work at the time of writing and takes away from the flow of the paragraph
In this section, paragraph starting with "Part of the intellectual fun of this lesson": "uninillustrated" --> "unillustrated".
In this section, next paragraph: use lesson title instead of "please see [this lesson]" and then link lesson title. Same paragraph: "This ends up affecting" could be "These errors end up affecting..."
End of this section: remove direct mention of subsequent lesson. Also "This is not correct" could be "This functionality is not correct"
In section Install Miniconda: Use lesson title for link in "Download and install Miniconda [here]."
In this section, use alert for "Important!" paragraph
In Environments: use lesson name for link in sentence "A handy cheatsheet..."
In Conda Installs: "then press Enter" should be "then press Enter" where "Enter" is bold to indicate a keystroke
In Download Lesson Files: For "Simply download the following compressed [folder]" you can use something like "Simply download the following compressed folder from the [Programming Historian's GitHub Repo]"
In Get API Keys: HathiTrust: Use alert and new paragraph for "Careful!" statement in first paragraph, also use an alert for the subsequent note.
In Get Item Lists: Motivation: Last paragraph, "--" could be —
In Get Item Lists: HathiTrust: use the word "collection" for the link to the collections page
In this section before the screenshot: Remove line that read "This is a little bit tricky" and add to the previous sentence: "If you wanted to use the file from your own HT collection, you would navigate to your collections page and hover on the metadata link on the left to bring up the option to download as JSON as seen in the following screenshot"
In this section after image: "When the JSON file has downloaded" could be "When you have downloaded the JSON file"
In this section for image: add caption to act as alt-text
In Code Walk-through: JPEGS should be JPEGs and "ids" should be "IDs"
In this section, add captions for image to act as alt-text
In Shared Code, second paragraph: "ids" should be "IDs"; "id" should be "ID"
In Differences between HT and IA: Assembling img_pages list: "the main difference is" should this be "the main difference between HT and IA is"?
In this section, in paragraph "Once we have the file,": "This is useful because" should be "This function is useful because"
In Differences between HT and IA: Downloading: First sentence, you can remove the ",", also in paragraph "IA's Python library does not provide", "--" could be —
In this section in the paragraph beginning with "IA has been working on an alpha version": "This is a vast improvement" should be "This version is a vast improvement"
Next Steps: Would it be possible to phrase this section generally, without direct mention to your future lessons?

Let me know if you have any questions!

alsalin commented 5 years ago

Working on moving the following files from /submissions/ to /jekyll/, just fyi since i'm also publishing the lesson:

lessons/extracting-illustrated-pages.md
images/extracting-illustrated-pages/
assets/lesson-files.zip
gallery/extracting-illustrated-pages.png
gallery/originals/extracting-illustrated-pages-original.png

alsalin commented 5 years ago

Hi all! we are live! https://programminghistorian.org/en/lessons/extracting-illustrated-pages ! Many thanks to @StephenKrewson for a fabulous lesson and to the reviewers @cderose and @statsmaths as well as @walshbr for helping with new publishing workflows. I encourage everyone to tweet out the lesson and share it among your networks to help spread the word about the resource. Thanks all!

StephenKrewson commented 5 years ago

Wonderful! My sincere thanks to @statsmaths and @cderose for their one-two punch re: organization and concision. The lesson is much clearer and slimmer thanks to you both! The detailed and responsive work of @alsalin has made it all possible. Very glad to see it go live. Thank you!

alsalin commented 5 years ago

I'll go ahead and close out this issue. Thanks again to everyone!