nationalarchives / hms-nhs-scripts

MIT License

Overview

These are the data processing scripts for the Zooniverse crowdsourcing project HMS NHS: The Nautical Health Service. HMS NHS is a Royal Museums Greenwich project and part of Engaging Crowds, a project within the larger Towards a National Collection programme.

Volunteers on HMS NHS transcribed pages from the Admissions Registers of the Dreadnought Seamen's Hospital. Each page was transcribed by multiple volunteers. These scripts reconcile the different transcriptions of each page into a single transcription. When they cannot find a reconciliation, they list all of the possibilities for a human moderator.

The general approach is to make use of Zooniverse's own Panoptes aggregation scripts for the main part of the reconciliation task, with hms-nhs-scripts doing additional data cleansing and formatting the data into two output forms:

  1. `joined.csv`: a CSV file presenting the transcriptions row by row.
    • This file is for hand checking. Unreconciled transcriptions can be reconciled by a human moderator, and reconciled transcriptions can be spot-checked.
  2. `mimsy.txt`: a text file suitable for ingest into Mimsy, RMG's cataloguing system.
    • This file is generated from `joined.csv` after hand corrections.

Example Use

It is a good idea to install the pip dependencies inside an isolated environment such as [virtualenv](https://pypi.org/project/virtualenv/). You can use `requirements_all.txt` instead of `requirements.txt` if you want to match the environment even more precisely.

You can get the Zooniverse project exports via the `Request new workflow classification export`, `Request new subject export` and `Request new workflow export` buttons in the `Data Exports` area of the Zooniverse Project Builder. You will need to download different files depending upon whether you are getting data from phase one or phase two of the project: look at the `export` fields in `workflow.yaml` to see which files to download. Phase one uses `phase1` while phase two uses `phase2`.

`extract.py` runs the Panoptes aggregation scripts, with a few cleanup interventions from hms-nhs-scripts. The output goes in `extraction`. You can observe the cleanups by comparing `extraction/..._extractor.csv.full` with `extraction/..._extract.csv.cleaned`. At the end it reports a number of exit codes: if any of these is not 0 then an error has occurred.

`extraction/postextract_*.log` contains information about possible cross-references in the original source. You can check these by looking for the "from cell(s)" string in `text_extractor_*.csv.full`. Note that possible cross-references are not deleted.

`aggregate.py` creates `joined.csv`, which joins the transcriptions of the separate columns into rows and reports additional information about the reconciliation process, such as which columns needed automatic reconciliation and cases where automatic reconciliation failed. At the time of writing, we run `aggregate.py` with `-t 0.3`. `-t` sets the threshold for accepting automatic resolution of text inputs. It ranges from 0 to 1, and lower numbers are more aggressive: lower `-t` values will accept less "certain" resolutions as correct.

Patterns such as `...` and `[ill]` are used by transcribers to indicate that they could not read some of the text. `extract.py`, `aggregate.py` and `mimsify.py` have several options; you can find out about them by running each script with `--help`.

Correcting joined.csv

Automatic reconciliation is necessarily imperfect. You can control the "aggressiveness" of the reconciler by changing the `--text_threshold` and `--dropdown_threshold` parameters of `aggregate.py`: lower numbers are more aggressive. Greater aggression will result in more reconciled cells, but also in more incorrectly reconciled cells. Run with `--help` to see other options.
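To make the threshold's effect concrete, here is a minimal sketch of threshold-based resolution. This is an illustration only, not the actual Panoptes aggregation algorithm: it accepts the most common transcription whenever its share of the votes reaches the threshold.

```python
from collections import Counter

def resolve_text(votes, threshold):
    """Illustrative only: accept the most common transcription when its
    share of the votes reaches the threshold. Lower thresholds accept
    less "certain" resolutions, i.e. they are more aggressive."""
    best, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= threshold:
        return best
    return None  # leave the cell for a human moderator

print(resolve_text(["6.0", "6.0", "70.0"], 0.3))  # accepted: 6.0
print(resolve_text(["6.0", "6.0", "70.0"], 0.9))  # None: not certain enough
```

With `threshold=0.3`, two votes out of three (a 0.67 share) is enough to accept "6.0"; at 0.9 the same cell is left unresolved.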

The recommended way to correct joined.csv is to open it in a spreadsheet. We used Google Sheets. You may run into some quirks if you use a different spreadsheet.

The original column gives you a link to the original page on Zooniverse, so that you can read it for yourself.

When correcting data, there are some formatting rules to follow. `mimsify.py` checks for problems in its input: if you make a mistake, such as typing a date or a years-at-sea value in the wrong format, it will fix it if it can, and otherwise will tell you about it.
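As an illustration of the kind of check involved (the real rules live in `mimsify.py`; the DD/MM/YYYY target format below is an assumption made purely for this example):

```python
import re

def normalise_date(value):
    """Hypothetical date check: pad a D/M/YYYY-style date out to
    DD/MM/YYYY, or return None when the value cannot be fixed
    automatically and must be reported to the user instead."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", value.strip())
    if m is None:
        return None
    day, month, year = m.groups()
    return f"{int(day):02d}/{int(month):02d}/{year}"

print(normalise_date("7/3/1870"))    # fixable: 07/03/1870
print(normalise_date("March 1870"))  # None: flag it for the user
```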

The Problems column tells you about missing data and fields that could not be autoresolved, or that need checking for other reasons, such as unusual inputs (for example, a date with the "day" part set to 0, which often indicates that the day is not given in the original document) or explicit patterns of text that mean that the transcriber was uncertain about something (for example `...`, `[ill]`, `(HMS Iphigenia?)`). To find rows that need correction, use Ctrl-Down and Ctrl-Up to leap across empty cells in this column. If there are large numbers of problems, you could consider increasing the aggressiveness of the reconciliation.
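Outside a spreadsheet, you can list the problem rows with a few lines of Python. The miniature CSV below stands in for the real `joined.csv`; its data and column names other than Problems are invented for the example.

```python
import csv
import io

# Stand-in for joined.csv; only the Problems column matters here.
csv_text = """subject,name,Problems
1001,John Smith,
1002,Jon Smyth,unresolved: name
1003,Mary Jones,
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Rows needing human attention are those with a non-empty Problems cell.
needs_review = [row["subject"] for row in rows if row["Problems"].strip()]
print(needs_review)  # ['1002']
```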

Individual cells that need correction are often easily spotted. The following example is from number of days victualled (volume 1, p. 89, admission no. 7127):

```
6.0
----------
6.0 @2
70.0 @2
```

Here, the 6.0 at the top is the script's best guess at the correct value. The rows beneath the hyphens indicate that 2 transcribers thought the number was 6, and 2 transcribers thought the number was 70. Looking at the original page image shows that the correct value is 6, so you should delete the original value of the cell and replace it with 6.
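The layout of such cells is regular enough to pull apart in code. The sketch below parses the best guess and the `value @count` candidate lines; it is based purely on the examples in this README, not on the canonical format definition in the scripts.

```python
def parse_unresolved_cell(cell):
    """Split an unreconciled cell into (best_guess, candidates), where
    candidates is a list of (text, vote_count) pairs. A candidate line
    without an '@' suffix represents a single vote."""
    guess, _, rest = cell.partition("----------")
    candidates = []
    for line in rest.strip().splitlines():
        text, sep, count = line.rpartition(" @")
        if sep and count.isdigit():
            candidates.append((text, int(count)))
        else:
            candidates.append((line, 1))
    return guess.strip(), candidates

cell = "6.0\n----------\n6.0 @2\n70.0 @2"
print(parse_unresolved_cell(cell))  # ('6.0', [('6.0', 2), ('70.0', 2)])
```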

Here is a second example from how disposed of (volume 1, p. 90, admission no. 7160):

```
<No best guess>
----------
To a/his ship cured
Request cured
Shipped
```

Here, the script has no guess at the correct value. The rows beneath the hyphens indicate that 1 transcriber thought it was To a/his ship cured, 1 thought it was Request cured and 1 thought it was Shipped (when a transcription was given only once, there is no @). Looking at the original page shows that the correct text is Request Shipped to go to his home (Cured) -- which was not an option in the dropdown used by the transcribers, so it is not surprising that they gave different answers! You should either choose one of the possible answers as the "correct" one, or else delete the original value of the cell and replace it with the actual text from the page.

Sometimes cells need correction because they contain text that looks like an indication of transcriber uncertainty (for example, ... or []). These will not stand out as much as the examples above, but should still look pretty odd when scanning by eye. It is just important to be aware that you need to look out for these kinds of things, as well as the more obvious cases -- both might exist in the same row.

The Autoresolved column lists fields where transcriptions disagreed and were successfully autoresolved. You can use this information to do spot checks that the resolver is performing correctly. If you find too many mistakes then you could consider reducing the aggressiveness of the reconciliation.

Note that the final few rows of the final page of the final volume (at the bottom of the CSV file) are blank. You will need to manually delete the various bad admission numbers/zeroes/Missing Entry values in these cells. (There are also blank rows to be found throughout phase2, but the scripts do a better job of finding and removing these for you.)

Generating mimsy.txt

Once you have corrected `joined.csv`, you can generate a file for Mimsy ingest by running `mimsify.py` to create `output/mimsy.txt`. Either put the corrected CSV file in the `output` directory, or use `mimsify.py <location>` to tell the script where the corrected file is. I recommend calling the corrected file `corrected.csv`, placing it next to `joined.csv` and running `mimsify.py output/corrected.csv`.

You are likely to see messages as `mimsify.py` runs. Warnings are for information: you may want to check what is happening, but the script can handle the input. Errors are a real problem that you should resolve. Both types of message will tell you where the problem is in the input file.

Run with `--unresolved` to allow unresolved fields and flatten them into a convenient format. If you run without this option then any unresolved fields will trigger an error message.
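The README does not pin down the flattened format, so the sketch below shows just one plausible flattening (dropping the separator line and joining the candidates onto a single line); the real shape is whatever `mimsify.py` emits.

```python
def flatten_unresolved(cell):
    """Hypothetical flattening of an unresolved cell onto one line:
    drop blank lines and the '----------' separator, then join what
    remains. Illustrative only; not mimsify.py's actual format."""
    lines = [line for line in cell.splitlines()
             if line.strip() and set(line) != {"-"}]
    return " | ".join(lines)

print(flatten_unresolved("6.0\n----------\n6.0 @2\n70.0 @2"))
# 6.0 | 6.0 @2 | 70.0 @2
```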

Run with `--help` to see other options.

About joined.csv

You can read about the data transformations involved in creating `joined.csv` in `DATA_README.md`.

The columns in joined.csv are as follows:

| Column | Header | Description |
|--------|--------|-------------|
| A | original | Link to the image of the original page on Zooniverse. Helpful for checking and reconciling transcriptions. |
| B | subject | The Zooniverse subject id for the page. |
| C | volume | The volume of the Admissions Registers that the page comes from. |
| D | page | The number of the page in the volume. |
| E | admission number | Reconciled transcription for each column of the original page. |
| F | date of entry | |
| G | name | |
| H | quality | |
| I | age | |
| J | place of birth | |
| K | port sailed out of | |
| L | years at sea | |
| M | last services | |
| N | under what circumstances admitted (or nature of complaint) | |
| O | date of discharge | |
| P | how disposed of | |
| Q | number of days victualled | |
| R | Problems | Records problems that need a human to fix them. Provides a minimum count for unresolved fields and also flags up when some of the fields are empty. |
| S | Autoresolved | Lists which fields in the row had to be reconciled by the script. All other fields should have total agreement among the transcriptions (after cleaning). |
| T | Repo | The repository from which came the scripts which produced this row. |
| U | Commit | The commit from which came the scripts which produced this row. |
| V | Args | The exact invocation of `aggregate.py` which produced this row. |
| W | §°—’“”…;£ªéºöœü | This peculiar header exists to make sure that Google Sheets understands that the file's character set is UTF-8, not ASCII. Without this header, Google Sheets loses some of the information from the original transcriptions. There is no content in this column. |
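You can verify for yourself that the column W marker defeats an ASCII interpretation: all but one of its characters fall outside the ASCII range, so a consumer that assumed ASCII would stumble on the header row itself rather than silently mangling transcriptions further down the file.

```python
marker = "§°—’“”…;£ªéºöœü"

# Only the semicolon is ASCII; the other 14 characters are not.
non_ascii = [ch for ch in marker if ord(ch) > 127]
print(len(non_ascii))  # 14

try:
    marker.encode("ascii")
except UnicodeEncodeError:
    print("marker is not representable in ASCII")
```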

About mimsy.txt