These are the data processing scripts for the Zooniverse crowdsourcing project HMS NHS: The Nautical Health Service. HMS NHS is a Royal Museums Greenwich project, and part of the larger Towards a National Collection project, Engaging Crowds.
Volunteers on HMS NHS transcribed pages from the Admissions Registers of the Dreadnought Seamen's Hospital. Each page was transcribed by multiple volunteers. These scripts reconcile the different transcriptions from each page into a single transcription. When they cannot find a reconciliation then they list all of the possibilities for a human moderator.
The general approach is to make use of Zooniverse's own Panoptes aggregation scripts for the main part of the reconciliation task, with hms-nhs-scripts doing additional data cleansing and formatting the data into two output forms:
- `joined.csv`: A CSV file presenting the transcriptions row by row.
- `mimsy.txt`: A text file suitable for ingest into Mimsy, RMG's cataloguing system. Generated from `joined.csv` after hand corrections.

1. Clone the repository: `git clone https://github.com/nationalarchives/hms-nhs-scripts.git`
2. Create the `exports` directory: `cd hms-nhs-scripts; mkdir exports`
3. Install the dependencies: `pip install -r requirements.txt`
4. Download `hms-nhs-the-nautical-health-service-subjects.csv` (link next to `Request new subject export` button)
5. Download `hms-nhs-the-nautical-health-service-workflows.csv` (link next to `Request new workflow export` button)
6. Download the workflow classification exports (`Request new workflow classification export` button)
7. Place the downloaded files in the `exports/` directory.
8. Run `./extract.py phase1`. This may take a few hours. Run as `nice ./extract.py phase1` if you don't want it to dominate your computer's resources.
9. Commit to the repository the information that `extract.py` reports. `extract.py` itself will tell you how to do this.
10. Check `extraction/postextract_*.log` to see possible cross-references in the original input data.
11. Generate `joined.csv`: `./aggregate.py -t 0.3 phase1`. This may take several minutes. `joined.csv` will appear in the `output` directory.
12. Check that `joined.csv` is safe to open in certain spreadsheet software: `./misc_scripts/maxcolwidth.sh`
13. Correct `joined.csv` by hand (see Correcting joined.csv, below)
14. Generate `mimsy.txt`: `./mimsify.py phase1` (see Generating mimsy.txt, below)

## Correcting `joined.csv`
Automatic reconciliation is necessarily imperfect. You can control the "aggressiveness" of the reconciler by changing the `--text_threshold` and `--dropdown_threshold` parameters of `aggregate.py`: lower numbers are more aggressive. Greater aggression will result in more reconciled cells but also more incorrectly reconciled cells. Run with `--help` to see other options.
The recommended way to correct `joined.csv` is to open it in a spreadsheet. We used Google Sheets. You may run into some quirks if you use a different spreadsheet.
The `original` column gives you a link to the original page on Zooniverse, so that you can read it for yourself.
When correcting data, there are some formatting rules to follow:

- Dates must be written in the form `Aug 07 1872`: the month name is always the three letter abbreviation, single-digit days must have a leading zero and the year must be written with all four digits.
- Numbers in the `years at sea` column must be written with two digits for the integer part, with the two types of service separated by `;`. For example, if the patient had spent 12 years in the Navy and 6 years and 3 months in the merchant service, this should be written `12; 06.25`.

`mimsify.py` checks for problems in its input, so if you make mistakes such as typing a date or a years at sea value in the wrong format, it will fix it if it can, and otherwise will tell you about it.
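The two formatting rules can be illustrated with a small Python sketch. This is not the validation that `mimsify.py` performs: the helper names `is_valid_date`, `format_service` and `format_years_at_sea` are hypothetical. It assumes that part-years are written as decimal fractions with trailing zeros dropped, and that the Navy figure comes first, as in the `12; 06.25` example. The date check covers the format only, so a day of `00` is accepted.

```python
import re

# Three-letter month abbreviations accepted in dates such as "Aug 07 1872".
MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
DATE_PATTERN = re.compile(r"([A-Z][a-z]{2}) (\d{2}) (\d{4})")


def is_valid_date(text):
    """Check the *format* only: 'Mon DD YYYY' with a known month name."""
    match = DATE_PATTERN.fullmatch(text)
    return match is not None and match.group(1) in MONTHS


def format_service(years, months=0):
    """Format one period of service with a two-digit integer part."""
    value = years + months / 12
    text = f"{value:05.2f}"                 # e.g. 6.25 -> "06.25"
    return text.rstrip("0").rstrip(".")    # e.g. "12.00" -> "12"


def format_years_at_sea(navy, merchant):
    """Join the two types of service (assumed order: Navy, then merchant)."""
    return f"{format_service(*navy)}; {format_service(*merchant)}"
```

For example, `format_years_at_sea((12, 0), (6, 3))` gives `12; 06.25`, and `is_valid_date("Aug 7 1872")` is `False` because the day lacks its leading zero.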
The `Problems` column tells you about missing data and fields that could not be autoresolved, or that need checking for other reasons, such as unusual inputs (for example, a date with the "day" part set to 0, which often indicates that the day is not given in the original document) or explicit patterns of text that mean that the transcriber was uncertain about something (for example `...`, `[ill]`, `(HMS Iphigenia?)`). To find rows that need correction, use `Ctrl-Down` and `Ctrl-Up` to leap across empty cells in this column. If there are large numbers of problems, you could consider increasing the aggressiveness of the reconciliation.
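If you would rather list the affected rows outside the spreadsheet, something like the following sketch may help. It is not part of hms-nhs-scripts: the function name `rows_needing_correction` is hypothetical, and it assumes that the file is UTF-8 and that the header of the column is exactly `Problems`.

```python
import csv


def rows_needing_correction(path):
    """Yield (spreadsheet_row, problems) for rows with a non-empty Problems cell.

    Rows are counted as a spreadsheet would show them: the header is row 1,
    so the first data row is row 2. Multi-line cells do not affect the count
    because the csv module reads whole records, not physical lines.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            problems = (row.get("Problems") or "").strip()
            if problems:
                yield row_number, problems
```

Calling `list(rows_needing_correction("output/joined.csv"))` would then give one entry per row that still needs attention.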
Individual cells that need correction are often easily spotted. The following example is from `number of days victualled` (volume 1, p. 89, admission no. 7127):

    6.0
    ----------
    6.0 @2
    70.0 @2

Here, the `6.0` at the top is the script's best guess at the correct value. The rows beneath the hyphens indicate that 2 transcribers thought the number was `6`, and 2 transcribers thought the number was `70`. Looking at the original page image shows that the correct value is `6`, so you should delete the original value of the cell and replace it with `6`.
Here is a second example from `how disposed of` (volume 1, p. 90, admission no. 7160):

    <No best guess>
    ----------
    To a/his ship cured
    Request cured
    Shipped

Here, the script has no guess at the correct value. The rows beneath the hyphens indicate that 1 transcriber thought it was `To a/his ship cured`, 1 thought it was `Request cured` and 1 thought it was `Shipped` (when a transcription was given only once, there is no `@`). Looking at the original page shows that the correct text is `Request Shipped to go to his home (Cured)` -- which was not an option in the dropdown used by the transcribers, so it is not surprising that they gave different answers! You should either choose one of the possible answers as the "correct" one, or else delete the original value of the cell and replace it with the actual text from the page.
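The layout of an unresolved cell can be read mechanically. The sketch below is inferred only from the two examples above and is not taken from the scripts themselves: the function name `parse_cell` is hypothetical, and it assumes that the separator is a line of hyphens and that a trailing `@N` always means a count. A transcription that itself ends in `@` followed by digits would be misread.

```python
import re

SEPARATOR = re.compile(r"^-{3,}$")        # the row of hyphens
COUNTED = re.compile(r"^(.*?)\s*@(\d+)$")  # "6.0 @2" -> ("6.0", "2")


def parse_cell(cell):
    """Split a cell into (best_guess, [(transcription, count), ...]).

    A cell without a separator line is treated as already resolved and is
    returned with an empty candidate list. best_guess is None when the cell
    says "<No best guess>".
    """
    lines = [line.strip() for line in cell.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if SEPARATOR.match(line):
            break
    else:
        return cell.strip(), []

    best_guess = " ".join(lines[:index])
    if best_guess == "<No best guess>":
        best_guess = None

    candidates = []
    for line in lines[index + 1:]:
        match = COUNTED.match(line)
        if match:
            candidates.append((match.group(1), int(match.group(2))))
        else:
            candidates.append((line, 1))  # no "@": given only once
    return best_guess, candidates
```

Applied to the first example, this returns `("6.0", [("6.0", 2), ("70.0", 2)])`. For the second, the best guess is `None` and each of the three candidates has a count of 1.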
Sometimes cells need correction because they contain text that looks like an indication of transcriber uncertainty (for example, `...` or `[]`). These will not stand out as much as the examples above, but should still look pretty odd when scanning by eye. It is just important to be aware that you need to look out for these kinds of things, as well as the more obvious cases -- both might exist in the same row.
The `Autoresolved` column lists fields where transcriptions disagreed and were successfully autoresolved. You can use this information to do spot checks that the resolver is performing correctly. If you find too many mistakes then you could consider reducing the aggressiveness of the reconciliation.
Note that the final few rows of the final page of the final volume (at the bottom of the CSV file) are blank. You will need to manually delete the various bad admission numbers/zeroes/Missing Entry values in these cells. (There are also blank rows to be found throughout phase2, but the scripts do a better job of finding and removing these for you.)

## Generating `mimsy.txt`
Once you have corrected `joined.csv`, you can generate a file for Mimsy ingest by running `mimsify.py` to create `output/mimsy.txt`. Either put the corrected CSV file in the `output` directory, or use `mimsify.py <location>` to tell the script where the corrected file is. I recommend calling the corrected file `corrected.csv`, placing it next to `joined.csv` and running `mimsify.py output/corrected.csv`.
You are likely to see messages as `mimsify.py` runs. Warnings are for information: you may want to check what is happening, but the script can handle the input. Errors are a real problem that you should resolve. Both types of message will tell you where the problem is in the input file.
Run with `--unresolved` to allow unresolved fields and flatten them into a convenient format. If you run without this option then any unresolved fields will trigger an error message.
Run with `--help` to see other options.

## `joined.csv`
You can read about data transformations involved in creating `joined.csv` in DATA_README.md.
The columns in `joined.csv` are as follows:

Column | Header | Description |
---|---|---|
A | original | Link to the image of the original page on Zooniverse. Helpful for checking and reconciling transcriptions. |
B | subject | The Zooniverse subject id for the page. |
C | volume | The volume of the Admissions Registers that the page comes from. |
D | page | The number of the page in the volume. |
E | admission number | Reconciled transcription for each column of the original page. |
F | date of entry | |
G | name | |
H | quality | |
I | age | |
J | place of birth | |
K | port sailed out of | |
L | years at sea | |
M | last services | |
N | under what circumstances admitted (or nature of complaint) | |
O | date of discharge | |
P | how disposed of | |
Q | number of days victualled | |
R | Problems | Records problems that need a human to fix them. Provides a minimum count for unresolved fields and also flags up when some of the fields are empty. |
S | Autoresolved | Lists which fields in the row had to be reconciled by the script. All other fields should have total agreement among the transcriptions (after cleaning). |
T | Repo | The repository from which came the scripts which produced this row. |
U | Commit | The commit from which came the scripts which produced this row. |
V | Args | The exact invocation of `aggregate.py` which produced this row. |
W | §°—’“”…;£ªéºöœü | This peculiar header exists to make sure that Google Sheets understands that the file's character set is UTF-8, not ASCII. Without this header, Google Sheets loses some of the information from the original transcriptions. There is no content in this column. |

## `mimsy.txt`