Implement phase 2

bogden1 commented 2 years ago

The scripts must be able to parse phase 2. This requires writing new entries in workflow.yaml for the phase 2 versions of the workflows. It also requires:

[x] Confirming that there is no overlap in workflow version numbers (or dealing with it, if there is). Update See below.
[x] Confirming that the tranche support is working correctly (because phase 2 is ongoing and so must be handled in tranches). Update Decided to set this aside.
[x] Converting dropdowns to text (should just be a workflow.yaml update). Update It was.
[x] Generally checking for changes in transcription rules and data formats, and updating the scripts to handle these if need be. Hopefully there will be nothing to do here, but it feels likely that a tweak or two will be required. Update Made the cleaner more forgiving (8f63f75ef70f554d0e7083cbff584a4fc1dc7af5 [update -- this hash is wrong, sorry]) -- it now returns original values rather than blanking them out. This is because e.g. age can be "6 weeks" rather than just a number. Update Otherwise just gave the Field Guide a quick once-over. Nothing else jumped out at me.
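
A minimal sketch of the "forgiving cleaner" behaviour described in the last item, for illustration only. The function name `clean_age` and its exact rules are assumptions, not the code in the scripts: the point is just that a value which does not parse as a plain number is passed through as typed rather than blanked.

```python
import re

def clean_age(raw):
    """Hypothetical cleaner: normalise purely numeric ages, keep anything else as typed."""
    text = raw.strip()
    if re.fullmatch(r"[0-9]+", text):
        return str(int(text))  # purely numeric: drop padding and leading zeros
    return raw                 # anything else (e.g. "6 weeks") is returned unchanged

print(clean_age("06"))
print(clean_age("6 weeks"))
```

A stricter cleaner would return an empty string in the second case, which is what loses values such as "6 weeks".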

This is underspecified for now -- will be able to update it once I get to grips with phase 2. Estimate of 5 days for now -- could be better or worse than this.

Essential, because the scripts must be able to process all of the data from HMS NHS.

bogden1 commented 2 years ago

We could potentially agree to abandon reading in tranches. I would suggest checking the existing support first.

Update I think that the tranche support either works or is very close to working. However, given time constraints, I am opting to assume that it does not work.

bogden1 commented 2 years ago

One thing to watch out for here is the assumption that "short" dates should be converted to 19th century. See manual_date_fixups branch for the real story here.

Update Deleted the relevant rule in commit f8704c572c77fd89321f56d388e3ad94acdd05d4. See the log for justification.

bogden1 commented 2 years ago

Another thing to watch out for is classifications with the "wrong" workflow version. See #8 for more on this. It may be that the "right" way to deal with this is to specify the subjects belonging to a phase, or at least the volumes belonging to it, rather than relying on the workflow versions making the difference for us.

Update Implemented "volume selection" on the phase2 branch, which was merged to main in commit c77ba3f. This may still be pulling in workflow versions that should be left out; will need to check that with RMG.
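
The idea of selecting by volume rather than by workflow version can be sketched roughly as below. This is an illustration under stated assumptions, not the code on the phase2 branch: the record layout, the `volume` field and the volume numbers in `PHASE2_VOLUMES` are all invented for the example.

```python
# Hypothetical volume numbers, invented for illustration.
PHASE2_VOLUMES = {3, 4}

def select_by_volume(classifications, volumes):
    """Keep classifications whose subject belongs to one of the given volumes,
    regardless of which workflow version produced them."""
    return [c for c in classifications if c["volume"] in volumes]

# Hypothetical classification records.
rows = [
    {"subject_id": 101, "volume": 1, "workflow_version": "3.1"},
    {"subject_id": 202, "volume": 3, "workflow_version": "3.1"},
    {"subject_id": 203, "volume": 4, "workflow_version": "5.2"},
]

selected = select_by_volume(rows, PHASE2_VOLUMES)
print([r["subject_id"] for r in selected])
```

Note that subject 202 is kept even though it shares a workflow version with a subject from another volume, which is the case that version-based filtering cannot separate.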

bogden1 commented 2 years ago

Don't forget this comment in aggregate.py

if type(data['version']) is list: #Assume that labels are the same in all versions, just use the first file.
                                  #TODO: Add some code to confirm that the labels are the same in all versions.
                                  #      At time of writing, extract.py will ensure this for phase 1 exports from HMS NHS.

Update Addressed in branch config_identity, which has been merged to main. extract.py now knows how to check all config file types for identity. Various additions make this work OK for phase2 and mean that we are now checking things a bit more carefully than we were before. Also, I have realised (rather too late) that this issue simply does not apply to phase2 as it does not have any dropdowns for us to worry about the labels of.
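
For reference, the kind of identity check the TODO asks for might look like the sketch below. It is not the implementation in extract.py: `first_if_identical` is a hypothetical helper, the configs are assumed to be already parsed into plain dicts, and the labels are invented.

```python
def first_if_identical(configs):
    """Return the first config, but only after checking that all of them are equal.

    `configs` is a hypothetical list of parsed config files, one per workflow version.
    """
    if not configs:
        raise ValueError("no configs supplied")
    first = configs[0]
    for index, other in enumerate(configs[1:], start=1):
        if other != first:
            raise ValueError(f"config {index} differs from config 0")
    return first

# Invented label sets standing in for per-version dropdown labels.
v1 = {"labels": ["Able Seaman", "Boy", "Master"]}
v2 = {"labels": ["Able Seaman", "Boy", "Master"]}
v3 = {"labels": ["Able Seaman", "Boy", "Mate"]}

print(first_if_identical([v1, v2]))
try:
    first_if_identical([v1, v2, v3])
except ValueError as err:
    print("mismatch:", err)
```

Failing loudly on a mismatch, rather than silently using the first file, is what turns the assumption in the comment into a checked condition.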

bogden1 commented 2 years ago

Need to check that these are the correct workflows (these are id/name pairs):

"18344": "creed"
"18347": "of what port/port of registration"
"18611": "admission number"
"18612": "date of entry"
"18613": "name"
"18614": "quality"
"18616": "age"
"18617": "place of birth"
"18618": "port sailed out of"
"18619": "years at sea"
"18621": "last services"
"18622": "under what circumstances admitted (or nature of complaint)"
"18623": "date of discharge"
"18624": "how disposed of"
"18625": "number of days victualled"

This is the list of currently-active workflows for phase 2, but need to know whether others have been used in the past or are likely to be used in the future.

Update Pretty comfortable with this list at this point, expect to confirm with RMG.
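
One way to spot workflows that were used in the past but are missing from the list is to compare it against the workflow ids that actually appear in the data. The sketch below assumes the ids have already been collected from an export; the helper name `unlisted_workflow_ids` and the id "99999" are hypothetical.

```python
# The id/name pairs listed above.
ACTIVE_PHASE2_WORKFLOWS = {
    "18344": "creed",
    "18347": "of what port/port of registration",
    "18611": "admission number",
    "18612": "date of entry",
    "18613": "name",
    "18614": "quality",
    "18616": "age",
    "18617": "place of birth",
    "18618": "port sailed out of",
    "18619": "years at sea",
    "18621": "last services",
    "18622": "under what circumstances admitted (or nature of complaint)",
    "18623": "date of discharge",
    "18624": "how disposed of",
    "18625": "number of days victualled",
}

def unlisted_workflow_ids(seen_ids, known=ACTIVE_PHASE2_WORKFLOWS):
    """Return the ids that occur in the data but are absent from the known list."""
    return sorted({str(i) for i in seen_ids} - set(known))

# Hypothetical ids gathered from an export; "99999" is made up.
seen = ["18611", "18613", 18616, "99999"]
print(unlisted_workflow_ids(seen))
```

An empty result would support the list as it stands; anything else is a candidate to raise with RMG.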

bogden1 commented 2 years ago

Need to update the mimisfier to take its names from workflow.yaml rather than from a hardcoded list.

Update Done: 3198fa0d21b6afc6b3728276d8b4bb37824c7bd6

bogden1 commented 2 years ago

As far as I can tell, we're doing an OK job at processing phase2 -- enough for a handover, anyway. Closing.

nationalarchives / hms-nhs-scripts

Implement phase 2 #14