srophe / syriaca-data

Repository for Syriaca.org TEI data, used by srophe-eXist-app.

Script to clean up links to deprecated URIs #507

Open nathangibson opened 8 years ago

nathangibson commented 8 years ago

We need a maintenance script to clean up any links to deprecated URIs and replace them with their redirects.

Deprecated person URIs can be found in either of 2 ways:

  1. All /TEI/text/body/listPerson/person/idno[@type="URI" and matches(.,"http://syriaca\.org")] elements in records in https://github.com/srophe/srophe-app-data/tree/dev/data/deprecated/. These records also have revisionDesc[@status="deprecated"]; or
  2. /TEI/text/body/listPerson/person/idno[@type="deprecated"] elements in the regular data folder. (But this will only find deprecated records that have a redirect, and I'm not sure I've caught them all, because the script did not add this element in the beginning.)

All URIs that are the target of a redirect can be found by looking in the deprecated folder https://github.com/srophe/srophe-app-data/tree/dev/data/deprecated/ (or revisionDesc[@status="deprecated"]) for /TEI/text/body/listPerson/person/idno[@type="redirect"].
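As a rough illustration of the first lookup, here is a minimal Python sketch that reads one deprecated record and pairs each deprecated URI with its redirect, if any. It is not the eventual maintenance script (which would presumably be XQuery, like merge-persons.xql). It assumes the records are in the standard TEI namespace, and the function name `redirects_from_deprecated` and the sample record are made up for the example.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
# Path from the issue, relative to the root <TEI> element.
PERSON = "tei:text/tei:body/tei:listPerson/tei:person"


def redirects_from_deprecated(xml_text):
    """Return {deprecated_uri: redirect_uri or None} for one deprecated record."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for person in root.findall(PERSON, TEI):
        redirect = person.find("tei:idno[@type='redirect']", TEI)
        has_target = redirect is not None and redirect.text
        target = redirect.text.strip() if has_target else None
        for idno in person.findall("tei:idno[@type='URI']", TEI):
            uri = (idno.text or "").strip()
            # Mirrors matches(., "http://syriaca\.org") in the XPath above.
            if "http://syriaca.org" in uri:
                mapping[uri] = target
    return mapping


# Hypothetical deprecated record, reduced to the elements that matter here.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><listPerson><person>
<idno type="URI">http://syriaca.org/person/1492</idno>
<idno type="redirect">http://syriaca.org/person/1168</idno>
</person></listPerson></body></text></TEI>"""

print(redirects_from_deprecated(SAMPLE))
```

A deprecated record with no idno[@type="redirect"] maps to `None`, which keeps "deprecated without a redirect" distinct from "not deprecated at all".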

If it helps, see the "syriaca:update-person-work-links" function in this script https://github.com/srophe/srophe-xQueries/blob/master/merge-persons.xql#L290

davidamichelson commented 8 years ago

@wsalesky I'm assigning this to you, as per our conversation. Here are the specific details of what we need, but please see Nathan's notes above as well.

Please write a maintenance script that we will run periodically on all data.

The script should:

  1. Build a list of all deprecated URIs and, where applicable, the redirect URIs which replace them (not all deprecated URIs have a redirect).
  2. Deprecated URIs can be collected from all records in https://github.com/srophe/srophe-app-data/tree/dev/data/deprecated/ and are found at this path: /TEI/text/body/listPerson/person/idno[@type="URI" and matches(.,"http://syriaca\.org")]. These records also have revisionDesc[@status="deprecated"].
  3. The redirect URIs can be collected in either of two places:

     a. In non-deprecated records that are themselves the target of a redirect, the URIs which redirect to them are at /TEI/text/body/listPerson/person/idno[@type="deprecated"]. In this case the URI of the record itself is the redirect.

     b. All URIs that are the target of a redirect can be found by looking in the deprecated folder https://github.com/srophe/srophe-app-data/tree/dev/data/deprecated/ (or revisionDesc[@status="deprecated"]) for the path /TEI/text/body/listPerson/person/idno[@type="redirect"].
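To make option (a) concrete, here is a hedged Python sketch that reads the redirect pairs off a non-deprecated record and merges them with a list built from the deprecated folder. It assumes the TEI namespace, and it assumes that the record's own URI is the first idno[@type="URI"] containing http://syriaca.org. The names `redirects_from_live` and `merge_maps` are invented for the example, as is the sample record.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
PERSON = "tei:text/tei:body/tei:listPerson/tei:person"


def redirects_from_live(xml_text):
    """Return {deprecated_uri: record_uri} for one non-deprecated record."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for person in root.findall(PERSON, TEI):
        # Assumption: the first syriaca.org URI idno is the record's own URI.
        own = next(
            (i.text.strip()
             for i in person.findall("tei:idno[@type='URI']", TEI)
             if i.text and "http://syriaca.org" in i.text),
            None,
        )
        for dep in person.findall("tei:idno[@type='deprecated']", TEI):
            if own and dep.text:
                mapping[dep.text.strip()] = own
    return mapping


def merge_maps(from_deprecated, from_live):
    """Combine both sources; a known redirect is never overwritten."""
    merged = dict(from_deprecated)
    for uri, target in from_live.items():
        if merged.get(uri) is None:
            merged[uri] = target
    return merged


# Hypothetical non-deprecated record that absorbed a deprecated one.
LIVE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><listPerson><person>
<idno type="URI">http://syriaca.org/person/1168</idno>
<idno type="deprecated">http://syriaca.org/person/1492</idno>
</person></listPerson></body></text></TEI>"""

combined = merge_maps({"http://syriaca.org/person/1492": None},
                      redirects_from_live(LIVE))
print(combined)
```

Merging the two sources means a redirect recorded in only one of the two places is still picked up.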

  4. Once you have built the list of deprecated URIs, please use it to search for deprecated URIs in all non-deprecated records. NOTE: This search and the subsequent replace should exclude the path /TEI/text/body/listPerson/person/idno[@type="deprecated"], because we want that left untouched.
  5. For all matches on a deprecated URI, please do the following:

     a. If there is a redirect URI, replace the deprecated URI with it.

     b. If there is no redirect URI, return a list of all such matches for hand editing.
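The search-and-replace steps could look roughly like the Python sketch below. It is only an illustration under several assumptions: records are in the TEI namespace, links appear either as whitespace-separated tokens in attribute values (such as `ref`) or as the whole text of an element, and idno[@type="deprecated"] is skipped wherever it occurs. The function name `clean_links`, the sample record, and the person/0000 URI are made up. The 1492 to 1168 pair comes from the test record mentioned in this comment.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def clean_links(xml_text, mapping):
    """Return (updated_xml, unresolved) for one non-deprecated record.

    mapping is {deprecated_uri: redirect_uri or None}.
    """
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    root = ET.fromstring(xml_text)
    unresolved = []

    def resolve(token):
        # Exact match only, so person/14 never matches person/1492.
        if token not in mapping:
            return token
        if mapping[token] is None:
            unresolved.append(token)  # no redirect: leave for hand editing
            return token
        return mapping[token]

    for el in root.iter():
        # Leave idno[@type="deprecated"] untouched, as the issue requires.
        if el.tag == f"{{{TEI_NS}}}idno" and el.get("type") == "deprecated":
            continue
        for name, value in list(el.attrib.items()):
            tokens = value.split()
            if any(t in mapping for t in tokens):
                el.set(name, " ".join(resolve(t) for t in tokens))
        text = (el.text or "").strip()
        if text in mapping:
            el.text = el.text.replace(text, resolve(text))
    return ET.tostring(root, encoding="unicode"), unresolved


MAPPING = {
    "http://syriaca.org/person/1492": "http://syriaca.org/person/1168",
    "http://syriaca.org/person/0000": None,  # hypothetical, no redirect
}

# Hypothetical record, reduced to the elements that matter here.
RECORD = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><listPerson><person>
<idno type="URI">http://syriaca.org/person/1168</idno>
<idno type="deprecated">http://syriaca.org/person/1492</idno>
<note>See <persName ref="http://syriaca.org/person/1492">Suzanna</persName> and
<persName ref="http://syriaca.org/person/0000">Unknown</persName>.</note>
</person></listPerson></body></text></TEI>"""

updated, unresolved = clean_links(RECORD, MAPPING)
print(updated)
print("Needs hand editing:", unresolved)
```

Matching whole tokens rather than substrings is a deliberate choice here: a plain string replace of person/149 would also rewrite person/1492.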

Please let me know if this is clear or if you have any questions. A test record for this script is http://wwwb.library.vanderbilt.edu/exist/apps/srophe/person/1167:

<note type="abstract" xml:id="abstract-en-1167">In hagiography: Bassus and his twin <persName ref="http://syriaca.org/person/1492">Suzanna</persName> were twin children of a Zoroastrian governor. 

In the example above, the URI person/1492 should be changed by the script to person/1168:

<note type="abstract" xml:id="abstract-en-1167">In hagiography: Bassus and his twin <persName ref="http://syriaca.org/person/1168">Suzanna</persName> were twin children of a Zoroastrian governor.