srophe / caesarea-data

Data repository for Caesarea-Maritima.org

Fix post-processor generation of bibl URIs from Zotero URIs #147

Closed wlpotter closed 1 year ago

wlpotter commented 1 year ago

It looks like the regex used to generate bibl URIs in the post-processor needs some work. It is too simple to catch and handle all the various formats of a Zotero link.

Simplest option would be to make the regex more flexible in catching variants.

Another would be to check against an index of existing Zotero URIs (maybe using the Zotero API to get a list of valid item keys?).

Needs more thought.

wlpotter commented 1 year ago

Looking back through the data from Box, there are a few difficult cases:

  1. Bibl URIs not from Zotero but with extraneous data after the bibl-uri-base
    1. Example: https://caesarea-maritima.org/bibl/364UZRVA/library
    2. Solution: take the substring after the bibl-uri-base; then take anything before '/' (use substring-before-if-contains); then add back the bibl-uri-base. This should already be implemented in the script.
  2. Zotero URIs that lack the /items/{item key} segment
    1. Example: https://www.zotero.org/groups/2320447/caesarea-maritima/Q4XVR2NU/item-list
    2. It might be worth having an index of Zotero URIs, maybe generated from a zotero2bibl instance. We could then iterate over the index until one of the entries matches as a substring of the ptr URI. This seems awfully convoluted for what is hopefully just an isolated incident.
    3. There might be a way to enforce type constraints that would throw an error if there's an invalid/malformed bibl URI. This is likely the easiest way to go, as the processing scripts currently log errors to the console, so we'd be able to pinpoint the (hopefully small number of) offending records.
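
The index-lookup idea from case 2 could be sketched roughly as below. This is only an illustration in Python, not code from the post-processor: the function name `find_item_key` and the `known_keys` collection are hypothetical, and the index is assumed to already exist as a plain collection of item keys (how it would be generated is left open). The two keys used are the ones appearing in the examples above.

```python
from typing import Iterable, Optional


def find_item_key(ptr_uri: str, known_keys: Iterable[str]) -> Optional[str]:
    """Return the first known item key that occurs as a substring of ptr_uri.

    Hypothetical helper: iterates over an index of item keys until one
    matches, as described in case 2. Returns None if nothing matches.
    """
    for key in known_keys:
        if key in ptr_uri:
            return key
    return None


# Hypothetical index; in practice this would come from some export of valid keys.
known_keys = ["364UZRVA", "Q4XVR2NU"]

malformed = "https://www.zotero.org/groups/2320447/caesarea-maritima/Q4XVR2NU/item-list"
key = find_item_key(malformed, known_keys)
```

A lookup like this recovers the key even when the /items/ segment is missing, at the cost of maintaining the index and scanning it for every URI, which is the convolution noted above.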
wlpotter commented 1 year ago

Cases:

  1. If it has the string "zotero", then the regex items\/.* should match the 'items/' string, the item key, and anything that may come after it. Extract the match using the regex; take the substring after 'items/'; then substring-before-if-contains '/' should result in just the item key.
  2. If it does not have the string "zotero", take the substring after the bibl-uri-base; then apply substring-before-if-contains '/'; then add back the bibl-uri-base.
  3. Enforce typing constraints to raise an error if the item key is empty. This may simply be a function that appends the bibl-uri-base to the item key: you pass the key, and it returns an empty sequence if the string length is 0. Specifying the return type as xs:string should do the trick, as a typing error will be raised when an empty sequence is returned for an empty key.
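
The three cases above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the post-processor's actual code: the names `BIBL_URI_BASE`, `extract_item_key`, and `make_bibl_uri` are hypothetical, the base value is inferred from the example bibl URI earlier in the thread, and a `ValueError` stands in for the typing error that the xs:string return type would raise.

```python
import re

# Assumed value, inferred from the example bibl URI earlier in the thread.
BIBL_URI_BASE = "https://caesarea-maritima.org/bibl/"


def substring_before_if_contains(text: str, delimiter: str) -> str:
    """Return the text before the first delimiter, or the text unchanged if absent."""
    return text.split(delimiter, 1)[0] if delimiter in text else text


def extract_item_key(uri: str) -> str:
    """Return the item key from a Zotero or bibl URI, or '' if none is found."""
    if "zotero" in uri:
        # Case 1: match 'items/', the item key, and anything after it.
        match = re.search(r"items/.*", uri)
        remainder = match.group(0)[len("items/"):] if match else ""
    else:
        # Case 2: substring after the base ('' if the base is not present).
        remainder = uri.partition(BIBL_URI_BASE)[2]
    return substring_before_if_contains(remainder, "/")


def make_bibl_uri(item_key: str) -> str:
    """Case 3: refuse to build a bibl URI from an empty key."""
    if not item_key:
        raise ValueError("empty item key: malformed bibl or Zotero URI")
    return BIBL_URI_BASE + item_key


# Non-Zotero URI with extraneous data after the key (example from this thread).
cleaned = make_bibl_uri(extract_item_key("https://caesarea-maritima.org/bibl/364UZRVA/library"))
```

With this shape, a Zotero URI that lacks the /items/ segment yields an empty key and fails loudly in `make_bibl_uri` rather than producing a bare base URI, which mirrors the intent of the typing constraint.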
wlpotter commented 1 year ago

I've updated the post-processor. I want to test it with the new data to see whether it works.

wlpotter commented 1 year ago

Based on tests with the newest data, I have found no failures that slipped through, and on one occasion the typing constraints caught malformed input URIs. So I believe this is fixed.