sul-dlss / gis-robot-suite

Robots for GIS accessioning and delivery

Rewrite assign_placenames robot #822

Open edsu opened 9 months ago

edsu commented 9 months ago

The existing assign_placenames step in the gisAssemblyWorkflow tries to assign GeoNames IDs to subjects for places and stores them in the Cocina. It does this using the Gazetteer class that looks up names in a CSV file.

Work on #744 led to the conclusion that:

However, the place names do appear to mostly map to Library of Congress Name Authority headings. 95% of all the GIS items have subject place names that are present in the LC Name and Subject Authority File.

We would like to update assign_placenames to ensure that the place names are valid, and add the URI for the authority record to the Cocina. This will ensure that:

So, this ticket is to reboot assign_placenames to:

  1. Remove the CSV and all functionality related to it
  2. Modify the Gazetteer class so that it looks up a place name in id.loc.gov, finds an exact-match Name or Subject Authority record, and returns the id.loc.gov URI.
  3. Add the id.loc.gov URI to the place subject in the Cocina.

The subject should have a uri and source added:

{
  "value": "Finland",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/names/n79065711",
  "source": {
    "code": "lcnaf",
    "uri": "http://id.loc.gov/authorities/names/"
  }
}

Some of the place names have been found in the Subject Authority File instead. You can identify these because they will have sh in their ID:

{
  "value": "Arctic Ocean",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/subjects/sh85006951",
  "source": {
    "code": "lcsh",
    "uri": "http://id.loc.gov/authorities/subjects/"
  }
}
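As a rough sketch of how the two shapes above could be produced from a looked-up URI, the source block can be derived from the authority segment of the URI itself. The function name place_subject is made up for illustration, and the real robot would do this in Ruby against the Cocina models; this only shows the mapping.

```python
def place_subject(value, uri):
    """Build a place subject dict shaped like the examples above.

    Hypothetical helper: assumes the URI looks like
    http://id.loc.gov/authorities/<names|subjects>/<id>.
    """
    # Authority path segment -> source code, as used in the examples above.
    sources = {
        "names": "lcnaf",
        "subjects": "lcsh",
    }
    # Split off the record ID, then read the last segment of what is left.
    base, _, identifier = uri.rpartition("/")
    authority = base.rsplit("/", 1)[-1]
    if authority not in sources or not identifier:
        raise ValueError(f"not a name or subject authority URI: {uri}")
    return {
        "value": value,
        "type": "place",
        "uri": uri,
        "source": {"code": sources[authority], "uri": base + "/"},
    }


subject = place_subject(
    "Arctic Ocean", "http://id.loc.gov/authorities/subjects/sh85006951"
)
```

Note that this keeps the URI scheme exactly as given, so a lookup that returns https URIs would need normalizing first if the Cocina should store http URIs as in the examples.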

An example of looking up a heading using LC's id.loc.gov service can be found in this Jupyter Notebook. The Python implementation is included here, but it should map easily to Ruby. The search results are available in Atom XML. The JSON format didn't make immediate sense, but you are welcome to try to use that instead if you want.

import requests
from xml.etree import ElementTree

def lookup_name(name):
    url = "https://id.loc.gov/search/"
    params = {
        "q":  [
            f'"{name}"',
            'rdftype:Authority'
        ],
        "format": "atom"
    }
    resp = requests.get(url, params=params)
    resp.raise_for_status()

    doc = ElementTree.fromstring(resp.content)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in doc.findall('atom:entry', ns):
        title = entry.find("atom:title", ns).text
        uri = entry.find("atom:link", ns).attrib["href"]

        # If the strings match exactly, return the URI.
        # Note: some unauthorized headings have dashes in the URI and we want to ignore those,
        # i.e. prefer https://id.loc.gov/authorities/names/n79065711 to a URI with a dashed suffix.
        if title == name and '-' not in uri:
            return uri

    return None

lookup_name('Sri Lanka')
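
For testing without hitting id.loc.gov, the matching logic could be split out from the HTTP call, something like the sketch below. The function name find_exact_match and the sample feed are made up for illustration: only n79065711 comes from this ticket, the other IDs are placeholders, and a real response will have more elements per entry (possibly several links), so the link selection may need adjusting.

```python
from xml.etree import ElementTree

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample for illustration only; NOT a real id.loc.gov response.
# The dashed ID and the sh00000000 ID are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Finland</title>
    <link href="https://id.loc.gov/authorities/names/n79065711-781"/>
  </entry>
  <entry>
    <title>Finland</title>
    <link href="https://id.loc.gov/authorities/names/n79065711"/>
  </entry>
  <entry>
    <title>Finland, Gulf of</title>
    <link href="https://id.loc.gov/authorities/subjects/sh00000000"/>
  </entry>
</feed>"""


def find_exact_match(atom_xml, name):
    """Return the first entry URI whose title equals name, skipping dashed URIs."""
    doc = ElementTree.fromstring(atom_xml)
    for entry in doc.findall("atom:entry", ATOM_NS):
        title = entry.find("atom:title", ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        # Skip malformed entries rather than raising AttributeError.
        if title is None or link is None:
            continue
        uri = link.attrib.get("href", "")
        # Same rule as lookup_name: exact title match, no dash in the URI.
        if title.text == name and "-" not in uri:
            return uri
    return None


match = find_exact_match(SAMPLE_FEED, "Finland")
```

With this split, lookup_name would only need to fetch the feed and pass resp.content to the matcher.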

edsu commented 8 months ago

Much better title, merci!