Rewrite assign_placenames robot

The existing assign_placenames step in the gisAssemblyWorkflow tries to assign GeoNames IDs to subjects for places and stores them in the Cocina. It does this using the Gazetteer class that looks up names in a CSV file.

Work on #744 led to the conclusion that:

It's not sustainable to keep the CSV mapping up to date with all the place names that will occur.
GeoNames IDs would be difficult to look up automatically in the GeoNames web service because of too many false positives.
Only 58% of GIS items have GeoNames IDs and they aren't used in GeoBlacklight or elsewhere in SUL access systems.

However, the place names do appear to mostly map to Library of Congress Name Authority headings. 95% of all the GIS items have subject place names that are present in the LC Name and Subject Authority File.

We would like to update assign_placenames to ensure that the place names are valid, and add the URI for the authority record to the Cocina. This will ensure that:

GIS items related to a place colocate in search results when faceting in EarthWorks and SearchWorks
The id.loc.gov URI can later be used to look up links to Wikidata, GeoNames, etc. to add additional information (e.g. providing descriptions or images for places in GeoBlacklight).

So, this ticket is to reboot assign_placenames to:

Remove the CSV and all functionality related to it
Modify the Gazetteer class so that it looks up a place name in id.loc.gov, finds an exact-match Name or Subject Authority record, and returns the id.loc.gov URI.
Add the id.loc.gov URI to the place subject in the Cocina.

The subject should have a uri and source added:

{
  "value": "Finland",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/names/n79065711",
  "source": {
    "code": "lcnaf",
    "uri": "http://id.loc.gov/authorities/names/"
  }
}

Some of the place names have been found in the subject authority file. You can identify these because they will have sh in their ID:

{
  "value": "Arctic Ocean",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/subjects/sh85006951",
  "source": {
    "code": "lcsh",
    "uri": "http://id.loc.gov/authorities/subjects/"
  }
}
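The two subject shapes above differ only in the source, which appears to follow from the URI path. A minimal sketch of that mapping, assuming the source can be derived from the URI prefix and that the http:// form shown in the examples is the one to store. The names SOURCES and build_place_subject are hypothetical and are not part of the existing robot:

```python
# Hypothetical helper: names are illustrative, not from gis-robot-suite.
# Maps each authority base URI to the source code shown in the examples above.
SOURCES = {
    "http://id.loc.gov/authorities/names/": "lcnaf",
    "http://id.loc.gov/authorities/subjects/": "lcsh",
}

def build_place_subject(value, uri):
    """Return a place subject dict with uri and source, as in the examples above."""
    # Assumption: normalize https:// to the http:// form used in the examples.
    if uri.startswith("https://"):
        uri = "http://" + uri[len("https://"):]
    for source_uri, code in SOURCES.items():
        if uri.startswith(source_uri):
            return {
                "value": value,
                "type": "place",
                "uri": uri,
                "source": {"code": code, "uri": source_uri},
            }
    # Anything outside the two authority files is treated as an error here.
    raise ValueError(f"Unrecognized authority URI: {uri}")
```

Raising on an unrecognized URI is one possible choice; the robot may prefer to leave the subject unchanged instead.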

An example of looking up a heading using LC's id.loc.gov service can be found in this Jupyter Notebook. The Python implementation is included here, but it should map easily to Ruby. The search results are available in Atom XML. The JSON format didn't make immediate sense, but you are welcome to try to use that instead if you want.

import requests
from xml.etree import ElementTree

def lookup_name(name):
    url = "https://id.loc.gov/search/"
    params = {
        "q":  [
            f'"{name}"',
            'rdftype:Authority'
        ],
        "format": "atom"
    }
    resp = requests.get(url, params=params)
    resp.raise_for_status()

    doc = ElementTree.fromstring(resp.content)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in doc.findall('atom:entry', ns):
        title = entry.find("atom:title", ns).text
        uri = entry.find("atom:link", ns).attrib["href"]

        # If the strings match, return the URI.
        # Note: some unauthorized headings have dashes in the URI and we want to ignore those,
        # i.e. prefer https://id.loc.gov/authorities/names/n79065711 to a variant URI containing a dash.
        if title == name and '-' not in uri:
            return uri

    return None

lookup_name('Sri Lanka')
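To test the matching rule without calling id.loc.gov, the Atom parsing can be separated from the HTTP request. This is a minimal sketch, assuming the feed has the shape the function above relies on (atom:entry elements containing atom:title and atom:link). The function name find_exact_match and the sample feed are made up for illustration, and the dashed URI in the sample is a placeholder, not a real record:

```python
from xml.etree import ElementTree

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def find_exact_match(atom_xml, name):
    """Return the first dash-free URI whose title equals name, else None."""
    doc = ElementTree.fromstring(atom_xml)
    for entry in doc.findall("atom:entry", ATOM_NS):
        title = entry.find("atom:title", ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        # Skip malformed entries rather than raising on a missing element.
        if title is None or link is None:
            continue
        uri = link.attrib.get("href", "")
        if title.text == name and "-" not in uri:
            return uri
    return None

# Hypothetical fixture: the first entry stands in for a dashed variant URI.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Finland</title>
    <link href="https://id.loc.gov/authorities/names/example-variant"/>
  </entry>
  <entry>
    <title>Finland</title>
    <link href="https://id.loc.gov/authorities/names/n79065711"/>
  </entry>
</feed>"""

print(find_exact_match(SAMPLE_FEED, "Finland"))
```

Keeping the parser free of network calls should also make the equivalent Ruby method easier to cover with fixture-based specs.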

sul-dlss / gis-robot-suite

Rewrite assign_placenames robot #822