Rework personnel profile scraper to address cases where a person is affiliated with more than one org

I ended up making a number of changes as part of this work to address other issues that had been lingering or that I encountered along the way. I've now reworked the process into a couple of distinct stages. The first stage builds what amounts to an artificial sitemap from the USGS Staff Profiles inventory. I cache the raw list of profile names (the last part of each URL) scraped from the inventory to the item talk page of the source item, which maintains a historical record of the points in time at which the list changes.
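A minimal sketch of the inventory comparison, not the code in the repository: it assumes profile URLs end in the profile name and that the cached list is a plain list of those names. The function names `profile_name` and `diff_inventory` are hypothetical.

```python
from urllib.parse import urlparse


def profile_name(url: str) -> str:
    """Return the last path segment of a profile URL (the profile name)."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def diff_inventory(cached_names, scraped_urls):
    """Compare the cached name list with names from freshly scraped URLs."""
    current = {profile_name(u) for u in scraped_urls}
    previous = set(cached_names)
    return {
        "added": sorted(current - previous),    # candidates for the queue
        "removed": sorted(previous - current),  # profiles no longer listed
        "current": sorted(current),             # new snapshot to cache
    }
```

A non-empty `added` or `removed` list would be the signal to write a new snapshot to the talk page, so the history only records points in time where the list actually changed.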

I also went back to a message queuing strategy, using a Redis instance. When I scrape the staff profile inventory and discover "new" profiles, I put them on a queue for subsequent processing. A new URL does not necessarily mean a new person, so I check whether we already know about the person described at the URL via ORCID and/or email address.
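The queueing and identity check could look roughly like the following. This is a sketch under stated assumptions: `client` is anything exposing `rpush(name, *values)` (as a redis-py `redis.Redis` client does), the queue key and message shape are made up for illustration, and `known_people` is a hypothetical in-memory mapping standing in for whatever lookup the knowledge base provides.

```python
import json


def enqueue_new_profiles(client, queue_key, scraped_names, cached_names):
    """Push profile names not seen before onto a Redis list; return them."""
    new_names = sorted(set(scraped_names) - set(cached_names))
    if new_names:
        messages = [json.dumps({"profile": name}) for name in new_names]
        client.rpush(queue_key, *messages)
    return new_names


def find_existing_person(profile, known_people):
    """Return the id of a known person matching by ORCID, then by email.

    Returns None when neither identifier matches, i.e. a new person.
    """
    orcid = profile.get("orcid")
    email = (profile.get("email") or "").lower()
    # ORCID is checked first because it is the more stable identifier.
    if orcid:
        for person_id, person in known_people.items():
            if person.get("orcid") == orcid:
                return person_id
    if email:
        for person_id, person in known_people.items():
            if (person.get("email") or "").lower() == email:
                return person_id
    return None
```

A worker popping messages from the queue would call `find_existing_person` on each scraped profile and either update the matched entity or create a new one.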

The main change resulting from this issue is a different structure for the scraped profiles, in which I place multiple organization names/links into a list. I also put titles into a list, because the HTML structure suggested that a person could end up with more than one (e.g., a different title for a person in a particular context?). When processing for claims, I can then include multiple "is affiliated with" claims for a person, linking them to multiple organizations, each with a "point in time" date qualifier.
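To illustrate the structure described above, here is a hedged sketch of a profile record with list-valued titles and organizations, and of turning it into claims. The class name, field names, and claim dictionary layout are assumptions for illustration and not the repository's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ScrapedProfile:
    name: str
    titles: list = field(default_factory=list)
    # Each organization is a dict like {"name": ..., "url": ...}.
    organizations: list = field(default_factory=list)


def build_claims(profile, found_on):
    """Build one claim per organization and per title, each date-qualified."""
    stamp = found_on.isoformat()
    claims = []
    for org in profile.organizations:
        claims.append({
            "property": "is affiliated with",
            "value": org["url"],
            "qualifiers": {"point in time": stamp},
        })
    for title in profile.titles:
        claims.append({
            "property": "occupation",
            "value": title,
            "qualifiers": {"point in time": stamp},
        })
    return claims
```

A person listed under two organizations would yield two "is affiliated with" claims, both carrying the date on which the profile was scraped.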

The other big thing that I took care of here was an entity-by-entity update process that does several key things:

- If an existing claim's data value matches what's in the current profile data being processed, I leave the claim in place but update its point in time qualifier to the date the information was found.
- If there is a new value in the current profile data being processed, I leave the old claims in place with their last known good point in time qualifier and add the new information as a new claim.
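The two rules above can be sketched as a small reconciliation function. This is an illustration only: claims are modeled as plain dictionaries with hypothetical `value` and `point_in_time` keys, and the real process would write through the knowledge base's API.

```python
def reconcile_claims(existing, current_values, found_on):
    """Apply the update rules to one property's claims for one entity.

    existing: list of {"value": ..., "point_in_time": ...} dicts.
    current_values: values found in the profile being processed.
    found_on: date string for when the current profile was scraped.
    Returns a new list; the input list is not modified.
    """
    current = list(dict.fromkeys(current_values))  # de-duplicate, keep order
    updated = []
    matched = set()
    for claim in existing:
        if claim["value"] in current:
            # Rule 1: value still present, so refresh the qualifier.
            updated.append({**claim, "point_in_time": found_on})
            matched.add(claim["value"])
        else:
            # Old claim stays with its last known good date.
            updated.append(dict(claim))
    for value in current:
        if value not in matched:
            # Rule 2: new value becomes a new claim.
            updated.append({"value": value, "point_in_time": found_on})
    return updated
```

Because old claims are never deleted, the qualifier dates on an entity form a simple history: the most recent date marks what the profile currently says, and older dates mark when a value was last seen.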

I saw this play out in several examples where people changed jobs and now have a new organizational affiliation or a new title, resulting in new "is affiliated with" or "occupation" claims.

I also update labels and descriptions as needed when the advertised names or titles, respectively, change on the staff profile pages. In the case of a name change, I move the previous label to the entity's list of aliases to retain the reference point.
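A hedged sketch of the label handling, assuming an entity is represented as a dictionary with `label` and `aliases` keys (a hypothetical shape, not the knowledge base's actual data model):

```python
def update_label(entity, new_label):
    """Return a copy of the entity with the label updated.

    On a name change the previous label is kept as an alias.
    """
    label = entity.get("label")
    aliases = list(entity.get("aliases", []))
    if new_label == label:
        return {**entity, "aliases": aliases}  # nothing to change
    if label and label not in aliases:
        aliases.append(label)
    # The new label should not also appear among the aliases.
    aliases = [alias for alias in aliases if alias != new_label]
    return {**entity, "label": new_label, "aliases": aliases}
```

Descriptions could be replaced in the same way without the alias step, since a changed title is already preserved in the dated "occupation" claims.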

skybristol / geokb #46