skybristol / geokb

Data processing workflows for initializing and building the Geoscience Knowledgebase
The Unlicense

Rework personnel profile scraper to address cases where a person is affiliated with more than one org #46

Open skybristol opened 9 months ago

skybristol commented 9 months ago

I've run into cases where Staff Profile pages list more than one organization that a person is affiliated with, mostly a laboratory within a Science Center. The web scraping routine I used didn't account for these cases, and I need to rework that element. This will re-cache the scraped content in item discussion pages and add additional affiliation links to the graph.

skybristol commented 7 months ago

I've ended up making a bunch of changes as part of this work to address other issues that had been lingering or that I encountered along the way. I've now reworked the process into a couple of different parts. The first stage builds what amounts to an artificial sitemap from the USGS Staff Profiles inventory. I cache a raw list of profile names (the last part of each URL) scraped from the inventory to the item talk page of the source item, in order to maintain a historical record of the points in time where this list changes.
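
As a rough illustration of that first stage, the sketch below pulls the profile name out of each URL and compares the current list against a previously cached one. The function names and the example URLs are hypothetical and are not taken from the geokb code.

```python
from urllib.parse import urlparse


def profile_name(url):
    """Return the last non-empty path segment of a profile URL."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def diff_inventory(previous, current):
    """Compare two collections of profile names and report what changed."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Hypothetical URLs standing in for links scraped from the inventory.
scraped_urls = [
    "https://www.usgs.gov/staff-profiles/jane-doe",
    "https://www.usgs.gov/staff-profiles/john-smith/",
]
names = [profile_name(u) for u in scraped_urls]
changes = diff_inventory(["jane-doe", "old-profile"], names)
```

Caching the full name list each time the diff is non-empty would give the point-in-time record described above.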

I also started using a message queuing strategy again with a Redis instance. When I scrape the staff profile inventory and discover "new" profiles, I toss these onto a queue for subsequent processing. New URLs may not mean a new person, so I have to check to see if we already know about the person described at the URL via ORCID and/or email address.
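
A minimal sketch of the queue-then-check step follows, under stated assumptions: a `collections.deque` stands in for the Redis queue so the example runs without a server, and the record layout (`orcid` and `email` keys, entity IDs such as `Q1`) is hypothetical. With redis-py the enqueue and dequeue would presumably map to list commands such as `lpush` and `rpop`.

```python
from collections import deque


def enqueue_new_profiles(scraped_names, cached_names, queue):
    """Push profile names not seen in the cached list onto the queue."""
    new_names = sorted(set(scraped_names) - set(cached_names))
    for name in new_names:
        queue.append(name)
    return new_names


def find_known_person(profile, known_people):
    """Return the ID of a known person matching on ORCID or email, else None."""
    orcid = profile.get("orcid")
    email = (profile.get("email") or "").lower()
    for person_id, person in known_people.items():
        if orcid and person.get("orcid") == orcid:
            return person_id
        if email and (person.get("email") or "").lower() == email:
            return person_id
    return None


# Hypothetical data: one known person and one newly discovered profile URL name.
known_people = {"Q1": {"orcid": None, "email": "jdoe@example.gov"}}
queue = deque()
enqueue_new_profiles(["jane-doe", "j-doe-phd"], ["jane-doe"], queue)

# The new URL turns out to describe someone already known, matched by email.
match = find_known_person({"email": "JDoe@example.gov"}, known_people)
```

Matching on ORCID first and falling back to a case-insensitive email comparison is one reasonable ordering, not necessarily the one used in the actual workflow.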

The main change to come out of this issue is a different structure for the scraped profiles, where I place multiple organization names/links into a list. I also put titles into a list because the HTML structure looked like it could produce more than one (e.g., a different title for a person in a particular context?). When processing claims, I can then include multiple "is affiliated with" claims for a person, linking them to multiple organizations, each with a "point in time" date qualifier.
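
The list-based profile structure and the resulting claims could look roughly like the sketch below. The `ScrapedProfile` class, the dictionary shape of a claim, and the organization URLs and IDs are all assumptions made for illustration; only the property and qualifier names come from the description above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ScrapedProfile:
    name: str
    titles: list = field(default_factory=list)          # one or more job titles
    organizations: list = field(default_factory=list)   # (name, url) pairs


def affiliation_claims(profile, org_ids, as_of):
    """Build one 'is affiliated with' claim per organization we can resolve."""
    claims = []
    for _org_name, org_url in profile.organizations:
        org_id = org_ids.get(org_url)
        if org_id is None:
            continue  # organization not in the graph yet; skip it here
        claims.append({
            "property": "is affiliated with",
            "value": org_id,
            "qualifiers": {"point in time": as_of.isoformat()},
        })
    return claims


# Hypothetical profile: a lab within a Science Center, as in the original report.
profile = ScrapedProfile(
    name="Jane Doe",
    titles=["Research Geologist"],
    organizations=[
        ("Example Science Center", "https://example.org/centers/example"),
        ("Example Laboratory", "https://example.org/labs/example"),
    ],
)
org_ids = {
    "https://example.org/centers/example": "Q100",
    "https://example.org/labs/example": "Q101",
}
claims = affiliation_claims(profile, org_ids, date(2024, 1, 15))
```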

The other big thing that I took care of here was an entity-by-entity update process that does several key things:

I saw this play out in several examples where people changed jobs and have a new organizational affiliation or a new title resulting in new "is affiliated with" claims or new "occupation" claims.

I also update labels and descriptions as needed if the advertised names or titles, respectively, from staff profile pages change. In the case of a name change, I move the previous label to the list of aliases for the entity, just to retain the reference point.
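
The label handling described above can be sketched as follows, assuming an entity is represented as a plain dictionary with `label` and `aliases` keys. That representation and the function name are hypothetical.

```python
def update_label(entity, new_label):
    """Set a new label, keeping the previous one as an alias.

    Returns True if the entity was changed, False otherwise.
    """
    old_label = entity.get("label")
    if old_label == new_label:
        return False
    aliases = entity.setdefault("aliases", [])
    if old_label and old_label not in aliases:
        aliases.append(old_label)  # retain the old name as a reference point
    if new_label in aliases:
        aliases.remove(new_label)  # a label should not also be its own alias
    entity["label"] = new_label
    return True


# Hypothetical entity whose advertised name changed on the staff profile page.
entity = {"label": "Jane Doe", "aliases": []}
changed = update_label(entity, "Jane Doe-Smith")
```

A description update would follow the same compare-then-write pattern, without the alias step.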