Open skybristol opened 12 months ago
I ended up making a number of changes as part of this work, addressing other issues that had been lingering or that I ran into along the way. I've now reworked the process into a couple of parts. The first stage builds what amounts to an artificial sitemap from the USGS Staff Profiles inventory. I cache the raw list of profile names (the last part of each URL) scraped from the inventory to the item talk page of the source item, which keeps a historical record of each point in time at which the list changes.
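A rough sketch of the snapshot idea, not the actual code: the function names, the history structure, and the example URL are all made up for illustration, and a plain list stands in for the talk page cache.

```python
from datetime import date


def profile_name(url):
    """Return the last path segment of a profile URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def record_snapshot(history, urls, on):
    """Append a dated snapshot only when the set of names has changed.

    `history` is a hypothetical stand-in for the content cached on the
    item talk page: a list of {"date": ..., "names": [...]} records.
    """
    names = sorted({profile_name(u) for u in urls})
    if history and history[-1]["names"] == names:
        return False  # nothing changed, so no new point in time is recorded
    history.append({"date": on.isoformat(), "names": names})
    return True


history = []
urls = ["https://example.org/staff-profiles/jane-doe"]  # hypothetical URL
record_snapshot(history, urls, date(2024, 1, 1))  # first run records a snapshot
record_snapshot(history, urls, date(2024, 1, 2))  # unchanged list is skipped
```

Only storing a record when the list differs keeps the history limited to the points in time where something actually changed.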
I also started using a message queuing strategy again, backed by a Redis instance. When I scrape the staff profile inventory and discover "new" profiles, I put them on a queue for subsequent processing. A new URL does not necessarily mean a new person, so I check whether we already know about the person described at the URL by matching on ORCID and/or email address.
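Roughly, the queue-then-check flow looks something like the sketch below. To keep it self-contained, a `collections.deque` stands in for the Redis queue, and the lookup dictionaries, field names, and entity ID are hypothetical; the real matching logic may differ.

```python
from collections import deque


def enqueue_new(queue, seen, scraped_names):
    """Put profile names not seen before onto the queue; return the new ones."""
    new = [name for name in scraped_names if name not in seen]
    for name in new:
        queue.append(name)  # with Redis this would be a push onto a list
        seen.add(name)
    return new


def find_existing(profile, by_orcid, by_email):
    """Return the known entity ID for a scraped profile, or None.

    `by_orcid` and `by_email` are assumed lookup tables built from the
    entities already in the graph. ORCID is checked first, then email.
    """
    orcid = profile.get("orcid")
    if orcid and orcid in by_orcid:
        return by_orcid[orcid]
    email = (profile.get("email") or "").lower()
    if email and email in by_email:
        return by_email[email]
    return None


queue, seen = deque(), {"jane-doe"}
enqueue_new(queue, seen, ["jane-doe", "jane-smith"])  # only jane-smith is queued

# A new URL that resolves to a person we already have (hypothetical ID):
find_existing({"email": "JDoe@example.org"}, {}, {"jdoe@example.org": "Q123"})
```

Lowercasing the email before the lookup is a choice made for this sketch so that differences in capitalization do not produce a duplicate person.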
The main change that came out of this issue is a different structure for the scraped profiles, in which I place multiple organization names/links into a list. I also put titles into a list because the HTML structure looked like it could produce more than one title in some cases (e.g., a different title for a person in a particular context?). When processing claims, I can then include multiple "is affiliated with" claims for a person, linking them to multiple organizations, each with a "point in time" date qualifier.
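For illustration, here is approximately what that structure and the resulting claims could look like. The dictionary keys, example values, and the plain-dict claim format are assumptions for this sketch; they are not the actual cache schema or the Wikibase API payload.

```python
def affiliation_claims(profile, scraped_on):
    """Build one "is affiliated with" claim per organization in the profile."""
    return [
        {
            "property": "is affiliated with",
            "value": org["url"],
            "qualifiers": {"point in time": scraped_on},
        }
        for org in profile["organizations"]
    ]


# Hypothetical scraped profile with list-valued titles and organizations
profile = {
    "name": "Jane Doe",
    "titles": ["Research Hydrologist"],
    "organizations": [
        {"name": "Example Science Center", "url": "https://example.org/centers/esc"},
        {"name": "Example Laboratory", "url": "https://example.org/labs/el"},
    ],
}

claims = affiliation_claims(profile, "2024-01-01")  # two claims, one per organization
```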
The other big thing that I took care of here was an entity-by-entity update process that does several key things:
I saw this play out in several cases where people changed jobs and now have a new organizational affiliation or a new title, resulting in new "is affiliated with" claims or new "occupation" claims.
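One way to picture the job-change case is as a difference between what an entity already claims and what the latest scrape shows. This is only a sketch under that assumption; the function name and the claim dictionaries are hypothetical.

```python
def missing_affiliations(existing_claims, scraped_orgs):
    """Return scraped organizations that have no "is affiliated with" claim yet."""
    known = {
        claim["value"]
        for claim in existing_claims
        if claim["property"] == "is affiliated with"
    }
    return [org for org in scraped_orgs if org not in known]


existing = [{"property": "is affiliated with", "value": "https://example.org/centers/old"}]
scraped = ["https://example.org/centers/old", "https://example.org/centers/new"]
to_add = missing_affiliations(existing, scraped)  # only the new center remains
```

The same comparison would apply to titles and "occupation" claims.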
I also update labels and descriptions as needed if the advertised names or titles, respectively, from staff profile pages change. In the case of a name change, I move the previous label to the list of aliases for the entity, just to retain the reference point.
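The label and description handling could be sketched like this, assuming an entity is represented as a simple dictionary with `label`, `description`, and `aliases` keys (a simplification for this example, not the real entity model).

```python
def apply_profile_update(entity, name, title):
    """Sync label and description with the profile page; keep old labels as aliases."""
    changed = False
    if name != entity["label"]:
        # Retain the previous name as an alias so the reference point is not lost
        if entity["label"] not in entity["aliases"]:
            entity["aliases"].append(entity["label"])
        entity["label"] = name
        changed = True
    if title != entity["description"]:
        entity["description"] = title
        changed = True
    return changed


entity = {"label": "Jane Doe", "description": "Hydrologist", "aliases": []}
apply_profile_update(entity, "Jane Smith", "Research Hydrologist")
# entity["label"] is now "Jane Smith" and "Jane Doe" is kept in entity["aliases"]
```

Returning a flag makes it easy to skip the write entirely when nothing on the profile page has changed.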
I've run into cases where Staff Profile pages list more than one organization that a person is affiliated with, mostly a laboratory within a Science Center. The web scraping routine I used didn't account for these cases, and I need to rework that part of it. The rework will re-cache the scraped content in item discussion pages and add the additional affiliation links to the graph.