skybristol / geokb

Data processing workflows for initializing and building the Geoscience Knowledgebase
The Unlicense

Link USGS personnel to occupations #5

Closed: skybristol closed this issue 11 months ago

skybristol commented 11 months ago

Using titles from the USGS personnel profiles scraped and stored in the Item_talk pages for person entities, we can link many of these people to standardized entities under either the "worker" class or the "USGS leadership roles" class. I need to rework some older code for processing titles, and potentially other texts, via entity recognition to produce and emplace the linkages. One important aspect of this is recording known start and end dates for these statements as qualifiers, since roles and occupations change over time.
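
For reference, a minimal sketch of what the title-to-claim step could look like. The lookup table, property IDs, and QIDs here are placeholders, not the actual GeoKB identifiers or the final entity recognition approach:

```python
# Hypothetical sketch: map a scraped profile title to occupation claims
# with a date qualifier. Property IDs and the lookup table are placeholders.
from datetime import date

# Assumed lookup of normalized title fragments to GeoKB item QIDs under
# the "worker" or "USGS leadership roles" classes.
OCCUPATION_LOOKUP = {
    "research geologist": "Q123",   # placeholder QID
    "center director": "Q456",      # placeholder QID
}

def occupation_claims(title: str, observed_on: date) -> list[dict]:
    """Return claim-like dicts for any occupations recognized in a title."""
    claims = []
    normalized = title.lower().strip()
    for fragment, qid in OCCUPATION_LOOKUP.items():
        if fragment in normalized:
            claims.append({
                "property": "P_occupation",          # placeholder property ID
                "value": qid,
                "qualifiers": {
                    # Known-as-of date; an end date gets added later when the
                    # role or occupation changes.
                    "P_point_in_time": observed_on.isoformat(),
                },
            })
    return claims
```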

Included in this work will be identifying additional characteristics that extend beyond "occupation."

skybristol commented 11 months ago

One approach that I think could be fairly elegant would be to develop a per-entity processing algorithm. I've generally used an approach where I pull a whole batch of things, do some work to figure out what to do in terms of entities and claims, and then run a big process to push everything into the Wikibase. However, as we look toward a production state, breaking everything up so that an entity kicks off its own specific work when needed might be the way to go.

The steps here for people entities would be something like the following:

1. Some process runs to check the USGS Personnel Profiles system for new entities we don't know about yet. I've built inventory scrapers in the past and need to revisit them. From a GeoKB perspective, this would get kicked off from the item representing the personnel profiles data source. I might consider something unconventional like storing the entire "take" from the inventory scrape on that source item, since the raw data doesn't really have anywhere else to live.
2. The inventory spawns new person items, and it should also tell us when someone disappears, perhaps resulting in a qualifier somewhere. There are still cases where the web team puts a redirect in place for a name change or another reason, and those have to get dealt with somewhere.
3. Individual person entities carry the URL to their profile page. Another process can use that URL to check for changes. Unfortunately, there's no timestamp or anything on the pages to use, so we have to diff the scraped content and decide whether to drop an updated structure into the item talk page (sketched below). This could happen asynchronously as a microservice on a schedule (e.g., a weekly check).
4. Updates to item talk pages could trigger a processor that runs through and does everything the content indicates (e.g., title matching/processing to derive occupation claims).
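
A rough sketch of the step 3 check, assuming the cached scrape is available as text; all function and parameter names here are hypothetical placeholders, not existing GeoKB code:

```python
# Fetch a person's profile page, hash the content, and compare against what
# was cached at the last scrape. In practice you would probably normalize or
# extract the profile fields before hashing to avoid noise from dynamic page
# elements.
import hashlib
import requests

def profile_needs_refresh(profile_url: str, cached_content: str) -> bool:
    """Return True if the live profile differs from the previously cached scrape.

    The profile pages carry no timestamp we can rely on, so the only option
    is to diff (here, hash-compare) the content itself.
    """
    response = requests.get(profile_url, timeout=30)
    response.raise_for_status()
    live_hash = hashlib.sha256(response.text.encode("utf-8")).hexdigest()
    cached_hash = hashlib.sha256(cached_content.encode("utf-8")).hexdigest()
    return live_hash != cached_hash
```

A scheduled job (e.g., weekly) would loop over person items, call this kind of check, and drop an updated structure into the Item_talk page when the content has changed, which in turn would trigger the downstream claim processing in step 4.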

This same concept can apply to any case where I have a dynamic source with either a stable access point stored on the entities themselves or some kind of caching approach like the one we're using here.

There's a little bit of a chicken and egg problem here in that I have to critically examine a source to determine what all needs to get built out in reference entities so the process actually results in linked data flowing into the GeoKB. But that can also be iteratively improved.

skybristol commented 11 months ago

It would ultimately be best to build on this and process everything from the cached profile information at once so we're only reading it from the Item_talk pages one time. However, I'm going to approach this in increments and build just the occupation piece first, since I have a good number of those items established as classes to link with.

skybristol commented 11 months ago

I've started this overall process of interpreting the information in profile pages into linked data within the GeoKB, with a focus on occupations. The algorithm I started with also uses the title to determine whether someone is a supervisor (also included as an occupation claim) and whether they are evaluated under RGE (following the convention of using "Research" in front of a scientific discipline in the title). This is by no means a perfect approach, but it starts to flesh out the graph with further detail we can test and build upon.
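
An approximate sketch of those title heuristics; the exact matching rules in the actual code may differ:

```python
import re

def title_flags(title: str) -> dict:
    """Derive supervisor and RGE-evaluated flags from a profile title."""
    normalized = title.strip()
    return {
        # Matches "Supervisory Research Geologist", "Supervisor", etc.
        "is_supervisor": bool(re.search(r"\bsupervis", normalized, re.IGNORECASE)),
        # Convention: "Research" in front of a scientific discipline implies
        # the person is evaluated under RGE.
        "rge_evaluated": bool(re.match(r"(supervisory\s+)?research\s+\w+", normalized, re.IGNORECASE)),
    }

print(title_flags("Supervisory Research Geologist"))
# {'is_supervisor': True, 'rge_evaluated': True}
```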

I included the URL of the profile page from which the claims were derived as a reference URL on the claims, along with the timestamp when the profile was cached as a point in time qualifier. This indicates that the information was considered current as of a particular date.
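
The resulting claim shape looks roughly like the structure below; the property IDs, item IDs, date, and URL are illustrative placeholders, and the actual write goes through whatever Wikibase client the GeoKB workflows use:

```python
# Illustrative shape of one occupation claim with its qualifier and reference.
claim = {
    "property": "P_occupation",               # placeholder property ID
    "value": "Q_research_geologist",          # placeholder occupation item
    "qualifiers": {
        # Timestamp of the cached profile scrape: the information was
        # considered current as of this date.
        "P_point_in_time": "+2023-11-01T00:00:00Z",   # example date
    },
    "references": [
        {
            # Profile page the claim was derived from (example URL).
            "P_reference_url": "https://www.usgs.gov/staff-profiles/example-person",
        }
    ],
}
```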

What's missing in the reference for these claims is a pointer to the algorithm used and its code. I think what I want to do there is follow the pattern of including an item in the GeoKB that represents the coded algorithm, with pointers to its source code and any details about the execution environment. That would be pretty elegant and could allow for filtering on characteristics of the process used.
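
One possible shape for such an algorithm item, so claim references can point at it; the label and properties here are placeholders, not a settled model:

```python
# Hypothetical item representing the coded algorithm itself.
algorithm_item = {
    "label": "USGS profile title to occupation processor",   # placeholder label
    "claims": {
        "P_source_code_repository": "https://github.com/skybristol/geokb",
        "P_software_version": "commit or release identifier",
        "P_execution_environment": "notes on runtime, schedule, dependencies",
    },
}
# Occupation claims would then include this item in their reference, alongside
# the profile URL, allowing filtering on how claims were produced.
```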