skybristol / geokb

Data processing workflows for initializing and building the Geoscience Knowledgebase
License: The Unlicense

align igneous rock classification with Huber source #22

Closed · skybristol closed this issue 10 months ago

skybristol commented 10 months ago

Mike Zientek pointed to another source for igneous rock classification that needs to be organized into the GeoKB and harmonized with the original source from Mindat.

https://terminologies.gfbio.org/terminology/LIT_I

The gfbio API could be used here, or we might simply download the SKOS/RDF source and write a process to link terms between our current igneous rock classes and the Huber source. We should be able to add "same as" links to the URIs from this source for exact matches, then look at what's missing and decide how to deal with those. The hierarchy here is expressed through SKOS broader/narrower relationships, so another task is checking whether the classification systems used by Mindat and this source differ.
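
A rough sketch of what that linking pass could look like, assuming we parse a downloaded SKOS/RDF dump with rdflib and match on normalized prefLabel (the file name, the GeoKB item dict, and the exact-match rule are all placeholders):

```python
from rdflib import Graph
from rdflib.namespace import SKOS

# Parse a locally downloaded SKOS/RDF dump of the Huber classification
# (file name is a placeholder for whatever the gfbio download provides)
g = Graph()
g.parse("huber_igneous_rocks.rdf", format="xml")

# Build a lookup of normalized prefLabel -> concept URI for exact matching
huber_labels = {
    str(label).strip().lower(): str(concept)
    for concept, label in g.subject_objects(SKOS.prefLabel)
}

# geokb_items stands in for our existing igneous rock classes (hypothetical QIDs)
geokb_items = {"basalt": "Q123", "granite": "Q124"}
matches = {
    qid: huber_labels[name]
    for name, qid in geokb_items.items()
    if name in huber_labels
}

# Walk broader/narrower to compare the hierarchy against Mindat's
for concept, broader in g.subject_objects(SKOS.broader):
    print(f"{concept} --broader--> {broader}")
```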

skybristol commented 10 months ago

Another option to explore on this would be transforming the RDF into a flat table and then using OpenRefine to reconcile names. That is probably worth a try since the Xentity folks still have that reconciliation service operating temporarily.
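
For the flat-table route, something like the following could dump the concepts to CSV for import into OpenRefine (the column choices are just a guess at what the reconciliation would need):

```python
import csv
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("huber_igneous_rocks.rdf", format="xml")

# One row per SKOS concept: URI, prefLabel, pipe-delimited altLabels, broader concept
with open("huber_igneous.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["uri", "prefLabel", "altLabels", "broader"])
    for concept in set(g.subjects(SKOS.prefLabel, None)):
        pref = g.value(concept, SKOS.prefLabel)
        alts = "|".join(sorted(str(a) for a in g.objects(concept, SKOS.altLabel)))
        broader = g.value(concept, SKOS.broader)
        writer.writerow([str(concept), str(pref), alts, str(broader or "")])
```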

skybristol commented 10 months ago

It turns out that OpenRefine has an import function for RDF/XML, which I guess I'd forgotten about. This means we can simply plug in the URL for the full download of the Huber ontology and get it into table format:

http://www.lithologs.net/skos/igneous_rocks/all

I'm experimenting with this now using the PAWS implementation of OpenRefine. I'll report back on what all is needed to spin up the reconciliation service against the GeoKB.

skybristol commented 10 months ago

I've gotten far enough with the mechanics of the OpenRefine route that I think this could be a viable approach for this use case. It does point out the potential need to go back to a basic "instance of" claim ("igneous rock," perhaps), since OpenRefine expects a type classification to narrow the search parameters. I had initially specified every item built from Mindat as an instance of "rock," but represented each of these items as a class rather than an instance when I ran back through and adjusted how the Mindat identifiers were encoded. I'll see if there's a way around that first.
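
As a hedged sketch of how we might audit which items in the igneous rock subclass tree lack an "instance of" claim (the endpoint URL and the P/Q identifiers below are placeholders, not actual GeoKB IDs, and it assumes the query service's default wd/wdt prefixes):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder identifiers: P1 = instance of, P2 = subclass of, Q100 = igneous rock.
# Swap in the real GeoKB endpoint and property/item IDs.
sparql = SPARQLWrapper("https://geokb.wikibase.cloud/query/sparql")
sparql.setQuery("""
SELECT ?item WHERE {
  ?item wdt:P2* wd:Q100 .                    # subclass tree under igneous rock
  FILTER NOT EXISTS { ?item wdt:P1 ?type }   # items with no instance-of claim
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"])
```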

In looking through these records, we will also need to handle things like altLabel values and name variants, meaning we need to reconcile those values as well and work out the schema details for committing anything back.
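
For the variant labels, a simple fuzzy pass (difflib here as a stand-in; the cutoff is arbitrary) could flag near-matches for review rather than auto-committing them:

```python
import difflib

def best_match(label, candidates, cutoff=0.9):
    """Return the closest candidate label above the cutoff, or None."""
    hits = difflib.get_close_matches(label.strip().lower(), candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Example: match a variant spelling against the flattened Huber labels
huber_labels = ["granite", "granodiorite", "alkali feldspar granite"]
print(best_match("Granodiorites", huber_labels))  # -> "granodiorite"
```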

skybristol commented 10 months ago

One thing to keep in mind on this is a core use case that sparked this investigation:

The new LIMS module that handles sample submittal and management for the Analytical Geochemistry Lab (G3) needs an improved pick list/lookup source of rock types for submitters to use when sending samples for analysis. This metadata will carry through to the final distributed data to support selection and search. It will also be useful to incorporate rock classification, and potentially other characteristics, into the next-gen NGDB as further search and filter criteria. Sourcing the LIMS configuration information (a pick list, or however that ends up working) from the GeoKB means we can "scoop up" this additional detail for rock types in an automated pipeline that distributes final data released by the lab.
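
A minimal sketch of that pipeline step, assuming we query the GeoKB SPARQL endpoint (URL and identifiers are placeholders again) and write out a pick list the LIMS could consume:

```python
import json
import requests

ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"  # placeholder endpoint
QUERY = """
SELECT ?item ?label WHERE {
  ?item wdt:P2* wd:Q100 ;        # hypothetical rock classification tree
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""
resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
resp.raise_for_status()
picklist = [
    {"id": row["item"]["value"], "label": row["label"]["value"]}
    for row in resp.json()["results"]["bindings"]
]
with open("rock_type_picklist.json", "w") as f:
    json.dump(picklist, f, indent=2)
```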

skybristol commented 10 months ago

One key architectural element I want to capture here for our thinking is the reconciliation API. Here's a nice blog post on the basic idea we're pursuing. One of the technical deliverables from our Xentity contract is a Docker Compose configuration that spins up a Redis container and a small service to handle the reconciliation back-and-forth between OpenRefine (or some other client) and a Wikibase instance.

I am in discussion with the wikibase.cloud team about getting this spun up as part of that infrastructure. If we don't work that out and we determine this method is viable, we'll need to run this critical component somewhere ourselves. Xentity gave us the basic engineering route to set this up, and I'll also provide a live URL to the service they still have running temporarily for our testing. This gets plugged into an OpenRefine instance (on PAWS or wherever).

This service is based on work from a W3C community group that specified a reconciliation API standard. That could prove useful, architecturally, well beyond what we are working on here. Think about all the other cases where we need to reconcile some kind of imperfect identifier (e.g., a name) against a registry or data platform, pulling back a persistent, resolvable identifier to use in some context. For instance, I can envision a reconciliation service operating against GeoConnex when we have a dataset that needs to reference the registry of stream gages or other spatio-temporal features that platform is helping to clarify.
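
For reference, a client interaction under that spec looks roughly like this (the service URL and type ID are placeholders; the query and response shapes follow the published Reconciliation Service API):

```python
import json
import requests

SERVICE = "https://example.org/reconcile"  # placeholder for the Xentity-hosted service
queries = {
    "q0": {"query": "basalt", "type": "Q100"},        # Q100: hypothetical rock-class type
    "q1": {"query": "granodiorite", "type": "Q100"},
}
# The spec sends a batch of queries as a form-encoded JSON blob and returns
# scored candidate matches keyed by query ID.
resp = requests.post(SERVICE, data={"queries": json.dumps(queries)}, timeout=30)
resp.raise_for_status()
for key, payload in resp.json().items():
    for candidate in payload["result"]:
        print(key, candidate["id"], candidate["name"], candidate["score"])
```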

skybristol commented 10 months ago

After exploring this option with Zientek, we are abandoning this route in favor of the Geoscience Ontology (#25).