skybristol / geokb

Data processing workflows for initializing and building the Geoscience Knowledgebase
The Unlicense

Fix problems with NI 43-101 reports in GeoKB blazegraph store #45

Closed: skybristol closed this issue 4 months ago

skybristol commented 8 months ago

Problems a while ago with the Blazegraph store (which houses the graph representation of Wikibase items and backs the SPARQL interface) left many of our NI 43-101 items returning incorrect results in SPARQL queries. In some cases the query results show the older schema populated when it actually isn't, and in others they show it missing when it is actually present. To deal with this, I need to build a one-off script that goes through every item and takes appropriate action based on the structure of the item it encounters.
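As a rough sketch of the per-item decision such a one-off script might make, the function below compares what the SPARQL index reports against what the item's own JSON contains. The property ID `P999`, the label strings, and the function names are all hypothetical placeholders, and the only assumption about the data is that the entity JSON carries a `claims` mapping keyed by property ID, as in the standard Wikibase entity format.

```python
from typing import Iterable


def has_any_property(entity: dict, property_ids: Iterable[str]) -> bool:
    """Return True if the entity JSON has at least one claim for any given property."""
    # Wikibase can serialize empty claims as a list, so fall back to an empty dict.
    claims = entity.get("claims") or {}
    return any(claims.get(pid) for pid in property_ids)


def plan_action(
    entity: dict,
    sparql_reports_old_schema: bool,
    old_schema_props: Iterable[str],
) -> str:
    """Classify one item by comparing the SPARQL index with the item itself.

    The returned labels are hypothetical; the real script would map each one
    to whatever repair step is appropriate.
    """
    actually_old = has_any_property(entity, old_schema_props)
    if sparql_reports_old_schema and not actually_old:
        return "index_false_positive"  # index shows old schema, item lacks it
    if not sparql_reports_old_schema and actually_old:
        return "index_false_negative"  # item has old schema, index misses it
    return "old_schema" if actually_old else "current_schema"


if __name__ == "__main__":
    # P999 is a made-up property ID standing in for an "older schema" property.
    item = {"id": "Q1", "claims": {"P999": [{"mainsnak": {}}]}}
    print(plan_action(item, sparql_reports_old_schema=False, old_schema_props=["P999"]))
```

Keeping the classification separate from the repair step means the script can first be run in a report-only mode to see how many items fall into each bucket.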

When I do this, I will also work through a newer method of harvesting source items for the existing items, to help set us up for a more standardized and foolproof method for all items we source from any Zotero library. I'll use the same approach I've taken elsewhere, using the item discussion (talk) wiki page to store the original raw source content. Using the metadata URL for a given item with content negotiation (via the w3id.org abstraction) to pull the full JSON document for each item will be slow but less prone to error. I can then convert this to YAML and write it to the item discussion page. These documents have everything about the source item, but a second query is needed to pull the attachment(s) for an item, which gives us the core attachment ID that can be used to fetch the actual file contents. The document also includes the version number, which gives us a basis for checking when we need to fetch an update.
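A minimal sketch of the harvest step described above, under a few stated assumptions: the metadata URL shown is a made-up placeholder, the JSON document is assumed to carry a top-level integer `version` field, and the YAML conversion relies on the third-party PyYAML package. The function names are illustrative only and are not taken from the GeoKB code.

```python
import json
import urllib.request
from typing import Optional


def build_metadata_request(metadata_url: str) -> urllib.request.Request:
    """Build a request that asks for JSON through HTTP content negotiation."""
    return urllib.request.Request(metadata_url, headers={"Accept": "application/json"})


def fetch_item_document(metadata_url: str, timeout: float = 30.0) -> dict:
    """Fetch and parse the full JSON document for one source item."""
    request = build_metadata_request(metadata_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def needs_refresh(stored_version: Optional[int], remote_doc: dict) -> bool:
    """Decide whether the stored copy is stale, based on the version number."""
    remote_version = remote_doc.get("version")  # assumed field name
    if stored_version is None or remote_version is None:
        return True  # nothing to compare against, so refresh to be safe
    return remote_version > stored_version


def to_yaml(doc: dict) -> str:
    """Convert the JSON document to YAML text for the item talk page."""
    import yaml  # PyYAML, a third-party dependency (assumption)

    return yaml.safe_dump(doc, sort_keys=False, allow_unicode=True)


if __name__ == "__main__":
    # Placeholder URL for illustration only; it is not a real item identifier.
    req = build_metadata_request("https://w3id.org/example/item/ABC123")
    print(req.get_header("Accept"))
```

The attachment lookup would be a separate, second request per item and is left out of this sketch.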

This approach will have the added benefit of giving us a full copy of what we are storing in Zotero (other than the file attachments) in an alternate format that we can build on if we need to move elsewhere. We'll have to build in another process to pull file contents and stash them in cold storage (e.g., AWS Glacier) using some appropriate organization scheme.

skybristol commented 4 months ago

The Blazegraph issues appear to be mostly solved at this point by the Wikimedia Deutschland team. I'm running a process to ensure we are synced with the Zotero collection, including processing through a handful of new items added recently. This establishes the following item information:

I also need to revisit a secondary process that consults the xDD API to pull back gddid values for those claims. Work is under way to reprocess the NI 43-101 reports in xDD, and I need to catch up with it.

The next larger piece of work here is to link NI 43-101 reports with entities representing mines/prospects/projects. This is challenging partly because of the existence of other databases describing mines and mine features (MRDS, USMIN, GNIS, etc.) and the lack of conformance among them. We have a concept for classifying the "mines" that these reports refer to partly worked up in the GeoKB ontology, but we need to continue harmonizing it with the various sources. This matters because it is actually the mine "feature" entities that the NI 43-101 reports make claims about (commodities, etc.), and those entities are what we need in our data infrastructure to work against. As we start pulling graph fragments together by processing documents with RAG techniques, we need to figure out what those fragments are going to link to.

skybristol commented 4 months ago

Also included in the above work is a refresh of each item talk page with a YAML transformation of the item data structure along with all attachments.
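For illustration, the talk page refresh could be expressed as a MediaWiki Action API `edit` call. The sketch below only builds the request parameters; it assumes items live in an `Item` namespace so that the talk page title is `Item talk:<QID>` (this varies by Wikibase installation), and the item ID, token value, and function names are placeholders rather than anything taken from the GeoKB code.

```python
def talk_page_title(item_id: str, talk_namespace: str = "Item talk") -> str:
    """Return the talk page title for an item, e.g. 'Item talk:Q123'."""
    # The namespace name is an assumption; some installations use plain 'Talk'.
    return f"{talk_namespace}:{item_id}"


def build_talk_page_edit(
    item_id: str,
    yaml_text: str,
    csrf_token: str,
    summary: str = "Refresh YAML copy of source item",
) -> dict:
    """Build the POST parameters for a MediaWiki action=edit request.

    The whole page text is replaced, which matches the idea of a full refresh.
    """
    return {
        "action": "edit",
        "title": talk_page_title(item_id),
        "text": yaml_text,
        "summary": summary,
        "token": csrf_token,
        "format": "json",
    }


if __name__ == "__main__":
    # Q123 and the token are placeholders; a real token comes from an
    # authenticated session via action=query&meta=tokens.
    params = build_talk_page_edit("Q123", "key: ABC123\nversion: 12\n", "TOKEN+\\")
    print(params["title"])
```

The parameters would then be sent as a POST to the wiki's `api.php` endpoint from an authenticated session.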