skybristol / geokb

Data processing workflows for initializing and building the Geoscience Knowledgebase
The Unlicense

Incorporate all minerals from the GSO source #31

Open skybristol opened 10 months ago

skybristol commented 10 months ago

We need to finish building out the minerals reference in the GeoKB. This was started with a combination of MRData and Mindat reference materials, where I attempted to develop a new type of conceptual mapping to named entities that can be classed as minerals, commodities, or even chemical elements. While we may need to revisit the notion of single element minerals, the mineral commodity approach seems like it should work.

This work will include bringing in a full representation of the GSO minerals module processed through software code, factoring in where I have already instantiated some items from other sources.

skybristol commented 9 months ago

I've started working through the contents of the GSO minerals module and have come up with a few questions we need to answer before I proceed with pulling this into the GeoKB representation. I've attached a simplified Excel table with the major elements from the GSO minerals schema that we might be able to make use of. My questions are based on this.

gso_minerals.xlsx

  1. Do we want to incorporate all minerals from this list or take more of the approach that Peter Schweitzer used for certain things in the USGS Thesaurus where we only bring in selective parts of larger "vocabularies" that are directly related to USGS work? If we do slim it down, how do we decide which minerals to include? Note that the subclass_of field here does contain classification that places certain mineral items as subclassed under other mineral items (e.g., kaolinite is both a mineral material and part of the kaolinite subgroup, which is its own entity).
  2. The primary source for the GSO minerals was RRUFF. They included values in the rruffnameplain column that I'm thinking of using as our primary label when those are different from the value in label because certain queries will work better with values that do not include linguistic special characters or HTML (e.g., Zincohogbomite-2N2S vs. Zincohögbomite-2N2S). We would include the values from label where those are different as aliases, although the example there isn't going to work anyway as we cannot include HTML tags in aliases within Wikibase. Let me know if this is a bad design choice for some reason.
  3. The crystalsystem property in the GSO is a little bit suspect, and I'm not sure we can trust those values. A spot check shows that they do not always align with either RRUFF or Mindat where I'm only seeing single values. The provenance of this property is not discussed in the OFR that published the GSO. If crystal system is useful, I can pull it from Mindat in another step since we have those identifiers. We would model this into the GeoKB such that we have items for the crystal structure types turning them into more functional queryables. But I'd like to know if this is a valuable classifier or not.
  4. The structuralgroup property contains logical linkages to groups that are also classified in the GSO minerals module as "mineral_material," and I'm not entirely sure why they opted to encode it the way they did with simple string values. RRUFF refers to these as "mineral group" and Mindat shows them as a "member of" labeled property. This seems like something we would want to capture in the GeoKB representation. We could include these in the GeoKB using the "part of" general property we already have, simply indicating that one thing is a part of another. We could also provide a reciprocal relationship using "has part" to aid in queries and visual organization of the information. Or we could create a new property for "structural group" or "mineral group" or the like if someone can provide a reasonable way that we would define that property. I also don't understand what fleischersgroup is. In most cases, this is the same value as structuralgroup, but there are exceptions (e.g., Zincrosasite indicates Malachite for "structuralgroup" vs. "rosasite" for fleischersgroup). If someone can explain the significance of these concepts and indicate whether they are useful, that would be helpful.
  5. The Strunz classification in the GSO seems to be simplified from what is in Mindat. Folks had indicated that this would be useful to have in the GeoKB for query purposes, which means we need to first represent the Strunz classification itself in some way so that there are actual items to link with and then incorporate these into the schema for mineral items. It would be useful to know if the strunzcodeV10 and strunzlabel values as used in the GSO mineral module are reasonable and a useful set of codes and labels.
  6. Folks had indicated previously that it is useful to record whether a mineral is on the IMA list. In the GSO ontology, we have both a multi-valued imastatus property and an imanumber. Should we incorporate one or both of these as new properties for claims in the GeoKB? The imanumber would be an ExternalID type of property, simply recording the string. The imastatus would be an Item type property pointing to part of the classification scheme that defines what the terms (e.g., "grandfathered") mean in the IMA mineral list context.
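As a minimal sketch of the label choice raised in question 2, assuming each GSO record is a dict keyed by the label and rruffnameplain column names from the attached table. The helper name `choose_label_and_aliases` and the tag-stripping regex are illustrative assumptions, not part of the GeoKB code.

```python
import re

# Crude HTML tag matcher (an assumption; adequate only for simple inline markup).
TAG_RE = re.compile(r"<[^>]+>")

def choose_label_and_aliases(record):
    """Return (primary_label, aliases) for one GSO mineral record (hypothetical helper)."""
    plain = (record.get("rruffnameplain") or "").strip()
    # Strip HTML tags from the original label, since aliases cannot carry them.
    cleaned = TAG_RE.sub("", (record.get("label") or "")).strip()
    # Prefer the plain RRUFF name; fall back to the cleaned label if it is missing.
    primary = plain or cleaned
    aliases = [cleaned] if cleaned and cleaned != primary else []
    return primary, aliases

example = {"label": "Zincohögbomite-2N2S", "rruffnameplain": "Zincohogbomite-2N2S"}
primary, aliases = choose_label_and_aliases(example)
```

Here the plain form becomes the primary label and the accented form is kept as an alias, so searches on either spelling can resolve to the same item.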
jrosera commented 9 months ago
  1. I do not think we need all of the minerals for our purposes. I think a good starting point would be pulling out all of the unique mineral names from MRDS. Most of the relevant "economic geology" minerals will be in there, and we can relate things from there.
  2. I am okay with using the 'rruffnameplain' labels.
  3. 'crystalsystem' is not all that important for what we do. Feel free to skip it, or if you import it, flag it as unvalidated or something.
  4. The key point here is providing a reciprocal relationship between 'structuralgroup' and the primary label. For example, very few geologists would probably realize what you are talking about if you mention zincrosasite (it was a new one to me), but all of us economic geologists know malachite. Economic geology reports and whatnot often describe things at an observational level, which is sometimes close to the 'structuralgroup' value captured here. Another example might be the Columbite structural group. If an economic geologist is working in an area known to contain columbite, they will probably be able to identify it with a microscope or whatnot. However, they will not be able to immediately tell if it is Columbite-(Fe) or Columbite-(Mn) (these are 'rruffnameplain' labels for Columbite group minerals) without some sort of chemical measurement. So, I imagine we will be working off of structural group-level data quite often as we compile information from mineral reports.
  5. I skimmed around and I am not sure how it is simplified? It appears to follow the NN.XY.### format, although not all of the minerals have 3 numbers in the ### spot (this represents the mineral or group). When I have used Strunz codes in the past to aggregate mineral occurrence data, we aggregated to the X and NN levels, which are the two highest orders of the hierarchy. That hierarchical detail seems to be present here, so I say we just use it. Unless I am missing some detail?
  6. I still do not see much use for IMA status in our work, specifically. If it is easy to roll over with what you are doing, then sure, but if it turns into any sort of challenge, I would say to drop it.
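The aggregation described in point 5 could be sketched as follows, assuming codes follow the NN.XY.### pattern with one to three digits in the last position. The regular expression, the function names, and the sample code strings are illustrative assumptions; codes that do not match the pattern are simply skipped.

```python
import re
from collections import Counter

# Assumed pattern: NN (1-2 digits), X and Y (uppercase letters), ### (1-3 digits).
STRUNZ_RE = re.compile(r"^(?P<nn>\d{1,2})\.(?P<x>[A-Z])(?P<y>[A-Z])\.(?P<num>\d{1,3})$")

def parse_strunz(code):
    """Split a Strunz code string into hierarchy levels, or return None if it does not match."""
    m = STRUNZ_RE.match(code.strip())
    if m is None:
        return None
    nn, x, y = m.group("nn"), m.group("x"), m.group("y")
    return {"nn": nn, "x": f"{nn}.{x}", "xy": f"{nn}.{x}{y}", "num": m.group("num")}

def aggregate(codes, level="nn"):
    """Count codes at the requested hierarchy level ('nn', 'x', or 'xy')."""
    counts = Counter()
    for code in codes:
        parsed = parse_strunz(code)
        if parsed is not None:
            counts[parsed[level]] += 1
    return counts

sample_codes = ["9.ED.05", "5.BA.10", "5.BA.15", "not a code"]  # illustrative strings
by_class = aggregate(sample_codes, "nn")
by_division = aggregate(sample_codes, "x")
```

Grouping on the `nn` and `x` keys corresponds to the two highest levels of the hierarchy mentioned above.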
skybristol commented 9 months ago

@jrosera - Thanks for the comments. That's perfect!

I'll work in a process to essentially consult MRDS to subset the full list of "mineral materials" from the GSO. I think that seems reasonable, and we can always come back and pull in more entities as use cases expand beyond economic geology.
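A rough sketch of that subsetting step, assuming the GSO records are dicts with an rruffnameplain key and the MRDS names arrive as a plain list of strings. The function names and the case-insensitive matching rule are assumptions for illustration only.

```python
def normalize(name):
    """Case-insensitive, whitespace-trimmed form used for matching (an assumed rule)."""
    return name.strip().casefold()

def subset_by_mrds(gso_records, mrds_names):
    """Return (matched GSO records, MRDS names with no GSO match)."""
    wanted = {normalize(n) for n in mrds_names}
    matched = [r for r in gso_records if normalize(r["rruffnameplain"]) in wanted]
    found = {normalize(r["rruffnameplain"]) for r in matched}
    unmatched = sorted(wanted - found)
    return matched, unmatched

gso = [{"rruffnameplain": "Malachite"}, {"rruffnameplain": "Zincrosasite"}]
matched, unmatched = subset_by_mrds(gso, ["malachite", "Columbite "])
```

The `unmatched` list is the part that would need manual review, for example names that exist only at the structural group level or only in another source.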

Thanks for providing feedback on things to ignore. There is no need to pull in additional information beyond what we will actually use as we're not trying to set up the comprehensive resource here.

For the structural group part of this, I think I'll start with the existing part of/has part properties we already have. These are designed to produce the reciprocal relationship you're talking about. There are different schools of thought in the semantics world on using a bunch of different specific properties vs. more general relationships. I go back and forth, and we can always rework the relationships at a later time if needed.
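To illustrate the reciprocal relationship, here is a minimal sketch that turns structuralgroup strings into paired statements. It assumes records are dicts with rruffnameplain and structuralgroup keys, represents claims as plain (subject, property, object) tuples rather than real Wikibase calls, and skips records whose group equals their own name; all of these are assumptions.

```python
def reciprocal_claims(records, forward="part of", inverse="has part"):
    """Build forward and inverse statements as (subject, property, object) tuples."""
    claims = set()
    for rec in records:
        mineral = rec["rruffnameplain"].strip()
        group = (rec.get("structuralgroup") or "").strip()
        # Skip empty groups and self-references (assumed handling).
        if not group or group.casefold() == mineral.casefold():
            continue
        claims.add((mineral, forward, group))
        claims.add((group, inverse, mineral))
    return sorted(claims)

records = [
    {"rruffnameplain": "Columbite-(Fe)", "structuralgroup": "Columbite"},
    {"rruffnameplain": "Columbite-(Mn)", "structuralgroup": "Columbite"},
    {"rruffnameplain": "Quartz", "structuralgroup": ""},
]
claims = reciprocal_claims(records)
```

Because the property labels are parameters, the same function would work unchanged if more specific properties were adopted later.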

I had worked out the basics of a method to pull in the full Strunz classification system from source material. I'll look at that again but may end up simplifying that for our immediate purposes and use only the part of the classification referenced in the GSO.

jrosera commented 9 months ago

@skybristol

I cannot speak too much to ongoing semantics debates, but I would just recommend that you keep in mind that very specific mineral names often require lab methods that are not necessarily part of routine economic geology surveys - especially the older reports that more or less list out mineral phases that were observed in mapping / drilling / thin section etc. While there is great information baked into specific, end-member mineral names based on full geochemical characterization, much of what we work with is at the slightly more general level (e.g., structural group).

My guess is that if you pull all of the unique MRDS materials flagged as ore or gangue you will have a mix of names from specific to structural group.

skybristol commented 9 months ago

@jrosera - That's the approach I'm taking this morning. Looking at that list of 743 unique names found in ore or gangue, we do not have complete alignment with either the Geoscience Ontology or Mindat.

The bottom line is that mineralogy is complex, just like biological taxonomy, which is closer to my own domain. Every scientist who approaches this problem of classification and identification is going to come at it a little bit differently, and the whole system is in flux at any given point, with no perfect way of capturing the complexity of the state of scientific knowledge in the simplicity of linked open data and explicit semantics. As you say, at any point in time, some things that show up in an attempt to classify everything are going to be in the process of being studied and better characterized.

Our overall goal with the GeoKB is to have named and identified entities from all the subject matter we study and encounter in our specific scientific portfolio. This is conceptually similar to what Peter Schweitzer has done with much of the USGS Thesaurus work, but we are taking that further into other domains and focusing significant attention on linking "our concepts" with linkable entities from other knowledge representations in order to do the following:

  1. help other people (and ourselves) understand what we're talking about
  2. provide an avenue to go after further details from reference sources when we need them
  3. help distinguish our unique scientific contributions so that other people/systems can leverage those

Decision Points

  1. Make sure the 743 named entities from MRDS have entities in the GeoKB. Names will be either the primary label or an alias, depending on whether we identify a preferred name in a trusted source.
  2. Items will be classified as mineral material and potentially another class if indicated by a trusted source.
  3. Include same as claims to either the Geoscience Ontology, Mindat, or both. This is our assertion that we are functionally talking about the same entity and expect to be able to follow those links to further information as needed.
  4. For the structural group concept, I'm thinking of introducing two new properties, member of and has member, to accomplish reciprocal relationships. These would be semantically distinct from part of and has part, and we'll work at further clarifying the significance of these properties through time and practice.
  5. Once I establish connections to GSO/Mindat, I'll look for any differences in Strunz classification between the sources. We'll have to decide on a trusted source for that information if it comes to it. I'll introduce at least those aspects of the Strunz classification we need to link to but keep that as simple as possible so we can just use these as grouping attributes.
  6. If there is additional information that could prove useful as reference (e.g., comments from GSO items, links to other resources), I'll drop these into item discussion pages so we have it ready at hand.

Questions

  1. The GSO includes a property with a list of chemical elements. I don't know the exact provenance on this, but it appears to align with the two references to chemistry (IMA and RRUFF). Since we have chemical elements as identified items in the GeoKB, is it useful to establish linkages on chemical composition? This would allow us to efficiently answer questions like, "what minerals have calcium and magnesium?" (I actually don't know why the GSO did not encode linkages to chemical elements more explicitly like this since they incorporated a module for elements and could have handled this better than a string list.)
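A small sketch of the kind of query this would enable, assuming the element list is a whitespace- or comma-separated string of element symbols. The `name` and `elements` keys, the function names, and the three sample records are illustrative assumptions, not the GSO schema.

```python
import re

def parse_elements(value):
    """Split a string list of element symbols into a set."""
    return {tok for tok in re.split(r"[\s,;]+", value.strip()) if tok}

def minerals_with(records, required):
    """Names of minerals whose element set contains every required symbol."""
    required = set(required)
    return sorted(r["name"] for r in records if required <= parse_elements(r["elements"]))

sample = [
    {"name": "Calcite", "elements": "Ca C O"},
    {"name": "Dolomite", "elements": "Ca Mg C O"},
    {"name": "Magnesite", "elements": "Mg, C, O"},
]
ca_and_mg = minerals_with(sample, {"Ca", "Mg"})
```

With explicit links to element items in the GeoKB, the same subset test could be expressed as a query over claims instead of string parsing.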