tdwg / tag

Technical Architecture Group
https://tag.tdwg.org/

Consider use of SKOS-XL for labels across TDWG vocabularies #22

Closed baskaufs closed 10 months ago

baskaufs commented 5 years ago

I have been tracking the development of several vocabularies, and one issue that comes up repeatedly is how to handle the variety of strings that are provided as values for property terms. In particular, there is a need to associate additional metadata with those strings - for example, to record the provenance of a particular string.

Although we give these strings a variety of names ("string values", "verbatim literals", "names", etc.), they are all effectively labels of some kind. Rather than developing ad hoc terms for dealing with these labels in each new vocabulary that is developed, it seems preferable to have a consistent approach for dealing with labels across vocabularies within TDWG. The W3C SKOS Standard contains an extension known as SKOS-XL that is specifically designed for handling metadata about labels. After thinking about it for a while, I'm pretty sure that SKOS-XL could be adapted to most of the label-related use cases that we have across developing vocabulary standards within TDWG.

I have prepared a document with some background on SKOS-XL that includes examples showing how it could be used with labels for people and taxonomic names. The examples primarily show how the SKOS-XL model would be applied in a table-based database system and in a graph-based system (i.e. Linked Data). My understanding of how SKOS-XL might be used came from looking at examples from the work of one of our peer organizations, the Getty Thesaurus of Geographic Names (TGN). The TGN faces a problem similar to ours - tracking many labels derived from a multitude of sources. In the background document, I've linked to a TGN example and provided an abbreviated version of TGN metadata in an appendix.
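To make the table-to-graph correspondence concrete, here is a minimal sketch, not taken from the background document, of how rows in a label table could be turned into SKOS-XL triples. The column names, the `ex:` identifiers, and the choice of which string is preferred are all assumptions made for illustration; only `skosxl:Label`, `skosxl:prefLabel`, `skosxl:altLabel`, and `skosxl:literalForm` come from SKOS-XL itself.

```python
import csv
import io

# Hypothetical label table: column names and ex: identifiers are assumptions.
LABEL_TABLE = """subject,label_id,label_type,literal_form
ex:person1,ex:label1,prefLabel,Müll.Hal.
ex:person1,ex:label2,altLabel,"Müller, C."
"""


def rows_to_triples(table_text):
    """Turn each table row into three SKOS-XL triples (as plain tuples)."""
    triples = []
    for row in csv.DictReader(io.StringIO(table_text)):
        label = row["label_id"]
        # The entity points at a label resource rather than at a bare string.
        triples.append((row["subject"], "skosxl:" + row["label_type"], label))
        triples.append((label, "rdf:type", "skosxl:Label"))
        # The string itself hangs off the label resource.
        triples.append((label, "skosxl:literalForm", row["literal_form"]))
    return triples


for s, p, o in rows_to_triples(LABEL_TABLE):
    print(s, p, o)
```

Because each label is a resource with its own identifier, further columns in the table (source, date, contributor) could become further triples about that label without touching the entity it labels.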

If it appears that the SKOS-XL approach would be appropriate to use across TDWG vocabulary standards, then a next step would be to adopt standardized cross-vocabulary properties to describe the skosxl:Label instances. TGN uses terms they have minted themselves and it's possible that we might just use theirs. Alternatively, we might look at the W3C PROV ontology to see if there are parts of it that might be applicable in this situation. The Attribution Interest Group has adopted some terms from PROV in their work.

After I finish submitting this issue, I'm going to try to link to other issues that are relevant to this one. If you have thoughts about this suggestion, please post them as comments here.

nielsklazenga commented 5 years ago

Hi @baskaufs . It would be good to add an extra column to the personLabels.csv with the form of the label (or how it has been derived, or where it comes from), like Tropicos has 'MO Abbreviation', 'FNA Abbreviation' etc. Different forms can be preferred for different uses. For example, 'Müll.Hal.' (without a space) in my example (in tdwg/tnc#24) is not actually a spelling error, but is the IPNI standard form (and is in accordance with the Authors of Plant Names standard) and is preferable for use in the authorship of a botanical name. On the other hand, in the authorship of a publication, 'Müller, C.' (or lastName + ', ' + initials) would be preferable. So, rather than preferred and alternative labels, we are talking about different canonical forms for different purposes.
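As a rough sketch of the "different forms for different purposes" idea, the extra column could be used to select a label by purpose instead of by preferred/alternative status. The column name `form`, the form values, and the `ex:person1` identifier below are hypothetical; the two label strings are the ones from the paragraph above.

```python
# Hypothetical rows of personLabels.csv with an added "form" column.
PERSON_LABELS = [
    {"person": "ex:person1", "label": "Müll.Hal.", "form": "IPNI standard form"},
    {"person": "ex:person1", "label": "Müller, C.", "form": "publication author"},
]


def label_for(person, form, rows=PERSON_LABELS):
    """Return the label of the requested form for a person, or None."""
    for row in rows:
        if row["person"] == person and row["form"] == form:
            return row["label"]
    return None


def publication_form(last_name, initials):
    """Build the lastName + ', ' + initials form described above."""
    return last_name + ", " + initials


print(label_for("ex:person1", "IPNI standard form"))
print(publication_form("Müller", "C."))
```

Keying on the form rather than on a preferred/alternative flag means no label is marked as "wrong"; each is simply the canonical choice for one context.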

While Names are things (or can be things), the authorship of a name is really just a string. Of the constituting parts, which are Agents, only 'authors' (or 'CombinationAuthors' in TCS), can be directly linked to the Name; 'basionymAuthors' are authors of a different Name, the basionym, so this is a relationship between Names, rather than a relationship between an Agent and a Name; and 'exAuthors' are not really authors at all, but an optional part of the attribution. It doesn't make sense to have a nameThing–Person join table, as (1) the relationship between a Name thing and an author Agent is one-to-one and (2) the author Agent is often a Group.

'Dicranum braunii Müll.Hal.' is an unfortunate example, as it is just a string, not a Name thing (and at best an incorrect label for a Name thing). 'Dicranum braunii Bosch & Sande Lac.' and 'Dicranum braunii Müll.Hal. ex Bosch & Sande Lac.' are alternative labels for the same Name thing. You are forgiven though, as both the GNA parser and Tropicos get it wrong as well (for different reasons). So the variation in name strings is not only caused by minor (or major) variations in the author name, but can also be caused by incorrect authors, or by orthographic variants of the name itself (which would be treated as different Name things if one were to bother with them). This sort of thing will happen an awful lot, and for this reason I see little benefit in having alternative labels for Name things. There should just be the one canonical form. We could use skos:prefLabel for that, but it might be better to (also?) use dc:title, as is already done in the TDWG Taxon Name LSID Ontology.

The same goes for other resources where there is a similar proliferation of "alternative labels", or there is not really a canonical form. For example, a dc:title property on the tcs:TaxonConcept would be the literal alternative of the OpenBiodiv-O TaxonomicConceptLabel and in my interpretation of tdwg/dwc#181, a dcterms:title property on the dwc:Identification object (i.e. verbatimIdentification) would do the trick.

For resource types where there are a limited number of alternative labels, like Agents and Places, SKOS-XL could come in handy, but I am not sure in how many TDWG standards you'll find these. Controlled vocabularies for terms come to mind, but there you can probably do without the extension. Traits (SDS) maybe...

All my personal opinion of course; other members of the TNC might think I'm stark raving mad.

baskaufs commented 5 years ago

Hmm. Well if it is true that there is only ever a single string (label) for an entity, then there isn't much of a point in using SKOS-XL. Then you just make statements (such as provenance information) about the entity itself and assume that the statements that you make about the entity also apply to the string. In that case, it doesn't much matter if the string is linked by rdfs:label, skos:prefLabel, dcterms:title or whatever.

However, I was under the impression that mistaken, wrong, or non-preferred labels for things were relatively common and that there might be a desire to track them. If I say "Dicranum braunii Mull.Hal." instead of "Dicranum braunii Müll.Hal.", that is a different string because I mistakenly used "u" instead of "ü", right? In that case, if we want to track who used the incorrect string and when, then instantiating two separate SKOS-XL instances (one for the correct/prefLabel, and one for the incorrect/altLabel or hiddenLabel) would make sense. In addition, creating one or more altLabel or hiddenLabel instances and linking them to the entity provides a means to direct users to the right prefLabel without instantiating an entity instance for every wrong label. Do we really want to create a new "name thing" record if something is clearly a typo or character encoding error? Maybe so - I don't claim to understand all of the idiosyncrasies of taxonomic nomenclature.
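A minimal sketch of that redirect idea, assuming a simple in-memory model and not any existing TDWG implementation: one entity carries several label instances, each with its own optional provenance, and any known string resolves to the entity's preferred label. The class names, the `used_by` and `used_on` fields, and the `ex:` identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class XLabel:
    literal_form: str               # the string itself (skosxl:literalForm)
    kind: str                       # "prefLabel", "altLabel", or "hiddenLabel"
    used_by: Optional[str] = None   # hypothetical provenance: who used it
    used_on: Optional[str] = None   # hypothetical provenance: when


@dataclass
class NameThing:
    iri: str
    labels: List[XLabel] = field(default_factory=list)

    def pref_label(self) -> str:
        """Return the literal form of the single preferred label."""
        return next(l.literal_form for l in self.labels if l.kind == "prefLabel")


def build_index(things):
    """Map every known string, right or wrong, to the entity it labels."""
    return {l.literal_form: t for t in things for l in t.labels}


name = NameThing("ex:name1", [
    XLabel("Dicranum braunii Müll.Hal.", "prefLabel"),
    XLabel("Dicranum braunii Mull.Hal.", "hiddenLabel", used_by="ex:someUser"),
])
index = build_index([name])

# The misspelled string leads back to the same entity and its preferred label.
print(index["Dicranum braunii Mull.Hal."].pref_label())
```

The misspelling gets its own label record, with room for provenance, but no second entity is created for it.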

If there really are few use cases for SKOS-XL, then this suggestion can just happily die a quiet death.

nielsklazenga commented 5 years ago

This is probably not the place to explain the intricacies of taxonomic nomenclature, so suffice it to say that I wasn't dismissing your suggestion at all, just annotating some of the examples.

I think, rather than actively looking for situations where extended labels might be useful, or applying them anywhere, or dismissing the idea if we can't immediately find something to hang them on, we should keep the SKOS-XL option in the back of our minds and apply it when we think it is the best solution.

nielsklazenga commented 5 years ago

@baskaufs, I realised that I am probably too close to the examples you chose to see the bigger picture. I've spent some time playing around with your example. My take on it is in this gist. See if you can live with what I did. I have also created an issue in the TNC repository (tdwg/tnc#25) for further discussion.

baskaufs commented 5 years ago

Love it. (reference this comment)

baskaufs commented 10 months ago

Some aspects of this have been adopted in the draft TCS specification. Otherwise, this doesn't seem to have gotten much traction, so I'm closing it.