tdwg / tag

Technical Architecture Group
https://tag.tdwg.org/

Best practices for borrowing terms from other vocabularies #39

Open baskaufs opened 1 year ago

baskaufs commented 1 year ago

The Material Sample Task Group was charged with cleaning up some of the confusion surrounding terms used to describe material objects in Darwin Core. To facilitate this, a proposal was made to borrow the term dcterms:PhysicalResource from Dublin Core to use as a parent class for all kinds of material objects. During discussion of this proposal, several counter-proposals were summarized in the Material Sample issues tracker. At the 2023-01-18 Material Sample TG meeting, the group plans to recommend one of the alternatives as a consensus recommendation to be proposed for addition to Darwin Core.

This issue raises several questions about best practices for borrowing non-TDWG terms to become an official part of a TDWG vocabulary. If the TAG has any thoughts about the technical merits of the different approaches, it would be timely to express them prior to January 18 to provide advice to aid the Material Sample TG in their decision-making.

Existing practice within TDWG

Darwin Core currently borrows one class term (dcterms:Location) and nine property terms from Dublin Core DCMI Metadata Vocabulary. I believe that Darwin Core was designed to be “Dublin Core-like” in its structure and operation. (@tucotuco can you confirm this?) Darwin Core does not officially import terms from any other non-TDWG vocabulary, although several are recommended for use when expressing Darwin Core as RDF. These Dublin Core terms are versioned and an attempt has been made to be explicit about the version that was incorporated into Darwin Core. There have been some proposals to import terms from other vocabularies (for example https://github.com/tdwg/dwc/issues/40 and https://github.com/tdwg/dwc/issues/38), however thus far none have been adopted.

Audubon Core borrows many terms from several other non-TDWG vocabularies. During the creation of Audubon Core, I believe that the task group operated under the principle that they should only mint new terms when there was not a well-known existing term that could be used. The issue arose about what should happen if the non-TDWG terms were updated to new versions and the Audubon Core Maintenance Group adopted a policy that non-TDWG terms should be frozen at the version originally adopted unless an explicit decision was made to update them.

The Vocabulary Maintenance Specification states in Section 1.4 that TDWG vocabularies include simply-defined terms without range and domain declarations or additional semantics. Those terms can be enhanced by adding additional feature layers that generate entailments.

Technical questions:

  1. Broad, well-known terms vs. narrow minted terms. Is it more advantageous to use a term from a well-known vocabulary that has a broad definition (e.g. importing dcterms:PhysicalResource) or to mint a more narrowly defined term (e.g. minting dwc:MaterialEntity) that has meaning that is more specific in its intended use?
  2. Ontologically-defined terms vs. "bag of terms" level terms. Should a more well-known term having declared semantics (sub- and super-class, disjoints) be imported (e.g. bfo:0000040; material entity) or should TDWG mint its own term (e.g. dwc:MaterialEntity) and link it formally or informally to the more well known term to keep the new term at the “bag of terms” level?
  3. Opaque vs. camelCase local names for classes and properties. Adopting bfo:0000040 would deviate from the current practice of limiting class local names to UpperCamelCase English phrases. (Controlled vocabulary concepts currently have opaque local names.) Is that a problem?

Please note that this issue is primarily concerned with the general principles illustrated by these alternatives rather than the merits of the specific alternate proposals themselves (i.e. not concerned with details of the definitions, exact choice of labels, etc.).

Please express relatively succinct opinions as comments here. For complex or extended discussion, please use the TAG Slack channel. This topic will also be discussed at the 2023-01-09 TAG meeting.

jbstatgen commented 1 year ago

The future will require standards to support reasoning and with that machine-actionability. Already now the question of hierarchies is acute, eg. in the Material Sample and LtC task groups. Thus, a best practice should pre-prepare DwC for structure and relationships based on meaning, logic and functions.

tucotuco commented 1 year ago

Darwin Core currently borrows one class term (dcterms:Location) and nine property terms from Dublin Core DCMI Metadata Vocabulary. I believe that Darwin Core was designed to be “Dublin Core-like” in its structure and operation. (@tucotuco can you confirm this?)

Yes, I can confirm this, but the design was mostly to have a well-entrenched model to work from and extend, not from any particular commitment.

tucotuco commented 1 year ago

The future will require standards to support reasoning and with that machine-actionability. Already now the question of hierarchies is acute, eg. in the Material Sample and LtC task groups. Thus, a best practice should pre-prepare DwC for structure and relationships based on meaning, logic and functions.

I agree, but should this be done piecemeal starting now, or should it be done as a comprehensive task?

ben-norton commented 1 year ago

  1. We established the following during the Latimer Core Review. The plan is to include this information as part of the standard (with a table of SKOS mappings) and release the general documentation separately (or some combination thereof). In brief, a term can be borrowed from another vocabulary if and only if it is a skos:exactMatch. Exact matches are the only transitive mapping type and therefore the only viable mapping type for borrowing terms (if x = y and y = z then x = z). https://www.w3.org/TR/skos-reference/#mapping This mapping is subject to a set of conditions. With regard specifically to the above, both terms must be of the same type (class vs property). Borrowed terms should always be prefixed with their source namespace. If the two terms are not exact matches, then a new term is created and a SKOS mapping relationship between the two terms is documented accordingly. The term can be broader or narrower, but the latter is more common given the narrower scope of TDWG standards and the broader scope of standards such as Dublin Core.
  2. Option 2.
  3. The format issue is an important one for transitive terms. If two terms are an exact match, but the borrowed term is snake case (the_name) and all TDWG terms are lower camel-cased, we have a problem. I'm still working through this issue, but here's what I can say. It is important to make sure the exactMatch relationship between two terms is machine-readable. To do so, the source naming convention most likely has to be preserved somewhere and the location must be machine-accessible. I think the opposite is true for the human-readable form of the term. Consistency is key. For a human-readable version of an exactMatch term, the target naming convention should be preserved and the names of borrowed terms that don't share the same convention should be converted to the target convention (camel cased for TDWG).
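
The borrowing rule in point 1 can be sketched in a few lines of Python. This is only an illustration of the logic as described in the comment, not the Latimer Core documentation itself: the function names `exact_match_closure` and `decide`, and the placeholder terms `ex:x`, `ex:y`, `ex:z`, are assumptions made up for the example.

```python
from itertools import product

EXACT = "skos:exactMatch"


def exact_match_closure(pairs):
    """Return the symmetric, transitive closure of a set of exactMatch pairs."""
    closure = set(pairs) | {(b, a) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and a != d and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def decide(mapping, source_kind, target_kind):
    """Borrow only on an exact match between terms of the same kind.

    Any other case means minting a new term and documenting the
    SKOS mapping (broader, narrower, close, ...) separately.
    """
    if mapping == EXACT and source_kind == target_kind:
        return "borrow"
    return "mint"


# Hypothetical placeholder terms: x = y and y = z implies x = z.
closure = exact_match_closure({("ex:x", "ex:y"), ("ex:y", "ex:z")})
print(("ex:x", "ex:z") in closure)
print(decide("skos:broadMatch", "class", "class"))
```

The closure step is what makes exact matches safe to chain across vocabularies; the other SKOS mapping properties are not declared transitive, which is why the sketch falls back to minting for them.
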

jbstatgen commented 1 year ago

@tucotuco

I agree, but should this be done piecemeal starting now, or should it be done as a comprehensive task?

That seems to be a good starting point for a dedicated TDWG strategy process and/or a task group.

@baskaufs asked for succinct answers and to refrain from discussions, a focus that I support. I also don't have the resources to go into thematic work at this point.

cboelling commented 1 year ago

If there is specific input of the TAG beyond what is documented in the TAG meeting notes linked above, please let the Material Sample TG know.

Regarding the technical questions above:

Re1: A term should not be imported if its meaning (as per its definition) is different from the intended meaning of the term to be added or if the equivalence in meaning cannot be established. It is not advisable to constrain the use of a term with broader meaning through scope notes which usually are only used as human-readable annotation. The intended meaning should coincide with the admissible use. Scope notes or usage comments should only be concerned with additional clarification and pragmatics of use.

Re2: I think that one should be ready to accept entailments that follow from statements about the original term made by the original creators (in the case at hand, the subclass relationships in BFO). Even if a term is imported in isolation and used as part of a flat terminology and in canonical serializations (DwC Text, DwC XML), the federated and cascading nature of data use in computational knowledge engineering means that statements involving the imported term and other DwC terms may well get combined with other statements in RDF graphs or datasets and become subject to entailment regimes. In that case, entailments ensuing from statements made elsewhere will follow, even though they are irrelevant to current uses of DwC-encoded data.
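
To make the Re2 point concrete, here is a minimal sketch of how a subclass entailment appears once a flat dataset is merged with statements made elsewhere. It uses plain Python tuples rather than an RDF library, and implements only the RDFS rule "if x is an instance of C and C is a subclass of D, then x is an instance of D". The record name `ex:specimen1` is hypothetical, and the BFO superclass chain shown is an assumption that should be checked against the published ontology.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_type_entailments(triples):
    """Return the rdf:type triples implied by rdfs:subClassOf statements."""
    triples = set(triples)
    inferred = set()
    changed = True
    while changed:
        changed = False
        known = triples | inferred
        for s, p, o in known:
            if p != RDF_TYPE:
                continue
            for c, p2, d in known:
                if p2 == SUBCLASS_OF and c == o:
                    new = (s, RDF_TYPE, d)
                    if new not in known and new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred


# A "flat" data statement that only uses the imported class term.
data = {("ex:specimen1", RDF_TYPE, "bfo:0000040")}

# Statements made elsewhere by the ontology's creators
# (assumed superclass chain, for illustration only).
ontology = {
    ("bfo:0000040", SUBCLASS_OF, "bfo:0000004"),
    ("bfo:0000004", SUBCLASS_OF, "bfo:0000002"),
}

for triple in sorted(rdfs_type_entailments(data | ontology)):
    print(triple)
```

The data publisher never asserted the extra types; they follow as soon as the two sets of statements sit in the same graph under an RDFS entailment regime.
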

Re3: I understand that the SDS does not rule out a setup in which the local name is opaque. The issue appears to be primarily a pragmatic one, because applications processing DwC data currently seem to rely on local names rather than labels. In many cases, developers of knowledge organisation schemes seem to build their resources with local names as part of IRIs intended to be used for "back-end" purposes, while the associated labels are intended for use in application interfaces. From this point of view, it may be advisable to develop a similar policy in TDWG, i.e. making clear that downstream applications cannot rely on local names to be informative. Existing IRIs and local names could continue to be used under such a policy, and they could exist alongside opaque local names.

baskaufs commented 1 year ago

@cboelling I requested discussion in the TAG slack, but thus far (morning before MS TG meeting) there has been little discussion other than some general discussion about using SKOS to do mappings. At the meeting, it was suggested that members with perspectives on past practices should attend the MS TG meeting where this will be discussed. I think that @tucotuco plans to attend. I can't make the first meeting due to a staff meeting, but I plan to attend the second one.

With respect to your Re2: comment. I agree that if one imports a term with entailments, we need to accept them because we don't know whether users may use that term in a situation where those entailments may be computed automatically. That's why the current TDWG standards governing term creation (SDS and VMS) specify that we should start with terms at the "bag of terms" level and add the semantics as layers on top of that. This approach didn't come out of a vacuum. There was a massive discussion involving the community broadly via the TDWG email list between 2009 and 2011 (summarized here) about RDF and Linked Data and its role in Darwin Core and in TDWG in general. The "bag of terms" + layers approach came out of that discussion and was incorporated as a consensus view into the SDS and the VMS when they were written. That consensus and TDWG adherence to that approach since then is the reason why we need to consider carefully whether we want to deviate from it in this proposal.

With respect to the Re3: comment: Yes, it is permissible to have opaque local names. That has generally been the practice with controlled vocabularies (see for example this where there are term names like dwcem:e002). It has NOT been the practice for vocabularies that define property and class terms. Those vocabularies are older and developers have (perhaps unwisely) built applications that depend on the local name parts of term names as both unique identifiers and labels for users (which they are not). There is already a ratified policy against this: the SDS says in section 3.3.3.1 "The term name is often related to the meaning of the term, but users MUST NOT attempt to understand the meaning of the term by interpreting its name. Rather, the term definition MUST be consulted." However, it's fine to put in the RFC 2119 key words "MUST" and "MUST NOT", but there isn't any enforcement mechanism. The reality is that people use them in that way and we have to deal with that, at least until the applications that have been built conform to this requirement.
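
As a rough illustration of the difference between relying on local names and consulting term metadata, the sketch below contrasts the two approaches. The `LABELS` table and the helper names are hypothetical; in practice the labels would come from the vocabulary's published term metadata, and the `dwc:MaterialEntity` entry stands in for the term proposed in this discussion.

```python
# Hypothetical label table; real applications would load this from
# the vocabulary's machine-readable term metadata.
LABELS = {
    "http://purl.obolibrary.org/obo/BFO_0000040": "material entity",
    "http://rs.tdwg.org/dwc/terms/MaterialEntity": "Material Entity",
}


def local_name(iri):
    """Naive approach: take whatever follows the last '/' or '#'."""
    return iri.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def display_label(iri, labels=LABELS):
    """Look the label up by IRI; fall back to the full IRI if unknown."""
    return labels.get(iri, iri)


bfo_iri = "http://purl.obolibrary.org/obo/BFO_0000040"
print(local_name(bfo_iri))     # opaque identifier, meaningless to a user
print(display_label(bfo_iri))  # label taken from the metadata table
```

An application written the second way keeps working whether a vocabulary uses camelCase or opaque local names, which is the behaviour the SDS wording asks for.
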

baskaufs commented 9 months ago

This was discussed at the 2023-09-11 TAG meeting and the TAG requested that this issue be taken up by the Mapping Task Group. @DavidFichtmueller indicated that it would be added to the tasks of the Mapping Task Group in its draft charter.

DavidFichtmueller commented 9 months ago

This was discussed at the 2023-11-09 TAG meeting

@baskaufs: You mixed up the date of the last meeting with yesterday's date. I thought for a second that I had missed the meeting yesterday and wondered why I had it in my calendar for Monday.

baskaufs commented 9 months ago

Thanks for catching that, David! I've fixed the error.