tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Change term - basisOfRecord #416

Closed tucotuco closed 1 year ago

tucotuco commented 1 year ago

Term change

Note: edited from original proposal to accommodate commentaries. The original proposal was to change the examples to use the English language values of the labels. The updated proposal is to comment on the recommendation to use controlled vocabularies. This is also consistent with the comments for the term MaterialCitation.

Current Term definition: https://dwc.tdwg.org/list/#dwc_basisOfRecord

Proposed attributes of the new term version (Please put actual changes to be implemented in bold and ~strikethrough~):

peterdesmet commented 1 year ago

Note that this change would affect processing of data by GBIF, OBIS and any downstream use. Many datasets would still use the old terms, so those systems would have to deal with both versions. In addition, it might be confusing to publishers that type values still use CamelCase values like StillImage. I fear the change might do more harm than good?

timrobertson100 commented 1 year ago

> Note that this change would affect processing of data by GBIF, OBIS and any downstream use.

This won't affect GBIF processing per se, as GBIF.org (and ALA) would already interpret those values to the concepts we use. However, I think it is unlikely GBIF would change the GBIF.org output (i.e. how we serialize the concepts) from the current API (example record) and the downloads as they have been stable for >10 years and a change would affect many for no real benefit. Since GBIF currently uses UPPER_SNAKE_CASE as a convention for controlled terms anyway, I don't think it should be a reason to block this change.

Personally, I'm indifferent to this change as consumers always need to handle variety in serializations, but like @peterdesmet I think we should aim for consistency within DwC for controlled terms (e.g. StillImage, widespreadInvasive) and ideally across TDWG standards.
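To illustrate the point that consumers need to handle variety in serializations, here is a minimal sketch of folding several spellings of a basisOfRecord value into one UPPER_SNAKE_CASE form. The function name `to_upper_snake` is hypothetical, and this is not GBIF's actual interpretation code; it only shows the kind of normalization a consumer might apply.

```python
import re


def to_upper_snake(value: str) -> str:
    """Fold a CamelCase value or a spaced label into UPPER_SNAKE_CASE.

    Hypothetical helper for illustration only; real aggregators use
    their own, more tolerant interpretation logic.
    """
    text = value.strip()
    # Insert an underscore at each lower-to-upper boundary (CamelCase).
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    # Treat runs of whitespace or hyphens as a single separator.
    text = re.sub(r"[\s\-]+", "_", text)
    return text.upper()


# The controlled value string and two label-like variants all fold
# to the same serialization.
variants = ["PreservedSpecimen", "Preserved specimen", "preserved_specimen"]
normalized = {to_upper_snake(v) for v in variants}
```

With these three inputs, `normalized` contains the single value `PRESERVED_SPECIMEN`, which is why a change in the recommended spelling need not break a consumer that already normalizes.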

peterdesmet commented 1 year ago

Thanks Tim, good to see that it likely won't break things. What remains is the inconsistency it introduces on how to write controlled terms.

baskaufs commented 1 year ago

I agree that this should be cleaned up to eliminate the inconsistency that John has pointed out. However, it makes more sense to me to change the notes to reflect actual practice rather than to change the examples, which most people have probably been treating as a recommendation. That would be less disruptive and serve the same purpose of eliminating the inconsistency, while also keeping the suggested values consistent with the emerging precedent of using CamelCase for controlled value strings.

In that case, I would change the Notes to the following:

Recommended best practice is to use the local name of one of the Darwin Core classes.

Please note that the DwC RDF Guide would also have to be changed because the normative Table 3.4 states the following about basis of record:

MUST be used only with literal value strings consisting of the local name component of Darwin Core class IRIs. Use rdf:type to refer to IRIs that describe the type of the resource.
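As a rough sketch of what "the local name component of Darwin Core class IRIs" means in practice, the snippet below takes the part of an IRI after the namespace and checks it against a list of recommended values. The helper names `local_name` and `is_recommended` are hypothetical, and the set of class names is an assumption for illustration; consult the current term definition for the authoritative list.

```python
from urllib.parse import urlsplit

# Assumed set of Darwin Core class local names, for illustration only.
RECOMMENDED_LOCAL_NAMES = {
    "PreservedSpecimen",
    "FossilSpecimen",
    "LivingSpecimen",
    "MaterialSample",
    "MaterialCitation",
    "HumanObservation",
    "MachineObservation",
}


def local_name(iri: str) -> str:
    """Return the part of an IRI that follows the namespace.

    Handles both hash namespaces (…#Name) and slash namespaces (…/Name).
    """
    parts = urlsplit(iri)
    if parts.fragment:
        return parts.fragment
    return parts.path.rstrip("/").rsplit("/", 1)[-1]


def is_recommended(value: str) -> bool:
    """Check a literal basisOfRecord value against the assumed list."""
    return value in RECOMMENDED_LOCAL_NAMES


class_iri = "http://rs.tdwg.org/dwc/terms/PreservedSpecimen"
value = local_name(class_iri)  # the literal string to put in basisOfRecord
```

The check is deliberately exact and case-sensitive: a label such as "Preserved specimen" is not a controlled value string and would be rejected.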

Some additional more detailed comments:

  1. I used the term "local name" as I think it's the best thing to call the part of the IRI following the namespace. That's consistent with the XML standard and usage in section 3.3.3.1 of the SDS. It differs slightly from the CURIE syntax document, which uses "local part".
  2. I feel like making the change John has suggested is somewhat of a move backwards, since we've really been pushing people to understand the difference between a controlled value string and a label. With this change, we'd be taking a situation where people have been using very specific controlled value strings and telling them to use the label. That opens up a whole can of worms, because labels come in many languages and have potentially many more variant forms (spacing, capitalization) than a very rule-based system like UpperCamelCase.
  3. One option here would be to formally add a "Controlled value" field for the class terms. Like dc:type, dwc:basisOfRecord is a bit of an oddball because the controlled values are expected to be classes and not SKOS concepts like most other terms that use controlled values. I think that might be overkill if we just enumerate the recommended values as they are and could actually confuse people if they saw that field in the term definitions outside of the context of the class terms' use as values for dwc:basisOfRecord.

tucotuco commented 1 year ago

I second the motion to not change the examples and instead change the recommendations to be explicit about using controlled values from the vocabulary rather than labels wherever possible. In and before the 2017-10-06 version of Darwin Core, the labels were the same as the local name. These were changed (https://github.com/tdwg/dwc/issues/253) when Darwin Core was made to comply with the Standards Documentation Specification.

Jegelewicz commented 1 year ago

> because the controlled values are expected to be classes and not SKOS concepts like most other terms that use controlled values

And if the MaterialSample Task Group recommends deprecation of one or more of those terms - what happens?

dr-shorthair commented 1 year ago

> the controlled values are expected to be classes and not SKOS concepts like most other terms that use controlled values.

I don't think you should read too much into implementation choices like skos:Concept vs class. In my experience skos:Concept is often a 'view' of what other people might model as a class.

And in other cases, instances of skos:Collection appear where other modelers might have put a class. (The extension of a class is its members; a skos:Collection has members; ergo they behave like classes.)

It is also relevant that set-based formalisms like OWL/RDFS typically only handle one application of viewpoint and are not very nimble when it comes to more than two meta-levels, so it is unsurprising to find ourselves getting tangled up here.

baskaufs commented 1 year ago

> > because the controlled values are expected to be classes and not SKOS concepts like most other terms that use controlled values

> And if the MaterialSample Task Group recommends deprecation of one or more of those terms - what happens?

Well, term deprecation has been (thankfully) rare in TDWG vocabularies. We've tried to avoid it because it's disruptive to existing implementations. If a term is deprecated, we would recommend to people what they should use instead. But there is no way to enforce that, and there would be a lot of legacy data that would still use the old term. So applications would have to continue to support the older (deprecated) terms for some time.

baskaufs commented 1 year ago

> > the controlled values are expected to be classes and not SKOS concepts like most other terms that use controlled values.

> I don't think you should read too much into implementation choices like skos:Concept vs class. In my experience skos:Concept is often a 'view' of what other people might model as a class.

> And in other cases, instances of skos:Collection appear where other modelers might have put a class. (The extension of a class is its members; a skos:Collection has members; ergo they behave like classes.)

> It is also relevant that set-based formalisms like OWL/RDFS typically only handle one application of viewpoint and are not very nimble when it comes to more than two meta-levels, so it is unsurprising to find ourselves getting tangled up here.

Yes, I understand that it's not uncommon for some people to model things as a class and others to model the same things as concepts. However, we've also tried to keep things as simple and consistent as possible within TDWG vocabularies to avoid confusing users and to stick to patterns that are predictable. Thus far we've managed to maintain a distinction between classes, properties, and concepts in our terms. There may be a point in the future where we would give that up because it became necessary (e.g. if we really needed to use SKOS relationship properties with classes) but we've avoided it so far.

tucotuco commented 1 year ago

This proposal has been updated to reflect the apparent consensus in the commentaries thus far with a note on the change from the original proposal.

jbstatgen commented 1 year ago

+1 for the proposal.

Is the reason that it is "basisOfRecord" and not "BasisOfRecord" the above discussion about classes versus SKOS concepts?

Since vocabularies are mentioned, a question about one of the terms in the examples: Is MaterialSample always non-biological? That is, a water, rock, meteorite sample, etc. Differentiated from a wolf in a zoo (LivingSpecimen), a pressed plant on a sheet, a dried clam shell in a drawer (both PreservedSpecimens as whole organisms or parts of them), a fossilized tree trunk (FossilSpecimen).

I'm compiling terms/entries used and suggested in different places and by different groups as work in progress towards the development of vocabularies, and I added the terms from basisOfRecord. Already present is non-biological, and I wonder if I can set it as equal to MaterialSample.

tucotuco commented 1 year ago

@jbstatgen MaterialSample was actually added to cover biological samples for DNA extractions and the like. The primary motivation for it was to convey the sampling aspect and allow derivation chains to be shared. So no, MaterialSample is definitely not non-biological.

The term name is basisOfRecord following the Darwin Core convention for naming a property rather than a class, which would be UpperCamelCase (e.g., MaterialSample).
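The naming convention described here (lowerCamelCase for a property, UpperCamelCase for a class) can be sketched as a simple check on the first character of a local name. The function `naming_style` is hypothetical, and the check is only a heuristic about spelling: it says nothing about whether a term actually exists, and controlled values such as widespreadInvasive also start with a lowercase letter.

```python
import re

# Patterns for the two spelling styles discussed in the thread.
UPPER_CAMEL = re.compile(r"^[A-Z][A-Za-z0-9]*$")
LOWER_CAMEL = re.compile(r"^[a-z][A-Za-z0-9]*$")


def naming_style(local_name: str) -> str:
    """Classify a local name by spelling convention only.

    Returns "class-style", "property-style", or "other". This does not
    look anything up in a vocabulary.
    """
    if UPPER_CAMEL.match(local_name):
        return "class-style"
    if LOWER_CAMEL.match(local_name):
        return "property-style"
    return "other"


styles = {name: naming_style(name) for name in ("basisOfRecord", "MaterialSample")}
```

Here `basisOfRecord` is classified as property-style and `MaterialSample` as class-style, matching the explanation above.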

jbstatgen commented 1 year ago

@tucotuco Thanks a lot for the clarifications. MaterialSample and non-biological have found their independent places in the vocabularies. dwc:basisOfRecord is a property in the dwc:Record class.

tucotuco commented 1 year ago

@jbstatgen I am curious about this reference to dwc:Record. No such class exists in the Darwin Core namespace.

Jegelewicz commented 1 year ago

I assume that @jbstatgen is talking about https://dwc.tdwg.org/terms/#record-level

tucotuco commented 1 year ago

I would expect that too, but in any formal setting, it is important that dwc:Record does not exist.

jbstatgen commented 1 year ago

Thanks @Jegelewicz for pointing to the correct place.

I had decided to go for a round of sport instead of replying directly. Things are never that straightforward, are they.

Standard-wise and interestingly, there doesn't seem to be a class RecordLevel, just some category and a bunch of freely floating-around properties. Aha, I thought, DwC is a "bag of terms"; there are no classes. Scrolling further down the Darwin Core quick reference guide quickly(!) showed that the "bag of terms" has some structure to it. Certainly, there will be [esoteric, practical, historical, ...] reasons for the class dwc:RecordLevel not to exist. - And maybe I'm still completely wrong.

@tucotuco I guess if our positions had been reversed and I had been in your position and answered, it would have been in the way of @Jegelewicz, pointing to the correct category/term. Potentially, I might even have given some context, background, and maybe explained that I had made that or a similar mistake myself in the past, too. In this case here, I likely would only have asked if you didn't mean the "record level" category.

Instead, based on the given comment, I was wondering if I should feel that I had been outed as incompetent, sloppy and lazy. Heartily slapped and educated about my place in the world, I should better hide in some dark corner and not speak up. How did I ever dare to go eye-level with the big boys and their 20+ years of solid expertise.

The detour to the fitness studio was due to me wondering if thinking about 'chromosomes with significantly reduced recombination rates and their direct and indirect consequences' was a socially acceptable enough (well, hopefully more mature type of) defense mechanism.

Maybe a second round of training would have been better ....

Having attended the meetings of the CBD bodies, plus information webinars etc. over the past two years, I came away from COP15 with a couple of observations that were directly to more indirectly connected to collections and data infrastructure building, and suggested some general structural problems.

The hypothesis that has formed from these observations and experiences over the past months is that infrastructures are about people and their social relationships. Technical infrastructures seem to be always only as functional as the social communities that build them. They likely show similar characteristics in certain ways, and inform about the social environment in which they were built and are situated.

I enjoy the technical aspects of standards. Yet, it is the social interactions that we are able to build that really interest me and that I find of core importance.

tucotuco commented 1 year ago

I apologize for the way I responded given that it made you feel the way it did. I had no such intention. I was simply trying to be as clear and concise as I could with a point of information.

jbstatgen commented 1 year ago

John, I can understand your approach, even if it didn't work for me in this situation. I'm not a full-time standard/ontology developer, and rather new to formal standard development. All of what I do here I have to do on the side with deadlines piling up elsewhere. Sometimes I don't have the leisure to dive into the background to triple check that I avoid mistakes or to correct mistakes. In such situations, when mistakes nevertheless require me to dive into the background because they have to be corrected, mistakes become very costly. When such situations accumulate, I won't be able to continue to contribute. At the same time, mistakes are how we learn and new solutions can emerge. Thus, if they become too costly for too many members of the community, development over time will stall and old, potentially outdated structures can't be overhauled and renewed. This is a general observation and does not necessarily say something about Darwin Core's structure; I simply don't have the experience and overview for an assessment. Still, see below.

In the overall context I, and likely we, find ourselves in, I appreciate it when small edits come with sufficient information to quickly correct and update. That will allow me and everybody to focus on the larger tasks that are on the table.

One of those larger tasks is that it seems to me that most of our discussions in the Material Sample group come back again and again to questions of structure and meaning/reasoning. Maybe it is time to consciously start to explore what a semantic layer will require and what such development will mean and need.

We might find out that as TDWG/Darwin Core community we have already quite a lot of building blocks in place and are much further along than we think. Or an assessment will show that without a dedicated group of sufficiently diverse full-time developers it's just not feasible. Likely, reality will be a mix of both.

tucotuco commented 1 year ago

> One of those larger tasks is that it seems to me that most of our discussions in the Material Sample group come back again and again to questions of structure and meaning/reasoning. Maybe it is time to consciously start to explore what a semantic layer will require and what such development will mean and need.

> We might find out that as TDWG/Darwin Core community we have already quite a lot of building blocks in place and are much further along than we think. Or an assessment will show that without a dedicated group of sufficiently diverse full-time developers it's just not feasible. Likely, reality will be a mix of both.

Amidst the 2021 Darwin Core public review the related open issue https://github.com/tdwg/dwc/issues/302 was labeled as requiring a Task Group. Early on in the Material Sample Task Group we talked about scope and whether an attempt should be made to try to solve both at the same time. I think the prospect was deemed too large at the time.

Since July 2021 I have been working on the GBIF Unified Model through an open process based on real-world use cases that are targeted at solving current challenges to data-sharing, aggregation and use. From the document describing that project:

"With successful prototypical implementations covering a subset of carefully-reviewed use cases, we are confident that there will be sufficient accumulated practical experience for the Darwin Core Maintenance Group to form a Task Group to foster changes in the Darwin Core standard in support of putting this work into practice backed by a standards framework. In addition, we hope that the Unified Model emerging from expert public review and implementation of a broad range of use cases can contribute useful experience to inform the basis of a TDWG-wide semantic model."

Work on that project continues. With each additional use case, the model gets tested and refined to meet the new challenges presented. Though many parts of the model seem to be solid under testing, other parts seem to have many different possible solutions and it isn't yet clear if there is one best way. I think the Unified Model work has an immense amount to offer toward a "TDWG-wide semantic model". Some of it, such as eventType and MaterialEntity, could be immediately useful for Darwin Core as well. As for this Task Group, I think it comes down to the question of scope again. So far, every time we have addressed the question of semantics, the answer has been to stay focused so as not to derail the work that can be immediately useful.