tdwg / vocab

Vocabulary Maintenance Specification Task Group + SDS + VMS

DwC class hierarchy #26

Closed ramorrismorris closed 8 years ago

ramorrismorris commented 9 years ago

What if anything is permitted or prohibited by the current DwC about the DwC class hierarchy ?

On 8 Aug 2015, Bob Morris raised this on the TDWG-tag mailing list and this issue represents an attempt to move the discussion here.

(Bob Morris posting to tdwg-tag): The current DwC Terms [1] (carrying Identifier http://rs.tdwg.org/dwc/2015-03-19/terms/ and Date Modified: 2015-06-02) is confusing (confused? silent?) about the relation of the "sister classes" to the terms intended (?) to be used therewith. For example, dwc:Event, dwc:MachineObservation, and dwc:HumanObservation could reasonably (?) all have dwc:eventDate applied to them. But http://rs.tdwg.org/dwc/terms/index.htm#eventDate suggests Class=dwc:Event.

Now it's human-clear that for each of the "multiple" boldface lines in the index, the second and subsequent terms are meant to be special cases of the leftmost one. What I can't see is whether [1] intends to encourage (require?) this in some explicit way, and where that explicit way is to be found. (I had a dream that the DwC RDF Guide might take a position....)

Thanks.

--Bob

p.s. I concede that some answers might lurk in the ongoing move of resources to http://tdwg.github.io/dwc/terms/. My second dream was that tdwg.github.io does Content Negotiation and curl would rescue me....

[1] http://rs.tdwg.org/dwc/terms/index.htm

ramorrismorris commented 9 years ago

@baskaufs responded in tdwg-tag mailing list thus:

I think that this has been left intentionally vague because at this point we don't have well-defined relationships among Darwin Core classes. It seems to me that placing the class terms on the same line is a clue that they are somehow related, but it isn't apparent to me that the subsequent terms are always special cases of the left-most terms. I can provide an example where a LivingSpecimen isn't a MaterialSample because it was never collected. I can also imagine Event instances which include many HumanObservations (i.e. Event serves to group observations, not to serve as a superclass for observations).

There have been various attempts to lay out how the Darwin Core classes are related to each other. But I'm not aware that there has ever been a consensus on it. That's why the RDF Guide didn't touch the issue. We were afraid that the guide would never be finished if we took up that subject.

I think it would be an excellent exercise to try to lay out how Darwin Core classes are related to each other. But first, I would suggest that we lay out the use cases that we intend to satisfy by nailing down those relationships, then show how establishing those relationships helps us. For example, I could suggest that we establish that dwc:HumanObservation rdfs:subClassOf dwc:Event. But what would I gain by doing that? What would it prevent me from doing?

Steve

ramorrismorris commented 9 years ago

@richpyle responded in tdwg-tag: It sounds like perhaps now is the time to focus on the ontological relationships among classes, as the next major focus of DwC advancement.

I would NOT regard "HumanObservation" as a subclass of "Event". I believe that "Event" should be kept clean as fundamentally an intersection between a Location instance and a point in time. After much thinking and testing on this, we've finally come to the conclusion in our data models that "Location" is defined by two of the four space-time dimensions (X & Y; effectively represented as geocoordinates, whether as a point, point/radius, track, polygon, etc.), and "Event" is defined by one Location instance plus the other two space-time dimensions (Z & T; effectively represented as elevation, depth and date/time). I'd be happy to explain why we came to this conclusion, but that's another thread.
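As a rough illustration of the split described above, here is a minimal sketch in Python. The class and field names are hypothetical (they are not DwC terms or anyone's actual schema), the coordinate values are made up, and only the point/radius form of a Location is shown; tracks and polygons are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """X and Y only: a point with an optional uncertainty radius (hypothetical fields)."""
    latitude: float
    longitude: float
    uncertainty_m: Optional[float] = None


@dataclass(frozen=True)
class Event:
    """One Location instance plus the Z and T dimensions."""
    location: Location
    min_elevation_m: Optional[float] = None
    max_elevation_m: Optional[float] = None
    min_depth_m: Optional[float] = None
    max_depth_m: Optional[float] = None
    date_time: Optional[str] = None  # ISO 8601 date, date-time or interval


# Illustrative values only
reef = Location(latitude=21.3, longitude=-157.8, uncertainty_m=50.0)
dive = Event(location=reef, min_depth_m=90.0, max_depth_m=110.0, date_time="2015-08-08")

# Two Events at different depths or times can share the same Location instance
second_dive = Event(location=reef, min_depth_m=10.0, max_depth_m=20.0, date_time="2015-08-09")
print(dive.location == second_dive.location)
```

Under this reading an Event carries no observer or organism information at all; it is only the space-time address that something else points to.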

The point is that I think "Event" should remain as an abstract four-dimensional address, created as an instance to capture space-time information for something else. In DwC, that "something else" is an Occurrence. Within the confines of existing DwC, "HumanObservation" comes closest to being a subclass of Occurrence. However, there is still one missing class that I believe we need to complete the core ontology space of DwC -- which is what we refer to as "Evidence", and Darwin-SW refers to as "Token" (https://code.google.com/p/darwin-sw/). Our model draws the lines slightly differently from the diagram for D-SW, but in general they represent a convergence of thinking on the relationships among DwC classes. In answer to Steve's question:

dwc:HumanObservation rdfs:subClassOf dwc:[Occurrence]

But what would I gain by doing that? What would it prevent me from doing?

I'm not technically savvy enough to answer that question from an implementation perspective; but from a DwC comprehension perspective, it moves us a step closer to mutual understanding of how to transform DwC content into a functional data model. We all kinda/sorta know that already, but as evidenced by the different perspectives of "HumanObservation as a subclass of Event" vs. "HumanObservation as a subclass of Occurrence" just now revealed & expressed, it probably wouldn't hurt to be more explicit about these sorts of things in DwC documentation.

ramorrismorris commented 9 years ago

In off-list email, I responded to Steve: I think your examples are permitted by the current DwC. For example, you wrote:
"I can provide an example where a LivingSpecimen isn't a MaterialSample because it was never collected." But where is it required that a MaterialSample must in some way be associated with a collection event? An implementer of an ontology purporting to meet DwC may or may not add an axiom to that effect, and neither adding nor omitting such an axiom would seem to contradict the current DwC. Put another way, I would say that your examples are subject to exactly the question I raised about avowing the subclassing in the boldface lists. It would seem that I can avow the subclassing and omit avowing the axioms lurking in your examples, and still remain conformant. My initial conundrum was that for (things like) dwc:eventDate, DwC is(?) presently silent on whether it can be used with dwc:HumanObservation. Well, of course it can due to the OWA. And yet, the DwC web document would seem to suggest that an assignment of dwc:eventDate to a dwc:Event must not lead to a contradiction, but that there is not a corresponding requirement for an assignment of dwc:eventDate to a dwc:HumanObservation.

This all arises in the context of the Plazi treatment.owl ontology I'm working on with Terry Catapano; for its OWL model of DwC I intend to follow Section 2.7.4 of the DwC RDF Guide. Some of the issues will disappear, but I still need to strive for conformance to DwC.
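The open-world point about dwc:eventDate can be sketched in a few lines of Python. This is only an illustration: triples are plain tuples, `apply_domain` is a hypothetical helper implementing the single RDFS rule for rdfs:domain, and the domain declaration used in the second call is an assumption made for the example, not something this thread says DwC declares.

```python
def apply_domain(triples, domains):
    """RDFS domain rule: if property p has rdfs:domain C, any subject using p is typed as C."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, "rdf:type", domains[p]))
    return inferred


data = {
    ("<obs1>", "rdf:type", "dwc:HumanObservation"),
    ("<obs1>", "dwc:eventDate", "2015-08-08"),
}

# No domain declared: nothing new follows, and nothing is contradicted.
no_domain = apply_domain(data, {})
print(no_domain == data)

# Hypothetical declaration "dwc:eventDate rdfs:domain dwc:Event":
# <obs1> is additionally typed as dwc:Event; it is still not a contradiction.
hypothetical_domains = {"dwc:eventDate": "dwc:Event"}
with_domain = apply_domain(data, hypothetical_domains)
print(("<obs1>", "rdf:type", "dwc:Event") in with_domain)
```

Even with such a domain declaration, a reasoner would only report an inconsistency if some further axiom (for example, declaring the two classes disjoint) ruled out the extra typing.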

In an off-list response Steve responded: "I should have said that it was never the result of a sampling event (which in my head I was making synonymous with collecting - my bad)." and more importantly: "I hope I was clear in what I wrote that I wasn't suggesting that subclassing relationships should be declared between any terms. My point was that when people have made such suggestions in the past, they have not given any good reasons for it, nor explained how the semantics that they want to impose on the terms would be used to meet any particular use cases."

baskaufs commented 9 years ago

From an email response to the TDWG-TAG email list: Just to further elaborate on the example: If we assert:

dwc:HumanObservation rdfs:subClassOf dwc:Event.

and then someone stated:

<birdObservation1> rdf:type dwc:HumanObservation.

I think that would entail:

<birdObservation1> rdf:type dwc:Event.

because of the semantics of rdfs:subClassOf

On the other hand, if we assert something like

dwc:HumanObservation skos:narrower dwc:Event.

that would NOT entail

<birdObservation1> rdf:type dwc:Event.

but instead would entail different stuff like:

dwc:Event skos:broader dwc:HumanObservation.
dwc:HumanObservation skos:narrowerTransitive dwc:Event.

etc.

None of these entailments are intrinsically good or bad. But if we make assertions like

dwc:HumanObservation rdfs:subClassOf dwc:Event.

or

dwc:HumanObservation skos:narrower dwc:Event.

we must be aware that a machine could reason the entailed relationships, and should only make those assertions if we want a machine to be able to do those kinds of reasoning. In other words, we should only make assertions with semantic implications to accomplish some purpose related to machine reasoning, and not just because it seems like it might be a good idea. If our purpose is just to make things more clear to a human, then providing a better human-readable definition would be a better way to accomplish that.
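The difference between the two sets of entailments can be checked mechanically. The following is a minimal sketch, not a real reasoner: triples are plain tuples, `entail` is a hypothetical helper, and it applies only three rules (type propagation along rdfs:subClassOf, skos:broader as the inverse of skos:narrower, and skos:narrowerTransitive as a super-property of skos:narrower).

```python
def entail(triples):
    """Return the closure of `triples` under three rules only (see text above)."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p == "rdf:type":
                # An instance of a subclass is also an instance of the superclass
                for c, q, d in closure:
                    if q == "rdfs:subClassOf" and c == o:
                        new.add((s, "rdf:type", d))
            elif p == "skos:narrower":
                new.add((o, "skos:broader", s))            # inverse property
                new.add((s, "skos:narrowerTransitive", o))  # super-property
        if new <= closure:
            return closure
        closure |= new


observation = ("<birdObservation1>", "rdf:type", "dwc:HumanObservation")

with_subclass = entail({observation, ("dwc:HumanObservation", "rdfs:subClassOf", "dwc:Event")})
with_skos = entail({observation, ("dwc:HumanObservation", "skos:narrower", "dwc:Event")})

event_typing = ("<birdObservation1>", "rdf:type", "dwc:Event")
print(event_typing in with_subclass)  # True
print(event_typing in with_skos)      # False
```

Note that in SKOS, `A skos:narrower B` says that B is the narrower concept, which is why the inverse comes out as `dwc:Event skos:broader dwc:HumanObservation`.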

Steve

deepreef commented 9 years ago

From an email post to the TDWG-TAG email list, pmurray responded to Richard Pyle:

On 9/08/2015 3:21 am, Richard Pyle wrote:

I would NOT regard "HumanObservation" as a subclass of "Event". I believe that "Event" should be kept clean as fundamentally an intersection between a Location instance and a point in time.

In which case, you would need a separate class for a thing that happens at a time, but which is not located (and perhaps, as the philosophers say, does not have an extension).

And this in a way is a reply to Bob's original question - why aren't these relationships explicit? The reason is that the second you try to make them so, you almost immediately start running into philosophical conundrums that people have been debating for thousands of years. The old "How many angels can dance on the head of a pin?" problem is an exercise in distinguishing between things that are located and things that are extended.

Something as abstract and basic as "thing that happens at a place and a time" should be borrowed from someone else's vocabulary. The first problem there is that if you do that, then if that other vocabulary defines inference rules, then anyone that uses your vocabulary must respect those rules or their ontology becomes inconsistent.

Another problem is that other people's vocabularies are never really quite exactly what you need.

And the underlying problem is that something as simple as "a point in time" is actually a really hard question. What about things that have a duration, that happen over a couple of weeks? Things that are cyclic? Points in time that are uncertain or vaguely defined? What about the distinction between something that happens over a two week period, and something that is instantaneous but that is known to have happened over a certain two week period; or that is known to have happened at least once over that period; or that almost certainly will happen at some period in the future (eg, a scheduled observation)? The second you try to nail this stuff down, it immediately starts sprouting hair. It's intractable and perhaps not the kind of thing that biologists can best spend their time doing.

Here at biodiversity.org.au, our new APNI dataset is exposed using a custom-made vocabulary (aside from things like dc:title) without much of an attempt to describe our objects in terms of well-known classes. Maybe at some point in the future we might be able to pull the TDWG, DwC, SKOS terms into the data set. But that can't be done if those terms are so strictly defined that our data in fact does not meet the strict definitions of those terms. As it stands, our ontology does not connect to other ontologies by way of the vocabulary in which it is described.

So is there any hope at all that we can create a distributed semantic web of facts relating to taxonomy (the output of the taxonomic work that people are doing)?

I think so, because the biologists and taxonomists are working on the same stuff, certainly the same kinds of stuff, within the constraints of the Real World™. And this maybe is a clue about where to look for useful vocabulary. Rather than attempting to solve age-old questions about the nature of time and thing, look to the specific subject matter.

After all - why have an 'event' class at all? The only thing you can do with such a class is to construct a query that asks, for instance, "tell me about everything whose foaf:person is Dr Joe Bloggs and that is a thing that happened". On the other hand, "specimen" in the strict sense of "something in a collection with an accession number" is very important, fairly specific to the subject matter, and entirely worth having a common term for.

Likewise, 'location' may mean 'geographic polygon', or it may conceivably mean a collection, or an institution. (How so? Because a location is anything that might be given in reply to the question 'where is X?'. Notice that this is language-dependent). Each of these three things has an existing vocabulary defined by geographers, librarians, and (I suppose) company registrars respectively. Does anyone really need a higher layer linking them together?

Enough rambling, I think. FWIW:

deepreef commented 9 years ago

From an email post to the TDWG-TAG email list, Richard Pyle responded to pmurray:

A few quick comments (all transcribed to the GitHub site):

In which case, you would need a separate class for a thing that happens at a time, but which is not located (and perhaps, as the philosophers say, does not have an extension).

Not necessarily; at least not from the perspective of modelling biodiversity data. If you really wanted to capture a load of metadata about the "time" aspect (which one could actually do), I could see justification in establishing a class of object for the "Time" component, in the same way that we have a class for "Location". But fundamentally, we're just talking about coordinates for four-dimensional space-time; so really "Location" and "Time" might best be wrapped into the same "Space-Time" class (i.e., the "Where/When" class), and the Event would then become something along the lines of a "Who/What" class. But at some point, perfecting the conceptual data model represents an impediment to practical progress.

And this in a way is a reply to Bob's original question - why aren't these relationships explicit? The reason is that the second you try to make them so, you almost immediately start running into philosophical conundrums that people have been debating for thousands of years. The old "How many angels can dance on the head of a pin?" problem is an exercise in distinguishing between things that are located and things that are extended.

Agreed! The art of all this is in balancing the "Normalize until it hurts" exercise with the "De-Normalize until it works" reality. There's no objectively correct answer. Just a cloud of possible options that we're gradually trying to collectively sharpen down.

Something as abstract and basic as "thing that happens at a place and a time" should be borrowed from someone else's vocabulary. The first problem there is that if you do that, then if that other vocabulary defines inference rules, then anyone that uses your vocabulary must respect those rules or their ontology becomes inconsistent.

Another problem is that other people's vocabularies are never really quite exactly what you need.

DEFINITELY agreed!!!!

And the underlying problem is that something as simple as "a point in time" is actually a really hard question. What about things that have a duration, that happen over a couple of weeks? Things that are cyclic?

Indeed! And it was a poor choice of words on my side to use the word "point". "Time" should be regarded in the same way that we regard the other three dimensions. That is, either in the form of an arbitrarily precise point with an explicitly stated error (as we do in DwC for geocoordinates), or in an explicit min/max range (as we do in DwC for things like elevation and depth), or in a way that handles real-world data (actual ranges, ranges representing imprecision/uncertainty, and multiple points within a scope bound by min and max values, etc.) We could deal with this stuff by treating "time" as a class, but in my experience (and yes, I actually did go down that road many years ago), the payoff ain't worth the investment.

So is there any hope at all that we can create a distributed semantic web of facts relating to taxonomy (the output of the taxonomic work that people are doing)?

Probably not. At least not in my lifetime (or career-time). But one can always hope! :-)

I think so, because the biologists and taxonomists are working on the same stuff, certainly the same kinds of stuff, within the constraints of the Real World™. And this maybe is a clue about where to look for useful vocabulary. Rather than attempting to solve age-old questions about the nature of time and thing, look to the specific subject matter.

Yeah....but.... we can't even get past the tired old arguments about "what is a species", and the difference between nomenclature and taxonomy (names and concepts), and even the word "name" has a staggering heterogeny of meaning even within the confines of biological nomenclature/taxonomy. It baffles me that the same arguments that were going on when Taxacom/TDWG were born continue almost unabated today. But, like I said.... one can always hope! :-)

After all - why have an 'event' class at all? The only thing you can do with such a class is to construct a query that asks, for instance, "tell me about everything whose foaf:person is Dr Joe Bloggs and that is a thing that happened". On the other hand, "specimen" in the strict sense of "something in a collection with an accession number" is very important, fairly specific to the subject matter, and entirely worth having a common term for.

We've debated this one back and forth as well. But it always ends up in the same place. That is, all of these arguments ultimately end up with the same conclusion: everything should be reduced to a triple-store. Indeed, life would be so easy if that were actually practical. At this stage of human digestion of biodiversity data, software availability and consumer fluency, and a number of other Real World™ issues, however, it does not seem to be the path of least resistance towards progress.

Likewise, 'location' may mean 'geographic polygon', or it may conceivably mean a collection, or an institution. (How so? Because a location is anything that might be given in reply to the question 'where is X?'. Notice that this is language-dependent). Each of these three things has an existing vocabulary defined by geographers, librarians, and (I suppose) company registrars respectively. Does anyone really need a higher layer linking them together?

And this is trivial compared to the analogous issues surrounding the term "taxon" (or "taxon name", or "taxon concept", etc.) Oi vey!

Enough rambling, I think. FWIW:

Likewise.

Aloha, Rich

baskaufs commented 8 years ago

OK, I've re-read this issue and I don't think that it is one that this task group is ever going to solve.

There are two aspects of this issue that have been captured by other work of the group so far. The draft Documentation Specification describes a structure for vocabularies that allows them to be built up from "term lists". This was designed to handle the existing Darwin Core structure where the text-based DwC vocabulary is built from DwC-defined terms plus borrowed Dublin Core terms, and the RDF DwC vocabulary is built from those two plus the DwC IRI terms. Such a structure would allow "semantic layers" to be added on top of that for those who are keen to link DwC terms to ontologies outside of TDWG, or who want to create the kinds of semantics that have been discussed above (subclassing, term ranges, disjoint classes, etc.). So I think the answer from the Documentation Spec would be "build your semantic layer, overlay it on the more basic layers, and see how it works". This kind of layered approach has been suggested at least several times before in tdwg-content email discussions.
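To make the layering concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not the structure defined by the draft Documentation Specification: the set names and the `overlay_is_grounded` helper are hypothetical, each term list shows only a tiny subset of terms, and the overlay holds a single subclass axiom taken from the discussion above.

```python
# Term lists (tiny illustrative subsets)
DWC_TERMS = {"dwc:Event", "dwc:Occurrence", "dwc:HumanObservation", "dwc:eventDate"}
DC_TERMS = {"dcterms:modified", "dcterms:license"}
DWC_IRI_TERMS = {"dwciri:recordedBy"}

# Vocabularies are built up from term lists
TEXT_VOCABULARY = DWC_TERMS | DC_TERMS
RDF_VOCABULARY = TEXT_VOCABULARY | DWC_IRI_TERMS

# An optional semantic layer, kept separate from the basic layers
OVERLAY = {("dwc:HumanObservation", "rdfs:subClassOf", "dwc:Occurrence")}


def overlay_is_grounded(overlay, vocabulary):
    """True if every subject of the overlay is a term of the underlying vocabulary."""
    return all(s in vocabulary for s, _, _ in overlay)


print(overlay_is_grounded(OVERLAY, RDF_VOCABULARY))
```

Only the subjects are checked here, so an overlay could still point at classes in ontologies outside of TDWG; users who do not want the extra semantics simply never load the overlay.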

The other question is captured in Issue #23. The IETF requires that there are two successful, independent interoperating implementations before a Proposed Standard becomes an Internet Standard. I think that a similar requirement should be considered for proposed semantic overlays for the basic TDWG vocabularies. What are they supposed to accomplish (use cases)? Do they work (satisfy the use cases)? Does anybody actually need them (successful implementations by more than one independent group)? There could be zero to many semantic overlays on a vocabulary, but there is no point in including them in the standard if they don't do anything useful or if there aren't at least two people who say they need it (something like John Wieczorek's criterion for moving forward on DwC term addition suggestions).

I'm going to close this issue because I don't think it's going to go anywhere that will help the task group advance, and the issue of requiring successful implementations is captured in issue #23.