tdwg / tnc

Taxonomic Names and Concepts Interest Group

The need for "intersects" as a TNU relationship type in addition to the five RCC-5 types #45

Closed camwebb closed 4 years ago

camwebb commented 4 years ago

TaxonNameUsages (cf. TaxonConcepts) are classes or sets, and can be related to each other in the standard set relationship ways: 1. congruent with, 2. superset of, 3. subset of, 4. overlaps, and 5. does not overlap (Franz and Peet 2009). In the context of Taxon Concepts, these have been referred to as RCC-5 articulations (Franz et al. 2015). Note that the term 'overlaps' implies that there exist both members of set A that are not in set B and vice versa.

When encoding real world cases of the relationship between TNUs, I have often found that there is insufficient information in a taxonomic publication to determine the exact RCC-5 term to use. For example, an author may simply say that a specimen that is now named as TNU2 was previously named as TNU1, without explicitly stating that, e.g., all members of TNU1 are now called TNU2, with other additional members (which would be TNU2 > TNU1). All we know for certain in these cases is that there is at least one member in common between TNU1 and TNU2, i.e., not(TNU1 | TNU2). But we do not yet have a term for not(TNU1 | TNU2).

I propose that we need a new term intersects to meet the practical goal of recording information about TNU relationships. The property intersects is a superproperty of congruent with, superset of, subset of, and overlaps.
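
To make the set semantics concrete, here is a minimal sketch, assuming a TNU can be modelled as a plain set of member identifiers (what actually counts as a member is still an open question). The function names `rcc5` and `intersects` and the specimen labels are hypothetical, chosen only for illustration.

```python
def rcc5(a: frozenset, b: frozenset) -> str:
    """Return the RCC-5 relation of set a to set b (both assumed non-empty)."""
    if a == b:
        return "congruent with"
    if a > b:  # proper superset
        return "superset of"
    if a < b:  # proper subset
        return "subset of"
    if a & b:  # shared members, plus members unique to each side
        return "overlaps"
    return "does not overlap"


def intersects(a: frozenset, b: frozenset) -> bool:
    """True when at least one member is shared, i.e. not(a | b)."""
    return bool(a & b)


tnu1 = frozenset({"specimen1", "specimen2"})
tnu2 = frozenset({"specimen2", "specimen3"})
print(rcc5(tnu1, tnu2))        # overlaps
print(intersects(tnu1, tnu2))  # True
```

Knowing only that one specimen is shared is enough to evaluate `intersects`, whereas `rcc5` needs the full membership of both sets, which is the information a publication often does not give.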

(PS: @nielsklazenga asked me to open this issue to generate some discussion; we tentatively already added intersects to the GDoc.)

deepreef commented 4 years ago

I agree! and I like the term intersects for this purpose.

And on a related point, it's a bit fuzzy what sort of instances comprise these TNU classes. TNUs (cf. TaxonConcepts) can be represented as sets of other TNUs (TCs), or sets of individual Organisms, or some combination of both. Whether this matters, and how to address it if it does matter, are outside the scope of my expertise on this branch of information modelling.

jar398 commented 4 years ago

As long as this doesn't put us on a slippery slope toward including the 26 other RCC-32 relationships...

camwebb commented 4 years ago

@deepreef We just finished a conf. call chatting about publishing the standard, and the need to write very clearly and specifically about 'What is a TNU?', 'How does a TNU relate to previous concepts of a Taxon Concept?', etc. Your comment about what are the possible instances of a TNU class will need to be addressed (and no doubt will still need some further discussion!).

deepreef commented 4 years ago

Thanks, @camwebb . I've been mostly "off grid" for the past couple of months (for a variety of reasons), so my apologies for not being more engaged. Emerging now back into the world of internet. I'm certainly happy to help with this process in any way I can, going forward.

And I agree with @jar398 concerning the "slippery slope" issue. But as @camwebb notes, without a superproperty of congruent with, superset of, subset of, and overlaps, in the absence of knowledge determining which of these is correct (i.e., the vast majority of cases), I imagine people will be forced to select one of them (thereby asserting false precision). Moreover, I imagine there will be a lot of inconsistency in which one they select. For example, there was a time when I thought about defining overlaps as a generalized term that could mean any one of congruent with, superset of, subset of, or "overlapping but with excluded bits on both". However, that would rob us of the ability to explicitly assert "overlapping but with excluded bits on both" (and I imagine would also clash with more conventional uses of the term overlaps); hence my support for accommodating/defining intersects.

camwebb commented 4 years ago

@jar398 Wow - I had never looked up the meaning/context of Region Connection Calculus. Perhaps borrowing from RCC was unwise... RCC is spatial in context, and I'm not sure what it could mean taxonomically for two TNUs to be tangential?

(BTW, I couldn't find an RCC-32, but only 8, 23 and 62.)

deepreef commented 4 years ago

I'm not sure what it could mean taxonomically for two TNUs to be tangential?

Sister-species?

:-)

jar398 commented 4 years ago

I think borrowing from RCC-5 is fine; David Thau, Nico Franz, and others have been using it in taxonomy with success, although I agree we often don't have enough information to enable a choice. The richer RCCs are probably unsuitable. When I said RCC-32 I was referring to the R_32 lattice described in this paper of David's:

http://learningsite.com/resume/papers/edbt-2008.pdf

really just the power set of RCC-5, capturing uncertainty as to which RCC-5 relation applies.
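
As a rough illustration of "the power set of RCC-5", the sketch below enumerates every subset of the five base relations. The symbols and the names `RCC5`, `R32` and `INTERSECTS` are assumptions made for this example, not taken from the paper.

```python
from itertools import combinations

# Assumed shorthand: congruent, superset, subset, overlaps, does not overlap.
RCC5 = ("==", ">", "<", "><", "|")


def power_set(items):
    """Return all subsets of items as frozensets, including the empty set."""
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


R32 = power_set(RCC5)

# 'intersects' expressed as uncertainty over four of the base relations.
INTERSECTS = frozenset({"==", ">", "<", "><"})

print(len(R32))                       # 32
print(INTERSECTS in R32)              # True
print(frozenset(RCC5) - INTERSECTS)   # frozenset({'|'})
```

On this reading, intersects is one of the 32 subsets, and 32 minus the five base relations minus intersects leaves the 26 others mentioned earlier in the thread.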

Jonathan


nielsklazenga commented 4 years ago

I don't think it is so hard to see what 'tangential' could be in the taxonomic sense. It could just be the relationship of a subspecies and its parent species. Mostly we don't have enough information, but it might work in the context of an ordination. What it actually means in taxonomy is a different matter. I am also not too clear about that.

I think whether or not we borrow RCC-5 is more a question of the language we use. The terms from the TCS Taxon Concept Relationship Types we decided were worth keeping coincide with the RCC-5 terms (and it is quite handy that they do). Earlier publications, like Koperski et al. (2000), never mention RCC-5 but use the same terms. In the sense of the Standard Documentation Specification we should probably not "borrow" the RCC-5 terms, but rather have our own concepts that reflect types of horizontal relationships between taxon concepts and indicate that, in the RDF/OWL sense, some of them can be considered equivalent with the RCC-5 concepts, so they can be used to do the sort of reasoning Nico et al. have been doing (and we would like aggregators of taxonomic data to be able to do). Just like our Taxonomic Name Usage can be considered equivalent to the Darwin Core Taxon, which doesn't mean that they are necessarily exactly the same thing, but does enable us to link occurrences to taxonomic name usages.

nielsklazenga commented 4 years ago

For background, when we reviewed TCS last year, we decided to restrict the usage of the Taxon Concept Relationship Assertion class to horizontal relationships between taxon concepts. Most of the terms in the Taxon Relationship Type enumeration have been either accommodated elsewhere in the standard (e.g. parent-child relationships and vernacular names), left out until use cases present themselves on the basis of which we can decide where in the standard they can be best dealt with (e.g. hybrid and ambiregnal relationships), or considered superfluous (e.g. negatory terms like 'is not included in' and 'does not include').

The relationship types we were left with, 'is congruent to', 'includes', 'is included in', 'overlaps' and 'excludes', neatly coincide with the RCC-5 terms.

As @camwebb already indicated, the TNC core membership has already agreed with his suggestion to add the intersects relationship type in a teleconference. However, in today's teleconference we agreed that for every term we added for which there is no equivalent in either TCS or Darwin Core (this is pretty much it), we should open a GitHub issue, so we can open the discussion to a wider audience and so we have a record of our decision making.

deepreef commented 4 years ago

I tend to agree, but technically excludes is a bit of a wildcard (not too different from does not include), in that there are an almost incomprehensibly large number of cases of applying it. In my mind excludes is the default state, unless something else is indicated. I remember we discussed this -- did we come up with any explicit examples for when it would be useful to incorporate that kind of relationship assertion? I imagine cases where it could be useful, analogous to conducting a biodiversity survey and explicitly recording the fact that a particular species was not seen, even if it was expected to have been seen.

baskaufs commented 4 years ago

I just wanted to mention that we should probably take a look at Section 2.7.4 of the Darwin Core RDF Guide at some point. It basically punts on the issue of how to handle "taxon concepts" in anticipation of the work that this group is doing now. What it does is mint the term dwciri:toTaxon that is intended to link a dwc:Identification instance to "taxonomic entities" (taxon concepts, protonyms, taxon name uses, etc.) defined elsewhere. When TNC is done, then Section 2.7.4 should be rewritten to reflect what comes out of the TNC work.

Note that Section 2.7.4 says that dwciri:toTaxon links a dwc:Identification instance to a "taxon" (or whatever), not a dwc:Occurrence to a "taxon". That's assuming that there may be one to many identifications of the organism documented in the occurrence. As you know there is not yet a consensus graph model for biodiversity entities, so exactly what a dwc:Identification connects to on the other side is up for debate. Darwin-SW would say dwc:Organism.

deepreef commented 4 years ago

Darwin-SW would say dwc:Organism.

That's exactly how we have modeled it.

TNU<-Identification->Organism[<-Occurrence[<->Evidence]]

I think our "Evidence" corresponds to dsw:Token.
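
One way to read the arrow diagram above is as a set of record types, sketched here with Python dataclasses. All class and field names are hypothetical, and the two-way Occurrence/Evidence link is simplified to a single optional reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TNU:
    label: str


@dataclass(frozen=True)
class Organism:
    organism_id: str


@dataclass(frozen=True)
class Identification:
    # Identification points at both ends: TNU <- Identification -> Organism
    tnu: TNU
    organism: Organism


@dataclass(frozen=True)
class Evidence:
    evidence_id: str


@dataclass(frozen=True)
class Occurrence:
    # Occurrence -> Organism, with optional Evidence attached
    organism: Organism
    evidence: Optional[Evidence] = None


organism = Organism("organism-1")
identification = Identification(TNU("example TNU"), organism)
occurrence = Occurrence(organism, Evidence("evidence-1"))
print(identification.organism == occurrence.organism)  # True
```

The Identification and the Occurrence only meet at the Organism, so in this sketch a TNU is never attached directly to an Occurrence.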

ghwhitbread commented 4 years ago

I wish. Despite 30 years of asking curators/taxonomists to link determinations to taxonomic usage, it never happens. The arc on determination is always to name. In fact, determinations by working taxonomists could be TNU.

deepreef commented 4 years ago

In fact, determinations by working taxonomists could be TNU.

That's exactly how I deal with it. TNUs don't only exist in publications -- any form of documentation (including a determination slip or record) can serve as the "Reference" basis of a TNU. A "Reference" is very liberally defined as any documented date-stamped assertion by agent(s). Both the agent(s) and the date-stamp can be broad (e.g., "Generic Museum Staff", "1753-2020"), and "documented" can be a small slip of paper or an email or anything of that sort. Once you have a Reference and a Protonym, you have a TNU.

nielsklazenga commented 4 years ago

Conceptually, I have no problem with identifications being TNUs, but to preserve everybody's sanity, I think it will be good to keep them separate. I think we should take up @baskaufs's suggestion and include the Identification class from the Darwin Core RDF Guide in our discussions.

Despite 30 years of asking curators/taxonomists to link determinations to taxonomic usage, it never happens. The arc on determination is always to name.

In my mind, determinations can only be to taxa, but, indeed, nine (or ten) out of ten times, the name is all you get. In the data model of our collections database, the Determination table links to a Taxon table, but our Taxon is probably the same thing as the Plant Name in IBIS. It's now aggregators who link determinations to TNUs. In the Darwin Core RDF Guide, dwc:scientificName is a convenience term that is used for an Identification resource. The dwciri:toTaxon is now essentially a Taxon Relationship Assertion made by an aggregator. It would be so much better if aggregators wouldn't have to make this extra assertion, as determiners are in a much better position to do it. I think this is the other side of the equation. We can work on that after ratification of the standard.

I completely agree with @deepreef regarding excludes. If we can't come up with a use case, I would be happy to ditch it. Was going to open a new issue, but I think excludes is the opposite of intersects, so it is probably best to keep it in this issue.

deepreef commented 4 years ago

Having millions (or billions) of TNUs is not a problem, as long as there are sufficient properties to allow filtering. In this case, the property of relevance (i.e., determination slip vs. publication) is a property of the Reference, not the TNU itself, so outside our scope in this conversation. Organism instances need to be assigned to taxa via Identification instances anyway, so those records have to exist. The right thing to link an Identification instance to is a TNU, so I think we should proceed accordingly. In the 10%(?) of cases where Identifications are linked to publication-based TNUs, the model accommodates it elegantly. In the other 90% of cases the "empty" ("vacuous"? "anemic"?) determination-based TNUs at least serve the purpose of linking an Organism to a Protonym, which is most of the need anyway (conceptually the same as linking an Organism to a Name).

So, for example, we have John Smith identifying Specimen 1234 as "Aus bus". A TNU is minted for Aus bus sec Smith. We have no other meaningful information for this TNU because its associated Reference is a slip of paper attached to the specimen (i.e., determination slip). At the very least, the TNU for Aus bus sec Smith is linked to the Protonym (e.g., Aus bus Jordan 1920). So now we've connected Specimen 1234 to the Protonym of Aus bus. From there we look up the treatment of a major meta-authority like CoL to see what they think the Protonym for bus should now be treated as. There we find that the meta-authority follows Jones 1997, who considered Aus bus a junior synonym of Xus dus (Linnaeus 1758). That is, the TNU for Aus bus Jordan 1920 sec Jones 1997 links to Xus dus (Linnaeus 1758) sec Jones 1997 as the valid taxon.

We may never know what Taxon Concept John Smith had in mind when he identified Specimen 1234 as "Aus bus", but at least we now have the potential for discovering Specimen 1234 in the context of Xus dus. You don't get that if Specimen 1234 is simply linked to the text string "Aus bus", nor do we get it if Specimen 1234 is linked via a dwc:Identification instance, that itself is linked to some non-usage-based "Taxon Name" listing.

Obviously, we can be much more confident in bridging the gap between John Smith's identification of Specimen 1234 to CoL's treatment of Xus dus if we have a more robustly defined taxonomic concept associated with the TNU for Aus bus sec Smith. But even without that, we know:

  1. John Smith believed that Specimen 1234 is conspecific with the holotype specimen of Aus bus Jordan 1920.

  2. Jones 1997 believed that the holotype specimen of Aus bus Jordan 1920 should be considered conspecific with the holotype specimen of Xus dus (Linnaeus 1758).

  3. CoL trusts the opinion of Jones 1997.

  4. We trust the opinion of CoL.

The weakest link in the chain is how much we trust that John Smith didn't grossly misidentify Specimen 1234 in his evaluation that it should be regarded as conspecific with the holotype specimen of Aus bus Jordan 1920. But that link is weak no matter how we process the data. And there are many potential pitfalls. But what's important is that we can bridge the gap between our taxon of interest (Xus dus (Linnaeus 1758)) and Specimen 1234 -- which in the vast majority of cases we cannot do now.

I'll gladly take "a hell of a lot better than what we have now", even if doing so also means "not quite perfect".
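
The chain of links in the Aus bus example can be sketched as a few lookups. The dictionaries below are invented stand-ins for real data stores, and `resolve` is a hypothetical helper; the only point is that the shared Protonym is what lets the specimen reach the meta-authority's treatment.

```python
# Hypothetical records: TNU label -> (Protonym, Reference).
tnus = {
    "Aus bus sec Smith": ("Aus bus Jordan 1920", "Smith determination slip"),
    "Aus bus sec Jones 1997": ("Aus bus Jordan 1920", "Jones 1997"),
    "Xus dus sec Jones 1997": ("Xus dus (Linnaeus 1758)", "Jones 1997"),
}
# Identification: specimen -> TNU minted from the determination slip.
identifications = {"Specimen 1234": "Aus bus sec Smith"}
# Synonymy within the followed treatment: junior-synonym TNU -> valid TNU.
synonym_of = {"Aus bus sec Jones 1997": "Xus dus sec Jones 1997"}
# The Reference that the meta-authority (e.g. CoL) follows.
authority_follows = "Jones 1997"


def resolve(specimen: str):
    """Follow specimen -> TNU -> Protonym -> authority's TNU -> valid TNU."""
    determination_tnu = identifications[specimen]
    protonym, _reference = tnus[determination_tnu]
    for tnu, (other_protonym, reference) in tnus.items():
        if other_protonym == protonym and reference == authority_follows:
            return synonym_of.get(tnu, tnu)
    return None  # the authority has no treatment of this Protonym


print(resolve("Specimen 1234"))  # Xus dus sec Jones 1997
```

If the specimen were linked only to the text string "Aus bus", the first lookup would have no Protonym to follow and the chain would stop there.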

nielsklazenga commented 4 years ago

That sounds exactly like what we have now, not something that is "a hell of a lot better than what we have now". What exactly do you gain by confounding Identifications and TNUs? I think we would gain a lot more if determiners would link an identification to a TNU that provides more context, like a flora or a field guide or a checklist.

It really depends on what standard you want to use. I am not for a moment suggesting we should have an Identification class, or borrow (sensu SDS) the Darwin Core Identification class. If you want to deal with identifications in TCS/TNU, they are Taxonomic Name Usage instances; but, if you deal with them in Darwin Core, they are Identification instances.

deepreef commented 4 years ago

I think we're not quite communicating here. I don't know what you mean by "confounding Identifications and TNUs". dwc:Identifications are a different class of entity than a TNU (the former explicitly involves an Organism, the latter does not).

Yes, of course we would gain a lot more if determiners would link a dwc:Identification to a TNU that is anchored to a publication. But as you and @ghwhitbread pointed out, most specimens and other Organism instances do not include this level of information, and it would require a RelationshipAssertion to bridge the gap.

My point was that a dwc:Identification linked to a vague "taxon name" or text string name, or anything other than a TNU robs us of the cross-link to a Protonym. Having the Protonym link ties us into the entirety of other TNUs linked to the same Protonym.

We certainly don't need to replicate dwc:Identification. But we need to make sure that our TNU instances are broad enough in scope such that all dwc:Identification instances can link to a TNU. That includes the 10% (your number) that are based on publications, as well as the 90% (?) that are not.

nielsklazenga commented 4 years ago

Ah, I see what you mean. If you want to link everything up nicely in the same system, you would have to do something like that. I think we're fine with the scope of the TNU. It is more the lacking context that is an issue, which is not different from some TNUs from literature.

deepreef commented 4 years ago

Right -- it doesn't have too much direct impact on what we do, because the "scope" in question is not about TNUs per se, but rather the scope of References that can serve as the basis for TNUs. My only point is that, at least from the TNC side of things (documentation, etc.), we should not impose or imply any real constraints on that scope of References (i.e., we should be clear that TNUs do not only exist within publications, but can exist in a wide variety of forms of documentation).

camwebb commented 4 years ago

I thought it would be informative to try to model @deepreef's example as RDF (Turtle). See this gist. Image (via dot):

[image: dot rendering of the RDF graph]

Discovered:

  1. In general our new terms are working well,
  2. We may need to add a class of assertions about the accepted/synonym status made by a third party,
  3. Had to revert to TCS VOC to make a simple synonym statement.

camwebb commented 4 years ago

On another point, I think it is important to keep excludes. @deepreef says "In my mind excludes is the default state" but... surely 'no data' is the default, and excludes is very informative. In the same way that a '0' in an ecological site-by-species occurrence table is very different from a 'N/A'. A use case example would be the usage of the same specific epithets by two authors, in parallel and without communication, for completely different taxa. Surely it's important to be able to succinctly say that these TNUs do not intersect?

deepreef commented 4 years ago

surely 'no data' is the default, and excludes is very informative

Yes, of course you're right! I guess I should have said is the "default assumption", not the default value. I definitely get the ecological site-by-species (explicit assertion of absence), but I was trying to think of a taxonomic example. Probably the best example would be in cases of misidentification. For example, if a TNU represents a misidentification (which I define as a case where the person making the identification would not have included the type specimen of a name within the implied taxon concept of that name by that person), then in many cases it may be valuable to explicitly document an "excludes" relationship (e.g., Aus bus L. sec Smith excludes Aus bus L. sec Jones).

Many thanks also for the graph! I'll spend some time looking through it this evening.

deepreef commented 4 years ago

Just to clarify a bit more about what I meant by "default assumption". When you record '0' in an ecological dataset, you do so because there is some reason to expect that species to have been recorded within the survey, but it was not observed. By explicitly including a '0' in the dataset, you're eliminating the possibility that the species was left off the list by mistake. Otherwise, in an ecological dataset consisting of 100 species, you don't add 1,999,900 additional '0' records representing every other known species on Earth that also wasn't observed in the survey. Likewise, I don't think we need to explicitly list every possible pair-wise combination of "excludes" TNUs. That could get out of hand....
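
The difference between an explicit "excludes" and simply having no data can be sketched as a sparse store that records only the pairs someone has actually asserted. This assumes relationships are keyed by unordered pairs, so it handles only the symmetric relationship types; the names `assert_relationship` and `relationship` are hypothetical.

```python
from typing import Optional

# Only symmetric relationship types fit an unordered-pair key.
SYMMETRIC = {"congruent with", "overlaps", "intersects", "excludes"}

assertions: dict = {}  # frozenset of two TNU labels -> relationship type


def assert_relationship(a: str, b: str, rel: str) -> None:
    """Record an explicit assertion about one pair of TNUs."""
    if rel not in SYMMETRIC:
        raise ValueError(f"{rel!r} is directional; this sketch stores symmetric types only")
    assertions[frozenset((a, b))] = rel


def relationship(a: str, b: str) -> Optional[str]:
    """Return the asserted relationship, or None when there is no data."""
    return assertions.get(frozenset((a, b)))


assert_relationship("Aus bus L. sec Smith", "Aus bus L. sec Jones", "excludes")
print(relationship("Aus bus L. sec Jones", "Aus bus L. sec Smith"))  # excludes
print(relationship("Aus bus L. sec Smith", "Xus dus sec Jones"))     # None
```

Unasserted pairs cost nothing to store, and a `None` result stays distinguishable from a deliberate "excludes", much like 'N/A' versus '0' in the site-by-species table.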

Also, I just saw your use case for the taxonomy parallel, and I'm not sure I understand:

A use case example would be the usage of the same specific epithets by two authors, in parallel and without communication, for completely different taxa. Surely it's important to be able to succinctly say that these TNUs do not intersect?

If by "same species epithets" you mean homonyms, then there would be no confusion because they would be linked to different Protonyms (and informatically would be no different than any two random TNUs). But if you mean that two non-communicating authors independently used the same protonym-based epithet (i.e., both referred to Aus bus L.; not a case of homonymy), then at least one of the two authors would need to have incorporated a misidentification. Otherwise, they would at the very least overlap at the type specimen.

camwebb commented 4 years ago

@deepreef Yes, I see your point(s) above. But I'm not convinced we will never need the ability to succinctly say that TNU1 does not intersect with TNU2. Another example, perhaps trivial, but perhaps also sometimes needed, is this: field botanist Smith consistently misidentifies a taxon/TNU as 'Aus bus Jordan 1920'. If we want to encode this information (and herein lies the question: why would we?) we would want to say :AusBusJordan1920SecSmith2020 tnu:excludes :AusBusJordan1920SecJordan1920.

I'll give it some more thought and try to come up with a better use case.

nielsklazenga commented 4 years ago

I found two examples (stopped looking after that) in Koperski et al. (2000):

The first one is a misapplication of the name Campylopus introflexus by Moenkemeyer. In the second example Frahm and Frey (1992) explicitly exclude (the type of) Dicranella schreberi var. robusta Schimp. ex Braithw. from Dicranella schreberiana.

I am off to the airport soon, so won't respond for a few days.

deepreef commented 4 years ago

Sounds good! I think we are in agreement!

nielsklazenga commented 4 years ago

Back to linking identifications to TNUs and not restricting scope of references, I think we should leave dcterms:BibliographicResource (which we replaced the TCS Reference with) out of the standard document, so people can choose for themselves what type of object they use. For TNUs that are associated with identifications, oa:Annotation seems to me a good candidate.