tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Change term - identificationQualifier #244

Open ianengelbrecht opened 4 years ago

ianengelbrecht commented 4 years ago

This is what is being evaluated

Change term

Current Term definition: https://dwc.tdwg.org/terms/#dwc:identificationQualifier

Proposed new attributes of the term:

This is how it began

Original comment: The current definition is 'A brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the Identification.' and the examples include the names following the qualifier. Suggested to update the definition to specify that the value provided should include the names following the qualifier and not just the qualifier itself.

pzermoglio commented 4 years ago

I propose a different view: keep qualifier independent of the actual name, but applied to the rank: "cf. rank", "aff. rank" e.g.: "cf. genus", "aff. specificEpithet"

The definition would then turn into something like:

In that way a reasonable controlled vocabulary could be built and used for this term (it would include the many combinations of doubt terms and ranks). The ranks could be either those already available as existing DwC terms (although some ranks, e.g. varietas and subfamily, are not in DwC) or pulled from a vocabulary for taxonRank.
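A minimal sketch of how such a vocabulary could be enumerated, assuming two short input lists. `DOUBT_TERMS`, `RANKS`, and `build_vocabulary` are hypothetical names used for illustration only; neither list is an official Darwin Core vocabulary.

```python
from itertools import product

# Assumed inputs for illustration only; neither list is an official vocabulary.
DOUBT_TERMS = ["cf.", "aff."]
RANKS = ["genus", "specificEpithet", "infraspecificEpithet"]


def build_vocabulary(doubt_terms, ranks):
    """Combine every doubt term with every rank, e.g. 'cf. genus'."""
    return [f"{term} {rank}" for term, rank in product(doubt_terms, ranks)]


vocabulary = build_vocabulary(DOUBT_TERMS, RANKS)
print(vocabulary)
```

The list grows as the product of the two inputs, so adding ranks that are not DwC terms (varietas, subfamily) enlarges the vocabulary quickly.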

@ianengelbrecht do you see any conflicts for this view?

qgroom commented 4 years ago

identificationQualifier can only refer to the lowest taxonomic level of the identification, because how can you know what the lowest identification is if you're not certain of the higher level? I don't think there is a need to qualify this further, as the identificationQualifier can always refer to dwc:scientificName.

If I am misunderstanding your point, perhaps you could provide an example.

ianengelbrecht commented 4 years ago

Looking at the examples, the qualifier does apply to ranks higher than the lowest name in some cases, e.g. 'aff. agrifolia var. oxyadenia'. My suggestion was only to clarify the definition to make it consistent with the examples provided. Your suggestion might work @pzermoglio, but it means a different definition and different data in the field from what people may have been providing until now. It might also be difficult to work with if only scientificName and none of the rank terms are provided, as we can't easily identify which name part the qualifier belongs to without specifically parsing the scientificName value and getting the ranks of each part.

qgroom commented 4 years ago

The example is misleading, because I imagine the intention is to say that the organism has affinity with the variety oxyadenia, even though the prefix is placed before the specific epithet. I'm not certain of normal taxonomic practice, but I doubt they mean that it has affinity only with Quercus agrifolia, because if this were the case why would they mention the variety?

I suggest changing the example.

ianengelbrecht commented 4 years ago

Okay yes agreed. My feeling then is that if it always applies to the lowest identified taxonomic rank then the names are not actually required and the field can just include 'aff' or 'cf'. Issue #181 has a proposal for verbatimScientificName so that could capture the qualifier at the specific location in the name where the identifier put it if adopted.

nielsklazenga commented 3 years ago

I prefer the way the Darwin Core identificationQualifier examples do this, i.e. giving the qualifier plus everything that comes after it. ABCD has an identification qualifier insertion point, which is more like what @pzermoglio suggests. However, I think this is more about implementation than definition, so possibly better dealt with in a usage note, or not in the standard at all.

I agree with @qgroom that qualifiers for anything other than the lowest taxonomic level to which something is identified do not make sense, but that does not mean that I have never seen it happen. In fact, in the discussions we had when we replaced our collections database ten years ago, our botanists insisted on having the option. I myself would never use identification qualifiers at all, but rather write whole essays in identificationRemarks.

dagendresen commented 3 years ago

Apropos, there was also a discussion on an identificationQualifier vocabulary at the HISPID GitHub (also concluding against a controlled vocabulary):

tucotuco commented 3 years ago

@pzermoglio You proposed an alternative treatment of identificationQualifier that isn't actually under consideration in this term change proposal, but for the sake of being certain about consensus I would like to know if the proposal as it stands is acceptable.

EstebanMH-SiB commented 3 years ago

Usually, if there is a doubt in the identification (e.g. Ficus cf. cuatrecasasiana), we recommend that the publisher put the doubt in identificationQualifier (cf. cuatrecasasiana) and keep only the part that has taxonomic certainty in scientificName (Ficus).

Additionally, the use of "sp." in the examples doesn't seem a right use of this element, because "sp." could be documented in the field verbatimTaxonRank, so there is no need to put it here.

We make this comment on behalf of @SiBColombia

tucotuco commented 3 years ago

@EstebanMH-SiB Your usage 'cf. cuatrecasasiana' is in agreement with the examples. I agree with you about the appropriate treatment of 'sp.', but we commonly see 'sp.' anyway.

In cases where the species has not yet been described but is recognized as being a new species, we see the pattern 'Genus sp. nov. X'. In these cases the 'sp. nov. X' should go in identificationQualifier. Would this be a better example to add?

RicardoOrtizG commented 3 years ago

@tucotuco We agree to add 'sp. nov. X' instead of 'sp.' as an example. We think it will clarify the cases when an sp. has to be documented in identificationQualifier or verbatimTaxonRank.

pzermoglio commented 3 years ago

I agree that the qualifier should always apply to the lowest taxonomic rank -in spite of how it is actually used in many cases. What @nielsklazenga proposes of giving the qualifier plus everything that comes after would not be consistent across qualifiers, right?

Eg: "?" is used at the end, as in "Meristina furcata?", where what is in question is the species, the lowest rank. So it would never have anything after "?". "cf.", instead, is used in the middle, as in "Quercus agrifolia cf. var. oxyadenia", where what is in question is the var., the lowest rank. Because of how we write it, it will always have something after "cf.".

I don't see why we would want to be inconsistent. What would be the advantage of having any name included in the field at all if we will always be referring to the lowest rank?

nielsklazenga commented 3 years ago

What @nielsklazenga proposes of giving the qualifier plus everything that comes after would not be consistent across qualifiers, right?

I did not propose this. It is what it is now, judging from the examples.

| verbatimIdentification | scientificName | identificationQualifier |
| --- | --- | --- |
| Meristina ?furcata | Meristina furcata | ?furcata |
| ?Meristina furcata | Meristina furcata | ?Meristina furcata |
| Meristina furcata? | Meristina furcata | ? |
| Meristina cf. furcata | Meristina furcata | cf. furcata |
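The convention in these rows (the qualifier plus everything that follows it) can be sketched as a small parser. This is an illustration only, not part of Darwin Core: `split_identification` is a hypothetical helper, and it assumes the only qualifiers are "cf.", "aff." and "?".

```python
def split_identification(verbatim):
    """Return (scientificName, identificationQualifier) for a verbatim string.

    Assumption: only 'cf.', 'aff.' and '?' are recognised as qualifiers.
    """
    # A trailing '?' has nothing after it, so the qualifier is just '?'.
    if verbatim.endswith("?"):
        return verbatim[:-1].strip(), "?"
    tokens = verbatim.split()
    for i, token in enumerate(tokens):
        if token in ("cf.", "aff."):
            # Qualifier is a separate token: drop it from the name.
            name = tokens[:i] + tokens[i + 1:]
            return " ".join(name), " ".join(tokens[i:])
        if token.startswith("?"):
            # Qualifier is glued to a name part: strip the '?' from the name.
            name = tokens[:i] + [token[1:]] + tokens[i + 1:]
            return " ".join(name), " ".join(tokens[i:])
    # No qualifier found.
    return verbatim, ""


for text in ("Meristina ?furcata", "?Meristina furcata",
             "Meristina furcata?", "Meristina cf. furcata"):
    print(text, "->", split_identification(text))
```

Under the same assumptions, "Quercus agrifolia cf. var. oxyadenia" yields the name "Quercus agrifolia var. oxyadenia" and the qualifier "cf. var. oxyadenia".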

In ABCD:

| verbatim identification | FullScientificNameString | IdentificationQualifier | IdentificationQualifier@insertionPoint |
| --- | --- | --- | --- |
| Meristina ?furcata | Meristina furcata | ? | 2 |
| ?Meristina furcata | Meristina furcata | ? | 1 |
| Meristina furcata? | Meristina furcata | ? | (?) |
| Meristina cf. furcata | Meristina furcata | cf. | 2 |

tucotuco commented 3 years ago

This proposal has been labeled as 'Controversial'. It will remain open for public review in pursuit of a consensus solution for another 30 days, but will not be included in the release to be prepared from the public review of 2021-05-01/2021-05-31.

tammyhorton commented 3 years ago

I'm not sure if this is helpful, but we wrote a paper on the use of open nomenclature terms in image-based identifications (where there are a lot of uncertainties). We gave examples of the use of the identification qualifier and how to input it to Darwin Core, and this might be of some help here?

Horton et al 2021. Recommendations for the Standardisation of Open Taxonomic Nomenclature for Image-Based Identifications

https://doi.org/10.3389/fmars.2021.620702

We provide examples to standardise the terms used, indicate what the field should contain, and also recommend use of the identificationRemarks field to explain the qualifier.

ianengelbrecht commented 3 years ago

Great to see another article proposing standardized use of Open Nomenclature terms. I hope there will be an opportunity to discuss identifications again at the upcoming TDWG working sessions after the conference, I feel this is something that needs to be adopted. I've been trying to advocate for standardized use of these terms in our community here in South Africa again, but I'm receiving quite substantive back-pressure still...

tucotuco commented 3 years ago

Public review of this issue has now concluded with objections to the proposed change. The issue will remain open for discussion and potential resolution.

ymgan commented 1 year ago

An identificationQualifier cf. var. oxyadenia for Quercus agrifolia cf. var. oxyadenia with accompanying values Quercus in genus, agrifolia in specificEpithet, oxyadenia in infraspecificEpithet, var. in taxonRank, and Quercus agrifolia var. oxyadenia in scientificName.

I have difficulty understanding this example above as this is not how I understood how it works. After discussing with @albenson-usgs and @pieterprovoost, these are the points that we find confusing:

The definition of scientificName states:

When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the identificationQualifier term.

For Quercus agrifolia cf. var. oxyadenia, does that not mean that var. oxyadenia is in doubt? Then why is var. oxyadenia in the scientificName?

The definition of infraspecificEpithet states:

The name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.

If the definition of scientificName states that it should contain lowest level taxonomic rank that can be determined, then why is oxyadenia in infraspecificEpithet when it is in identificationQualifier?

Likewise, the definition of taxonRank states:

The taxonomic rank of the most specific name in the scientificName.

I am not sure if I understand why var. oxyadenia is in scientificName in this example and hence, I don't understand why var. is the taxonRank. Why not species in taxonRank?

If it is me not understanding how taxonomy works, please help me to understand this. Thanks a lot!

deepreef commented 1 year ago

@ymgan these are really good points. We wrestle with this all the time, and I would very much value insights from others. Stated in slightly different terms, given the source text string:

"Quercus agrifolia cf. var. oxyadenia"

I could interpret it two different ways:

One is that the asserted taxon is Quercus agrifolia var. oxyadenia, and the identificationQualifier "cf." is added to say something to the effect of "I'm pretty sure this is the correct identification, but I'm not confident because it might be a different but similar taxon".

The other is that the asserted taxon is Quercus agrifolia, and the identificationQualifier "cf. var. oxyadenia" is added to say something to the effect of "I'm highly confident that this is Quercus agrifolia, but it seems to share affinities with a more precise taxon for the variety oxyadenia within this species."

I would be inclined to parse it according to the first interpretation, so that it shows up for a search on the variety Q. a. var. oxyadenia, but then is flagged with identificationQualifier "cf." in order to cast some doubt on the confidence of the identification.

nielsklazenga commented 1 year ago

Likewise, I myself think of identificationQualifiers as identification disqualifiers, so I would choose option 2 in @deepreef's comment above, but when we had the discussion in my own community, I was outnumbered, so option 1 is what we are exchanging. There is something to be said for option 1 as well. It is just quite hard to get into the head of the determiner. That's why, as a determiner, I would avoid identification qualifiers altogether (certainly with infraspecific taxa like in this example), but the situation is of course entirely different if you have to enter a determination by someone else into a database.

I think this falls within the 'application scheme' realm, as individual communities might be closer to recorders and determiners in the community and can also set guidelines as to when and how to use identification qualifiers.

Notwithstanding the above, I think it is good to have this discussion, as it would be worthwhile to have a note about this somewhere in Darwin Core. Just noting that it would be much better to do this in a new issue, as this issue is about a proposal to change the definition of identificationQualifier (arguably a different term from the one we are talking about here) and is more than three years old.

deepreef commented 1 year ago

I agree it's impractical to get into the head of the person who asserted the identification, but my preference for option 1 is more practical. Dave Remsen often spoke of "Recall" vs. "Precision". Very briefly, "recall" represents the inclusion of records of possible interest within the result set (i.e., reducing the likelihood of missing records of interest). "Precision" is minimizing the inclusion of records outside the scope of interest in the result set.

My feeling is that option 1 supports "recall" in that a search for the epithet "oxyadenia" is more likely to return this record if the value is included as infraspecificEpithet than it would if that value were only represented in identificationQualifier; and the caveat on "precision" is provided by the inclusion of identificationQualifier:"cf" for the record.

Of course, this depends entirely on how the query logic is programmed... but it still seems to me that erring on the side of parsing out the infraspecificEpithet in such cases, rather than dumping it into identificationQualifier, slightly favors result sets that people would want. But of course, I'm probably just making this up.
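A minimal sketch of that recall/precision trade-off, assuming the simplest possible query logic (an exact match on infraspecificEpithet). The three records and the `search_epithet` helper are hypothetical and only illustrate the argument.

```python
# Hypothetical records for illustration only.
records = [
    # Option 1 style: epithet parsed out, qualifier kept short.
    {"id": "A", "scientificName": "Quercus agrifolia var. oxyadenia",
     "infraspecificEpithet": "oxyadenia", "identificationQualifier": "cf."},
    # Option 2 style: uncertain epithet lives only in the qualifier.
    {"id": "B", "scientificName": "Quercus agrifolia",
     "infraspecificEpithet": "", "identificationQualifier": "cf. var. oxyadenia"},
    # Unqualified identification.
    {"id": "C", "scientificName": "Quercus agrifolia var. oxyadenia",
     "infraspecificEpithet": "oxyadenia", "identificationQualifier": ""},
]


def search_epithet(records, epithet, include_qualified=True):
    """Return ids of records whose infraspecificEpithet equals `epithet`."""
    hits = [r for r in records if r["infraspecificEpithet"] == epithet]
    if not include_qualified:
        # Favour precision: drop anything carrying a qualifier.
        hits = [r for r in hits if not r["identificationQualifier"]]
    return [r["id"] for r in hits]


print(search_epithet(records, "oxyadenia"))                           # ['A', 'C']
print(search_epithet(records, "oxyadenia", include_qualified=False))  # ['C']
```

With this query logic, record B is never returned for "oxyadenia", whichever setting is used; a search that also looked inside identificationQualifier would behave differently.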

I do think it's relevant to the discussion on identificationQualifier; whether or not a new issue is warranted is beyond my paygrade.

ymgan commented 1 year ago

Thank you very much @deepreef and @nielsklazenga! I appreciate the explanations! The reason I commented here is because I find the example seems to be contradicting the definition of multiple Darwin Core terms based on my understanding. I hope something could be done to clarify it under this term change proposal.

From the perspective of a thematic portal (the Antarctic biodiversity portal and Antarctic GBIF/OBIS node here), where the data is important for modelling and for providing ocean statistics, we tend to take the conservative approach of including only information that is certain. I believe our approach is close to what is described as "Precision", as well as to @tammyhorton's paper in this comment.

Back to the example, "Quercus agrifolia cf. var. oxyadenia". To me, both of the options are not certain that it is "Quercus agrifolia var. oxyadenia", even though I have now learned that there are different degrees of confidence between the 2 options in the comment. This is what our current approach looks like for this example, based on my understanding:

| field | value |
| --- | --- |
| scientificName | Quercus agrifolia |
| identificationQualifier | cf. var. oxyadenia |
| genus | Quercus |
| specificEpithet | agrifolia |
| infraspecificEpithet | |
| taxonRank | species |
| verbatimIdentification | Quercus agrifolia cf. var. oxyadenia |

I acknowledge that there will be loss of information and I do not know what is the best way to represent the information when different degrees of confidence in the identification could be important to some. I leave my comment here, hoping that at least, there could be some clarification in the term definition/comment. Thanks again!

deepreef commented 1 year ago

Thanks, @ymgan

To me, both of the options are not certain that it is "Quercus agrifolia var. oxyadenia",

Yes, I would agree with this. Indeed, I think in almost all cases, any non-null value for identificationQualifier should be an indication that the person who is making the identification was not necessarily asserting that exact identification. (Of course, there is no such thing as a "certain" identification, except for name-bearing type specimens -- but that's another topic.) I would be curious whether there are any values of identificationQualifier that would increase the confidence of the identification (relative to no value given for identificationQualifier); or if, indeed, all values for this term serve to reduce confidence in the identification. As written, the definition implies that all provided values reduce confidence (i.e., "doubts").

This raises another issue: If we accept that values for identificationQualifier should apply to the name in the lowest taxonomic rank provided in the scientificName, then can one confidently accept all components of the identification above the lowest taxonomic rank as unqualified confidence? In other words, for use-cases such as those described by @ymgan -- where only high-confidence identifications are used -- would it be appropriate to incorporate the following logic in interpreting the data:

If any non-null value is provided for identificationQualifier, then disregard the name provided for the lowest taxonomic rank, and instead use the next-higher taxonomic rank as the confident identification

If this is appropriate logic, then it supports my first option for parsing values (i.e., going with only 'cf.' for the identificationQualifier, and 'oxyadenia' for the infraspecificEpithet). This logic would thus yield "Quercus agrifolia" as the confident identification. But if the entire text "cf. var. oxyadenia" was provided as identificationQualifier, and no value given for infraspecificEpithet (which I realize is not how the example is framed), then the above logic would result in a high-confidence identification of simply "Quercus".
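The two outcomes described here can be reproduced with a short sketch. `RANK_FIELDS` and `confident_identification` are hypothetical names, and the sketch assumes only three rank-bearing fields; for simplicity it also leaves out rank markers such as "var." when joining the name parts.

```python
# Rank-bearing Darwin Core fields, ordered from highest to lowest rank.
# Assumption: only these three fields are considered.
RANK_FIELDS = ("genus", "specificEpithet", "infraspecificEpithet")


def confident_identification(record):
    """Drop the lowest populated rank whenever a qualifier is present."""
    parts = [record[field] for field in RANK_FIELDS if record.get(field)]
    if record.get("identificationQualifier"):
        # Any non-null qualifier casts doubt on the lowest rank provided.
        parts = parts[:-1]
    return " ".join(parts)


# First option: short qualifier, epithet parsed into its own field.
option_1 = {"genus": "Quercus", "specificEpithet": "agrifolia",
            "infraspecificEpithet": "oxyadenia",
            "identificationQualifier": "cf."}

# Whole phrase in the qualifier, no infraspecificEpithet supplied.
qualifier_only = {"genus": "Quercus", "specificEpithet": "agrifolia",
                  "infraspecificEpithet": "",
                  "identificationQualifier": "cf. var. oxyadenia"}

print(confident_identification(option_1))        # Quercus agrifolia
print(confident_identification(qualifier_only))  # Quercus
```

The same rule gives different answers for the same specimen depending on how the provider filled in the fields, which is why consistent usage matters for logic of this kind.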

My point is that we should craft the definition and associated descriptions for this term in such a way that they are explicit enough that data providers will be consistent, such that logic of the sort described above will work reliably. It seems to me that the example for "Quercus agrifolia cf. var. oxyadenia" should only include "cf." for the identificationQualifier, as the "var." part and the "oxyadenia" part are already represented separately, and shouldn't be repeated within the identificationQualifier value. That would also be consistent with the definition, "A brief phrase or a standard term..."

tammyhorton commented 1 year ago

An interesting discussion!

We have been interpreting this as I stated in the paper mentioned above, and as @ymgan indicates. The scientific name is used according to the definition of scientificName, that it should contain 'the name in lowest level taxonomic rank that can be determined', and any text following this scientificName is placed in the identificationQualifier field; so, in this case, "cf. var. oxyadenia". That is, we know this is Quercus agrifolia, but we need further information to confirm whether it is of the variety oxyadenia.

Identification remarks are used to explain the reasoning and we are encouraging this to be used, although it is often lacking. In our work, the majority of names are accompanied by an identification qualifier of some sort, usually stet. or indet., but we can be confident in the scientificName value as the lowest level taxonomic rank that can be determined with certainty (as certain as ANY identification can be - As @deepreef indicates!).

The example given for the term in the Darwin Core Quick Reference Guide uses:

cf. var. oxyadenia (for Quercus agrifolia cf. var. oxyadenia with accompanying values Quercus in genus, agrifolia in specificEpithet, oxyadenia in infraspecificEpithet, and var. in taxonRank)

To me this indicates that "cf. var. oxyadenia" should be placed in the identificationQualifier field, but I agree with @ymgan that it seems strange to then also put oxyadenia in infraspecificEpithet and var. in taxonRank, since the taxon has only actually been determined to the species level with any confidence.

I think that for data usage we should be working on confident identifications, and we also need to think about the usage of open nomenclature terms in addition to cf. and aff., and standardise the usage of all of these. The usage of stet. and indet. results in no lower identification beyond that, but we still need to refer to the lowest taxonomic level determined in scientificName. By including oxyadenia in the specific epithet we begin mixing identifications that are 'confident' with interpretations of identification which are usually not known by the data user.

@deepreef's logic of

If any non-null value is provided for identificationQualifier, then disregard the name provided for the lowest taxonomic rank, and instead use the next-higher taxonomic rank as the confident identification

therefore does not make the most sense to me.

deepreef commented 1 year ago

Many thanks, @tammyhorton! This is really helpful to me, and has given me a new perspective on how to think about the best ways to represent content for scientificName, identificationQualifier, and the various other related "parsed" values.

I had always thought of scientificName as being something along the lines of "the complete text string used to identify the organism, including authorship, except expanding abbreviations to full names and removing any identification qualifiers".

So, if the determination label on the specimen said something like "Q. agrifolia cf. var. oxyadenia", and I was extremely confident that "Q." was an abbreviation of "Quercus", then I would populate dwc:scientificName with:

"Quercus agrifolia var. oxyadenia"

[with "Q." expanded to "Quercus", and the "cf." extracted]

I would then present "cf." as the value for dwc:identificationQualifier.

Here, for reference, is the definition of dwc:scientificName:

The full scientific name, with authorship and date information if known. When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the IdentificationQualifier term.

So, my reading of "full scientific name" combined with "lowest level taxonomic rank that can be determined" would yield "Quercus agrifolia var. oxyadenia"; and the "identification qualification" of "cf." would be excluded from scientificName and instead be presented as the corresponding value for dwc:identificationQualifier.

But if I understand your perspective correctly, the existence of the "cf." qualifier excludes the "var. oxyadenia" bit from being part of the "can be determined" aspect of an identification, and only the "Quercus agrifolia" meets the threshold for "can be determined".

I think this perspective is entirely valid -- especially when "precision" is favored over "recall" -- and points to a potential ambiguity in the definition of dwc:scientificName, as well as a potential inconsistency in the definition of dwc:identificationQualifier. The potential inconsistency is that the definition of dwc:identificationQualifier:

A brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the Identification.

...seems to favor that only the "cf." part should be represented for this term.

I think I slightly favor the interpretation that scientificName in this case should be "Quercus agrifolia var. oxyadenia", and identificationQualifier should simply be "cf."; but I recognize that this preference is probably entirely because I've always thought of it this way.

I guess it comes down to the distinction I mentioned earlier, which could be simplified to:

1) The value of scientificName is as complete as possible, and the value of identificationQualifier is intended to cast doubt on the lowest level taxonomic rank part of the scientificName; vs.

2) The value of scientificName should only be the identification expressed with confidence, and the value of identificationQualifier is intended to provide additional information about a possible but uncertain more-precise taxonomic identification.
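The two readings can be put side by side for the running example. This is a sketch only: `interpret` is a hypothetical helper, and it assumes the verbatim string has exactly the shape "Genus species qualifier rank epithet". The taxonRank values follow the examples quoted earlier in this thread ("var." in the Quick Reference Guide example, "species" in @ymgan's table).

```python
def interpret(verbatim, option):
    """Map a verbatim identification to Darwin Core fields under option 1 or 2.

    Assumption: input has exactly five tokens,
    'Genus species qualifier rank epithet'.
    """
    genus, species, qualifier, rank, epithet = verbatim.split()
    if option == 1:
        # scientificName as complete as possible; qualifier casts doubt on it.
        return {
            "scientificName": f"{genus} {species} {rank} {epithet}",
            "identificationQualifier": qualifier,
            "infraspecificEpithet": epithet,
            "taxonRank": rank,
        }
    if option == 2:
        # scientificName holds only the confident part; the rest is qualifier.
        return {
            "scientificName": f"{genus} {species}",
            "identificationQualifier": f"{qualifier} {rank} {epithet}",
            "infraspecificEpithet": "",
            "taxonRank": "species",
        }
    raise ValueError("option must be 1 or 2")


text = "Quercus agrifolia cf. var. oxyadenia"
print(interpret(text, 1))
print(interpret(text, 2))
```

Both outputs keep the same information overall; what changes is which field a consumer has to read to find the uncertain epithet.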

I'm less concerned about which way dwc goes with this, than I am about removing ambiguity and potential inconsistencies -- which appear to exist in this example because different practitioners have interpreted these definitions in different ways.

@mdoering or @timrobertson100 : I wonder if you could do some analysis to see how often a taxonomic name epithet appears in identificationQualifier, and then in such cases, how often it is repeated in specificEpithet or infraspecificEpithet.
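One way such a count could be sketched over a list of records is shown below. Everything here is an assumption for illustration: the `ABBREVIATIONS` set, the heuristic that a lowercase alphabetic token is an epithet, and the sample records. It is not how any existing aggregator performs this analysis.

```python
# Assumed set of undotted abbreviations that are not epithets.
ABBREVIATIONS = {"cf", "aff", "sp", "var", "nov", "indet", "stet", "subsp"}


def epithet_tokens(qualifier):
    """Heuristic: lowercase alphabetic tokens that are not known abbreviations."""
    tokens = [t.lstrip("?") for t in (qualifier or "").split()]
    return [t for t in tokens
            if t.isalpha() and t.islower() and t not in ABBREVIATIONS]


def count_epithets_in_qualifier(records):
    """Return (records with an epithet in the qualifier, of those repeated elsewhere)."""
    in_qualifier = 0
    repeated = 0
    for record in records:
        candidates = epithet_tokens(record.get("identificationQualifier"))
        if not candidates:
            continue
        in_qualifier += 1
        parsed = {record.get("specificEpithet"), record.get("infraspecificEpithet")}
        if any(c in parsed for c in candidates):
            repeated += 1
    return in_qualifier, repeated


# Hypothetical sample records.
sample = [
    {"identificationQualifier": "cf. var. oxyadenia",
     "specificEpithet": "agrifolia", "infraspecificEpithet": "oxyadenia"},
    {"identificationQualifier": "cf. var. oxyadenia",
     "specificEpithet": "agrifolia", "infraspecificEpithet": ""},
    {"identificationQualifier": "cf.",
     "specificEpithet": "agrifolia", "infraspecificEpithet": "oxyadenia"},
    {"identificationQualifier": "",
     "specificEpithet": "furcata", "infraspecificEpithet": ""},
]

print(count_epithets_in_qualifier(sample))  # (2, 1)
```

Tokens ending in a full stop ("cf.", "var.") fail the alphabetic check and are skipped, so the abbreviation set only matters for undotted forms.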

tammyhorton commented 1 year ago

Thanks @deepreef , you have summarised the situation much better than I can. Yes, you have my perspective correct on the 'can be determined' part of dwc:scientificName, your number 2. I feel the same in that this is how I've interpreted it and it makes sense to me, so it seems the most logical way of representing it. But yes, the current definition of dwc:identificationQualifier does indeed state 'a brief phrase or standard term' which does not exactly capture how I am interpreting it.

I agree that we need to think about removing ambiguity and ensuring everyone understands how to use the field, but also about the ongoing use of the data and confidence in it. I think it will be useful to be able to search for a particular taxon and sort according to whether it has an identification qualifier or not, but also be able to easily compare those entries with identification qualifiers.

Jegelewicz commented 1 year ago

I think that for data usage we should be working on confident identifications

While I agree this is a goal, it also then excludes from the data the messy and not-quite-identified things that someone might be interested in working with. I prefer @deepreef's scenario 1, as it provides me the ability to find all of the varieties and the flexibility to remove or modify data with qualifiers if I choose.

In Arctos, all qualifiers are forced to the end of the scientific name as our names are structured in a controlled vocabulary. From my perspective, the qualifiers should probably be accompanied with an explanation, because it is clear to me from this discussion that different groups have different ideas about what the qualifiers mean. Arctos is also in the process of adding an attribute to identifications = identification confidence which is meant to provide more detail about the determiner's confidence in their identification than these codes that have many definitions depending upon who is doing the interpreting. When we start migrating that information to Darwin Core, it will most likely end up in identificationQualifier, concatenated with whatever modifier was also applied to the scientificName.

We also have information in our identifications that doesn't make it into a good place in DwC but probably should. Identification method can also be important when considering how confident you are in an identification that someone else made, but DwC does not include a method for identifications; I'm guessing that generally ends up in identificationRemarks or just not in DwC at all.

there is no such thing as a "certain" identification, except for name-bearing type specimens

It seems like the time might be good for a Task Group to work on this? What happens when you have two determiners making conflicting identifications? Are you forced to choose one or can you provide both with methods used and let the users make their own decisions?

Mesibov commented 1 year ago

When auditing DwC datasets I often see iQ used in odd ways. A recent job had a family name in scientificName and "indet." in iQ, and a genus name in sN and "sp." in iQ. However iQ is defined, there will be compliance problems, so it's a good idea to give lots of examples to reduce these.

I also see what could be valid iQ entries in identificationRemarks, which is meant to have "comments or notes about the identification". This is so broad it entirely contains iQ as a subset.

Mesibov commented 1 year ago

@Jegelewicz writes "What happens when you have two determiners making conflicting identifications? Are you forced to choose one or can you provide both with methods used and let the users make their own decisions?"

The current DwC allows you to pick the latest ID and put any other IDs in previousIdentifications (which for reasons I don't understand is in the Organism class, not the Identification class).

A related problem is "Aus bus or Aus cus". IOW, the identifier is not only confident about the genus, but also confident that the species is either bus OR cus. I'm currently recommending "Aus" in scientificName, "genus" in taxonRank and "either Aus bus or Aus cus" in identificationRemarks for this case, but there might be better solutions.