tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Proposal: basisOfIdentification and identificationConfidence #217

Closed ianengelbrecht closed 8 months ago

ianengelbrecht commented 5 years ago

May I propose these properties be considered for the Identification class? They are described in the Barcode of Wildlife standards. It may be a good idea to separate the concept of confidence out of dwc:identificationQualifier. This article on open nomenclature provides a nice clarification of what aff., cf. and identification certainty are and how they relate to each other.

ianengelbrecht commented 5 years ago

I see that the definition for dwc:nameAccordingTo says 'For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given'. This gives dual meaning to this term depending on what kind of dataset it is. If basisOfIdentification is added as an Identification term these could be separated.

danstowell commented 5 years ago

I'm interested in being able to specify the "confidence" in a detection/identification. I would suggest specifying this as a numerical probability between 0 and 1 inclusive. Although not everyone thinks in terms of probabilities, I hope that format would be the least susceptible to misinterpretation, and would also be usable in further analysis.

(I'm also interested in the ability to express multiple possible species identifications e.g. {"Luscinia megarhynchos": 0.6, "Luscinia luscinia": 0.3}, but perhaps that's outside the scope of this thread?)
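As a rough illustration of the multi-candidate idea above, here is a minimal Python sketch. The function names `validate_candidates` and `best_candidate` are hypothetical and not part of Darwin Core or any existing library; the sketch only assumes that each candidate taxon carries a probability in [0, 1] and that the probabilities need not sum to 1.

```python
# Hypothetical structure: candidate taxa mapped to probabilities for one record.
candidates = {"Luscinia megarhynchos": 0.6, "Luscinia luscinia": 0.3}


def validate_candidates(candidates):
    """Check each probability lies in [0, 1] and the total does not exceed 1.

    Returns the probability mass left over for "some other taxon".
    """
    for name, p in candidates.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name}: probability {p} is outside [0, 1]")
    total = sum(candidates.values())
    if total > 1.0 + 1e-9:  # small tolerance for floating-point error
        raise ValueError(f"probabilities sum to {total}, which exceeds 1")
    return 1.0 - total


def best_candidate(candidates):
    """Return the (name, probability) pair with the highest probability."""
    return max(candidates.items(), key=lambda item: item[1])


remainder = validate_candidates(candidates)
top_name, top_p = best_candidate(candidates)
print(top_name, top_p, round(remainder, 3))
```

Allowing the total to fall below 1 leaves room for "none of the above", which seems closer to how a classifier or an uncertain observer would actually report candidates.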

qgroom commented 5 years ago

Before recommending a term identificationConfidence, I think some research and discussion is needed. How could this be calculated objectively, or should it be a controlled vocabulary? Who should determine it? Which identification does it refer to? Shouldn't it be in the Identification History extension rather than the main part of Darwin Core? identificationQualifier already exists in the extension. So I would recommend that this issue become part of an identifications task group that could address the many issues about identifications in DwC.

tucotuco commented 5 years ago

I agree with Quentin Groom about thinking of this in terms of an extension. I have the same concerns. There is also the term identificationVerificationStatus to consider, which has a bit of overlap with what is being proposed for identificationConfidence.


danstowell commented 5 years ago

I'm fully agnostic about whether Ian's proposal should be in core or extension, since I've not been involved before.

Ian's original proposal mentions "It may be a good idea to separate the concept of confidence out of dwc:identificationQualifier".

I take it the confidence would be determined by the source asserting the overall record (as are the other fields, after all), and would refer to the overall record. (Would love to have confidence on a per-field basis but probably too complex for this standard.)

If "controlled vocabulary" is the consensus position, I'd hope for a vocabulary that can be mapped onto probabilities. A good example may be the IPCC vocabulary about uncertainties - see Table 1 in this PDF.
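To make the "vocabulary that maps onto probabilities" idea concrete, here is a small Python sketch. The ranges are the likelihood terms as I recall them from the IPCC AR5 uncertainty guidance note and should be checked against Table 1 in the linked PDF before being relied on; the names `IPCC_LIKELIHOOD`, `probability_range` and `terms_for` are hypothetical.

```python
# Assumed likelihood scale (term -> inclusive probability range).
# Verify these ranges against the IPCC table before using them.
IPCC_LIKELIHOOD = {
    "virtually certain": (0.99, 1.00),
    "very likely": (0.90, 1.00),
    "likely": (0.66, 1.00),
    "about as likely as not": (0.33, 0.66),
    "unlikely": (0.00, 0.33),
    "very unlikely": (0.00, 0.10),
    "exceptionally unlikely": (0.00, 0.01),
}


def probability_range(term):
    """Map a vocabulary term to its (low, high) probability range."""
    return IPCC_LIKELIHOOD[term.strip().lower()]


def terms_for(probability):
    """List every term whose range contains the given probability."""
    return [
        term
        for term, (low, high) in IPCC_LIKELIHOOD.items()
        if low <= probability <= high
    ]


print(probability_range("Very likely"))
print(terms_for(0.95))
```

Note that the ranges nest, so a single probability can match more than one term. A vocabulary adopted for identifications would need a rule for picking the narrowest applicable term.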

ianengelbrecht commented 5 years ago

My feeling is that we should aim to model current practice with identifications while at the same time promoting good practice. The article mentioned in the original post describes nicely the (potential) dichotomy between what is meant by cf. and the proposed dwc:identificationConfidence. Briefly, they refer to different forms of uncertainty that might exist in an identification.

The first is uncertainty about whether a particular specimen fits into a taxon concept. The person doing the identification is confident in their knowledge of the taxonomy for the group and the taxa are quite clearly defined, but the specimen doesn’t quite fit a known taxon. The specimen is also not clearly something new (in which case the qualifier ‘aff.’ would be used, and there is no uncertainty).

On the other hand, dwc:identificationConfidence should represent the uncertainty as a result of limited knowledge of the identifier. It’s the equivalent of a question mark after a taxon name. An example might be that I look at a theraphosid spider that I think might be Ceratogyrus pillansi. The type specimen is lost, and the type locality is imprecise (Rhodesia), so I write on the det. label ‘Ceratogyrus pillansi?’. Changing species to something where the taxon concept is clearer, having two separate terms would allow for something like ‘Ceratogyrus aff. darlingi?’, which is equivalent to ‘Ceratogyrus cf. darlingi’. We could also have ‘Ceratogyrus cf. darlingi?’. I’ve never seen this, but it would equate to ‘Ceratogyrus darlingi?’.

In practice we see a fair number of both cf. and ? on specimen labels during data capture. In discussions with taxonomic experts the response to using two different indicators of uncertainty has been mixed, and seems to depend on the discipline and the preferences of the individual.
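A minimal Python sketch of how the two indicators could be captured separately from a det. label during data capture. This is only an illustration under simple assumptions (the qualifier is exactly ‘cf.’ or ‘aff.’, and uncertainty is a trailing ‘?’); the function `parse_det_label` and the output keys are hypothetical, not Darwin Core terms.

```python
QUALIFIERS = ("cf.", "aff.")


def parse_det_label(label):
    """Split a det. label into name, open-nomenclature qualifier, and a
    binary confidence flag (False when the label ends with '?')."""
    text = label.strip()
    uncertain = text.endswith("?")
    if uncertain:
        text = text[:-1].rstrip()

    qualifier = None
    name_tokens = []
    for token in text.split():
        if qualifier is None and token in QUALIFIERS:
            qualifier = token  # keep only the first qualifier found
        else:
            name_tokens.append(token)

    return {
        "name": " ".join(name_tokens),
        "qualifier": qualifier,
        "confident": not uncertain,
    }


for label in ("Ceratogyrus pillansi?", "Ceratogyrus cf. darlingi",
              "Ceratogyrus aff. darlingi?"):
    print(parse_det_label(label))
```

Keeping the qualifier and the confidence flag in separate fields is what lets a label such as ‘Ceratogyrus aff. darlingi?’ be recorded without losing either piece of information.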

If identificationConfidence were to be adopted, my own feeling is that it should only ever be binary (confident or not confident) or a probability based on a valid quantitative analysis, along the lines of @danstowell’s suggestion above. What I feel MUST be avoided is a list of ordinal levels of certainty. I’ve used these in existing applications (iSpot being one) and even implemented it in my own databases in the past. All you end up with is people confused as to whether they are ‘certain’, ‘very certain’, or ‘highly certain’ about their identifications.

tucotuco commented 5 years ago

Ian Engelbrecht, does identificationVerificationStatus (http://rs.tdwg.org/dwc/terms/#dwc:identificationVerificationStatus) not cover the same concept as the proposed identificationConfidence?


ianengelbrecht commented 4 years ago

Apologies for the delay in responding. My feeling is not. Verification should be a separate process to making an identification. The workflow should be that one person identifies a specimen, and someone else, ideally with better knowledge of the taxon or more experience, verifies that identification or corrects it with their own, new, identification. I indicated a similar line of thinking for georeferenceVerificationStatus, and proposed georeferenceVerifiedBy and georeferenceVerifiedDate properties there (the equivalents for identification verification would be required too). A person shouldn't be able to verify their own identifications (or georeferences), unless perhaps by another method, such as confirmation of a morphological identification using molecular data. A real-world example of identification verification is iSpot, which has an 'I agree with this ID' option on its identification form, which is only available to others, and iSpot records who agrees with an identification. By contrast, I don't know of any collections databases that implement identification verifications.

ianengelbrecht commented 4 years ago

An alternative term for identificationConfidence may be identificationCertainty

ianengelbrecht commented 4 years ago

Regarding additional fields for ...verifiedBy and ...verifiedDate, an alternative might be to simply record all of that information in ...verificationStatus.

ianengelbrecht commented 4 years ago

For basisOfIdentification:

Definition: The method, tool, or rationale used in identifying the specimen.

Comments: Recommended best practice is to use a controlled vocabulary.

Examples: 'tacit expertise', 'field guide', 'key', 'DNA' [or perhaps 'BLAST' or other algorithm used], 'type material for taxon', 'compared with type material', 'compared with non-type material'.

nielsklazenga commented 4 years ago

I think this is interesting, but it all depends on it having a good vocabulary, otherwise it is better to just use identificationRemarks.

I use phrases like 'field det.' and 'duplicate det.' (I work on mosses and have found that "duplicates" do not necessarily belong to the same species). With more and more specimen images becoming available online, virtual determinations based on images have also become a thing.

And then there is AI of course.

It would be nice to have a term like this. For me the value would mostly be 'morphology', with the detail going in identificationRemarks.

quarrying commented 1 year ago

identificationConfidence used with identificationConfidenceType (in the same way that dwc:organismQuantity is used with dwc:organismQuantityType) could make it somewhat self-explanatory and avoid the need for a single clear definition or objective calculation.

The possible identificationConfidenceType values include:

1) Two-level: identificationConfidence could be one of {Unsure, Sure} or {0, 1}, which could be decided and marked by the user easily.
2) Three-level: identificationConfidence could be one of {High, Medium, Preliminary}, as shown in https://bwp-informatics.readthedocs.io/en/latest/bwp_data_standard.html, or {0, 1, 2}, which could be decided and marked by the user easily.
3) Probability: identificationConfidence could be a continuous numerical value in [0, 1], which is often generated by an AI algorithm. The probabilities generated by different algorithms have different meanings and are not comparable, so we can use a different identificationConfidenceType for each to tell them apart.
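A minimal Python sketch of how a value/type pair like this could be validated. The type labels "two-level", "three-level" and "probability" are hypothetical strings chosen for this illustration, as is the function `is_valid_confidence`; none of them are defined by Darwin Core.

```python
# Hypothetical identificationConfidenceType labels and their allowed values.
CONFIDENCE_TYPES = {
    "two-level": {"Unsure", "Sure", 0, 1},
    "three-level": {"High", "Medium", "Preliminary", 0, 1, 2},
}


def is_valid_confidence(value, confidence_type):
    """Return True if value is permitted for the given confidence type."""
    if confidence_type == "probability":
        # Any real number in [0, 1]; booleans are rejected explicitly
        # because bool is a subclass of int in Python.
        return (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and 0.0 <= value <= 1.0
        )
    allowed = CONFIDENCE_TYPES.get(confidence_type)
    if allowed is None:
        raise ValueError(f"unknown confidence type: {confidence_type!r}")
    return value in allowed


print(is_valid_confidence("Sure", "two-level"))
print(is_valid_confidence(0.87, "probability"))
print(is_valid_confidence("High", "two-level"))
```

Because the value is only interpretable together with its type, a consumer would need to read both fields before comparing confidence across records.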

tucotuco commented 8 months ago

Closing for lack of evidence of demand.

danstowell commented 8 months ago

By the way, Camtrap-DP has a key "classificationProbability" which plays a similar role to identificationConfidence. For more background on the discussion, see Camtrap-DP issue 170 and Camtrap-DP issue 217.

(See also related discussion OBIS issue 209 which I wasn't aware of until just now.)