tdwg / camtrap-dp

Camera Trap Data Package (Camtrap DP)
https://camtrap-dp.tdwg.org
MIT License

How to express classification confidence? #170

Closed peterdesmet closed 1 year ago

peterdesmet commented 3 years ago

Not quite sure what to make of your classificationConfidence. Is the intent to:

I'm assuming it's more like the first or second one & not the last one. That last one illustrates that the use of the word 'classification' is probably best avoided because it's a loaded term. Or, does your use of classification mean categorization as in, "This is a female cougar and cubs vs. this is a lone male"?

Either way, I see your term is a probability whereas identificationVerificationStatus is meant to be categorical. The first two items above for classificationConfidence illustrate that it's a slippery concept. I suppose you'd have to put yourself in the shoes of a user. What would a confidence of 0.65 vs 0.6 mean to a user? Who establishes the criteria for how to construct a value? Is it sample-based? If so, what happens to this probability if the data are pooled with other datasets? Is it an algorithm or a human that constructs it? Would there be any loss of information to a user of the data if instead of classificationConfidence you used identificationVerificationStatus with categorical data or a mere boolean (i.e. "Yep, a human looked at it." vs "Nope, a human has not looked at it.")?

Originally posted by @dshorthouse in https://github.com/tdwg/camtrap-dp/issues/169#issuecomment-913692929

peterdesmet commented 3 years ago

@dshorthouse

  1. It is the determiner's own confidence in the identification (the determiner, human or AI, is identified in classifiedBy). AI can specify this pretty accurately; for humans the percentage is more arbitrary, but it's generally not set by the user. E.g. in the management software we have, we provide the AI value as-is, but for humans we use 0.5 if they marked the boolean field uncertain and 1 if the classification was verified by a human.
  2. I personally think it's not a good equivalent of identificationVerificationStatus, if only because one is a probability and the other is categorical; they are different concepts.
  3. Regarding the word classification: it is a loaded term indeed and one mostly coming from the machine learning world, but with other connotations in biodiversity. Feedback welcome on the term in https://github.com/tdwg/camtrap-dp/issues/164

Originally posted by @peterdesmet in https://github.com/tdwg/camtrap-dp/issues/169#issuecomment-913750488
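The rule described in point 1 above could be sketched like this (the function and argument names are hypothetical; only the pass-through of the AI score and the 0.5/1.0 convention for humans come from the comment):

```python
def classification_confidence(method, uncertain=False, ai_confidence=None):
    """Hypothetical sketch of the confidence rule described above:
    pass the AI score through as-is; for humans, use 0.5 if they
    marked the boolean 'uncertain' flag, and 1.0 if verified."""
    if method == "machine":
        return ai_confidence  # AI value provided as-is
    return 0.5 if uncertain else 1.0  # human classification
```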

peterdesmet commented 3 years ago
  1. It is the determiner's own confidence in the identification (the determiner, human or AI, is identified in classifiedBy). AI can specify this pretty accurately; for humans the percentage is more arbitrary, but it's generally not set by the user. E.g. in the management software we have, we provide the AI value as-is, but for humans we use 0.5 if they marked the boolean field uncertain and 1 if the classification was verified by a human.

Aha! There could be two activities at play here, expressed in the same field. The first is if the original determiner self-declares their uncertainty and the second is if someone else indicates their agreement (or dissent?) with the determiner. I'm assuming verification does not always mean agreement. What if there are several of the second kind, much like in iNaturalist? This does indeed get us into messy annotation space.

Regardless of whether or not it's an AI that made the original determination, these are distinct actions that attempt to convey trustworthiness. Should these be split such that downstream users are better informed when deciding what to toss and what to use? The question is whether all this is merely noise or whether it makes the data more transparent and powerful. What does a user of the data expect to do when presented with any value in classificationConfidence?

Originally posted by @dshorthouse in https://github.com/tdwg/camtrap-dp/issues/169#issuecomment-913771093

peterdesmet commented 3 years ago

the second is if someone else indicates their agreement (or dissent?) with the determiner.

The field is intended for self-declaration, not to judge the identification of others. In the case of multiple (conflicting) identifications, only one would be exported to Camtrap DP, typically the one with the highest confidence (as defined by the system), e.g. AI < volunteer < expert validation.

I see your point though: maybe a boolean certain vs uncertain conveys clearer information to the user on what to use (but note that many of the observations will have an empty classificationConfidence). It is then up to the data publisher to decide what AI confidence can be considered certain. That is a loss of information, but the publisher can probably judge better than the data user.
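The export rule described here could be sketched as follows (the field values and precedence encoding are hypothetical; the ranking AI < volunteer < expert is the one given in the comment):

```python
# Hypothetical precedence for conflicting identifications, as described
# above: AI < volunteer < expert validation.
PRECEDENCE = {"ai": 0, "volunteer": 1, "expert": 2}

def pick_identification(identifications):
    """Return the single identification to export to Camtrap DP:
    the one with the highest system-defined confidence."""
    return max(identifications, key=lambda i: PRECEDENCE[i["classified_by"]])

ids = [
    {"classified_by": "ai", "scientific_name": "Vulpes vulpes"},
    {"classified_by": "volunteer", "scientific_name": "Vulpes lagopus"},
]
exported = pick_identification(ids)  # the volunteer identification wins
```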

rondlg commented 3 years ago

Indicating AI versus human is really useful, so it would seem that identificationVerificationStatus is a good place for that. Not to put the cat amongst the pigeons though, there is also identificationQualifier:

https://dwc.tdwg.org/list/#dwc_identificationQualifier

I bet if data were pulled from this field you would find things like aff. and cf., but also possible, maybe, ?, uncertain, certain, etc.

peterdesmet commented 3 years ago

Hi @rondlg, human vs machine is useful, which is why we have a dedicated field for it: classification_method.

Note that we are looking for "equivalents" in Darwin Core, i.e. this Camtrap DP field is the same concept as this field in Darwin Core. Although I think that identificationVerificationStatus and identificationQualifier are reasonable fields to map data to, I wouldn't consider them the same concepts as the Camtrap DP fields. Would you agree?

rondlg commented 3 years ago

Yup (my bad), I see classification_method now.

I'd actually say that identificationVerificationStatus and identificationQualifier are the same concept, but not a hill I need to die on ;)

dshorthouse commented 3 years ago

Might be something here: https://doi.org/10.1093/database/bav043

danstowell commented 2 years ago

I work on AI methods so "classification confidence" is a salient issue for me. I agree that self-declared confidence is the main point. Allow me to respond to this discussion thread, and to make a minor edit suggestion.

"What does a user of the data expect to do when presented with any value in classificationConfidence?"

IMHO there are two main use-cases:

  1. Thresholding: for dataset X, downstream user A wants to use all detections, while downstream user B wants to use only those with high confidence (e.g. p>0.9). This is a very likely situation.
  2. Using confidences as weighting values e.g. for aggregation: basic distribution maps can be produced by summing all the detection probabilities in each cell, for example. It's not the most sophisticated way to use the data, but it's numerically meaningful if the confidences are well-calibrated probabilities, and useful for basic reporting.
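The two use-cases above can be sketched with a few lines of code (the observation data and grid-cell labels are illustrative, not from any real dataset):

```python
from collections import defaultdict

# Illustrative detections as (grid_cell, classification_confidence) pairs
observations = [("A1", 0.95), ("A1", 0.40), ("B2", 0.80), ("B2", 0.99)]

# Use-case 1: thresholding, keep only high-confidence detections (p > 0.9)
high_confidence = [(cell, p) for cell, p in observations if p > 0.9]

# Use-case 2: weighting, sum detection probabilities per grid cell
# to build a basic distribution map
density = defaultdict(float)
for cell, p in observations:
    density[cell] += p
```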

Both of these are covered well if confidences are expressed as probabilities. I appreciate that it's hard to get true probabilities from manual (human) annotations, and that there are other approaches (e.g. categorical or rank-based).

In my humblest opinion: (a) manual annotations should typically not come with confidences expressed as probabilities (unless they've actually been estimated by some procedure), but merely with attribution of which person/project/institution did the annotation; (b) AI-derived annotations should be strongly encouraged to include confidences expressed as probabilities.

All that said, the current text under https://tdwg.github.io/camtrap-dp/data/#observations.classificationconfidence is good enough. However, may I propose to change "Provide an approximate value for human classifications" to "For human classifications, omit this field (in CSV, an empty string) or use an approximate value if available".
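The proposed wording (omit the field, i.e. an empty string in CSV, or provide an approximate value) implies a simple validity check per cell, sketched here as a hypothetical validator rather than part of any Camtrap DP tooling:

```python
def valid_confidence_cell(value):
    """Check one classificationConfidence cell read from a CSV:
    either an empty string (field omitted, e.g. for a human
    classification) or a probability between 0 and 1."""
    if value == "":
        return True
    try:
        p = float(value)
    except ValueError:
        return False
    return 0.0 <= p <= 1.0
```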

peterdesmet commented 1 year ago

This has been discussed and addressed in https://github.com/tdwg/camtrap-dp/pull/208. classificationProbability is now defined as:

Degree of certainty of the (most recent) classification. Expressed as a probability, with 1 being maximum certainty. Omit or provide an approximate probability for human classifications.

I think this addresses the points raised here.