tdwg / camtrap-dp

Camera Trap Data Package (Camtrap DP)
https://camtrap-dp.tdwg.org
MIT License

classificationMethod Terms #224

Closed ben-norton closed 1 year ago

ben-norton commented 2 years ago

I strongly support the inclusion of a classificationMethod term. Use cases are more complicated than a simple choice between human and machine. A growing number of observations are produced by a combination of the two. Computer vision models that filter out blanks perform substantially better than animal classifiers, especially at the global or continental level. Models that place bounding boxes around objects of interest perform equally well. So a single observation can result from a multi-step process that involves both humans and machines. If both values apply, which one should be selected?

peterdesmet commented 2 years ago

If bounding boxes (#219) were provided by a model and the scientificName by a human, then I guess you could express both in classifiedBy, but I typically favour retaining the classifiedBy that provided the highest classificationConfidence.

What information would you provide in a classificationMethod and what use cases would it solve?

kbubnicki commented 2 years ago

I think the same logic as proposed in #225 should also apply to classificationMethod e.g. machine | human

peterdesmet commented 2 years ago

Note that this field has a controlled vocab. Do we drop that then? Or do we recommend only populating the latest classification (no pipes)? Or is it better to have 2 observations (with different classificationTimestamp values)?
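To make the trade-off concrete, here is a minimal sketch of why pipe-separated values conflict with a controlled vocabulary. It assumes the vocabulary is exactly `human` and `machine`, as described in this thread; the helper names `is_valid_method` and `split_piped` are hypothetical and not part of Camtrap DP or any library.

```python
# Assumed controlled vocabulary for classificationMethod (per this thread).
ALLOWED_METHODS = {"human", "machine"}


def is_valid_method(value):
    """Return True only if the value is exactly one term from the vocabulary."""
    return value in ALLOWED_METHODS


def split_piped(value):
    """Split a pipe-separated value into its parts (hypothetical helper)."""
    return [part.strip() for part in value.split("|")]


if __name__ == "__main__":
    # A single term passes the vocabulary check.
    print(is_valid_method("machine"))
    # A piped value is not itself a vocabulary term, so the check fails.
    print(is_valid_method("machine | human"))
    # Each part is valid on its own, but a consumer would have to split first.
    print(all(is_valid_method(p) for p in split_piped("machine | human")))
```

Allowing pipes would therefore mean either dropping the simple membership check or requiring every consumer to split values before validating them.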

kbubnicki commented 2 years ago

You are right, then we would have to drop the controlled vocabulary, which is maybe not such a terrible idea in this case (until we come up with a better solution). Having 2 observations (or more) is already an option (btw very useful for testing AI models), but then we still do not know whether they were made independently or were "chained" (machine -> human).

peterdesmet commented 2 years ago

Wouldn’t you be able to infer chaining from the classificationTimestamp of the 2 obs? I’m a bit reluctant to throw away a vocab 😊

kbubnicki commented 2 years ago

Not really, as 2 obs with different timestamps can still have been classified independently rather than "chained" together. By chaining I mean e.g. the following case: a human expert verifies a machine classification. But maybe cases like that are too specific for a data exchange standard?
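The ambiguity described here can be sketched as follows. The field names classificationMethod, classifiedBy and classificationTimestamp come from this thread; the `saw_machine_output` flag, the helper functions, and all values (model name, person, timestamps) are hypothetical and exist only for illustration. The sketch assumes that only the three named fields are exchanged.

```python
from datetime import datetime

# Fields assumed to be exchanged; names taken from this thread.
EXPORTED_FIELDS = ("classificationMethod", "classifiedBy", "classificationTimestamp")


def export(step):
    """Keep only the exchanged fields, dropping workflow-internal details."""
    return {field: step[field] for field in EXPORTED_FIELDS}


def ordered(steps):
    """Sort classification records chronologically by classificationTimestamp."""
    return sorted(
        steps, key=lambda s: datetime.fromisoformat(s["classificationTimestamp"])
    )


machine_step = {
    "classificationMethod": "machine",
    "classifiedBy": "ExampleModel v1",  # hypothetical model name
    "classificationTimestamp": "2023-01-10T08:00:00+00:00",
    "saw_machine_output": False,  # hypothetical internal flag
}


def human_step(saw_machine_output):
    """Build a later human classification, chained or independent."""
    return {
        "classificationMethod": "human",
        "classifiedBy": "Jane Doe",  # hypothetical person
        "classificationTimestamp": "2023-01-12T14:30:00+00:00",
        "saw_machine_output": saw_machine_output,
    }


# Two different workflows: the human either verified the machine output or not.
independent = [export(s) for s in ordered([human_step(False), machine_step])]
chained = [export(s) for s in ordered([human_step(True), machine_step])]

if __name__ == "__main__":
    # The timestamps give an order, but the exported records are identical.
    print(independent == chained)
```

Under these assumptions the timestamps establish which classification came later, but nothing in the exchanged records distinguishes a verification from an independent second opinion.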

peterdesmet commented 2 years ago

> But maybe cases like that are too specific for a data exchange standard?

Imo yes. I would suggest:

@kbubnicki would that be ok for you?

kbubnicki commented 2 years ago

The solution is a bit anthropocentric, i.e. it assumes that humans always perform better than machines ;)

> @kbubnicki would that be ok for you?

Yes, I think this is a good compromise!

MikeTrizna commented 2 years ago

I suggest adding something to the definition for "classificationMethod" to point to the connection with "classifiedBy", and vice versa.

I was scanning through the terms during the November webinar to see how to list which AI model was used to determine a species, and was surprised to see that classificationMethod was simply an enum of human or machine. I stopped right there and went to the Issues to find a discussion about that, before I even realized that there was a following classifiedBy term. I admit this was lazy reading on my part, but I can imagine other users missing the connection as well.

peterdesmet commented 1 year ago

See https://github.com/tdwg/camtrap-dp/issues/225#issuecomment-1420784370, we won't allow multiple (| separated) values for the classification terms.

@MikeTrizna the definition for classificationMethod has been updated from:

Classification method.

To:

Method (most recently) used to classify the observation.

I'd prefer not to reference classifiedBy in that definition, because it opens the door to having to reference the other classification terms too... as well as cross-referencing many other related terms in their definitions. 😅
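As an illustration of the updated definition, a data publisher with an internal classification history could populate the single-valued terms from the most recent step. This is a minimal sketch under that assumption: the `history` list, the `most_recent` helper and all values are hypothetical, and only the term names come from this thread.

```python
from datetime import datetime


def most_recent(steps):
    """Return the classification step with the latest classificationTimestamp."""
    return max(
        steps, key=lambda s: datetime.fromisoformat(s["classificationTimestamp"])
    )


# Hypothetical internal history for one observation: machine first, human later.
history = [
    {
        "classificationMethod": "machine",
        "classifiedBy": "ExampleModel v1",  # hypothetical model name
        "classificationTimestamp": "2023-01-10T08:00:00+00:00",
    },
    {
        "classificationMethod": "human",
        "classifiedBy": "Jane Doe",  # hypothetical person
        "classificationTimestamp": "2023-01-12T14:30:00+00:00",
    },
]

latest = most_recent(history)

# Only the most recent step is published; no pipe-separated values are needed.
observation = {
    "classificationMethod": latest["classificationMethod"],
    "classifiedBy": latest["classifiedBy"],
    "classificationTimestamp": latest["classificationTimestamp"],
}

if __name__ == "__main__":
    print(observation["classificationMethod"])
```

This keeps classificationMethod within the controlled vocabulary, at the cost of not recording the earlier steps in the exchanged observation.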