tdwg / dwc-for-biologging

Darwin Core recommendations for biologging data
Creative Commons Attribution 4.0 International

What or who is the agent? #29

Open dshorthouse opened 3 years ago

dshorthouse commented 3 years ago

I see that scientificName is a required term. But I do not see either recordedBy or identifiedBy. I assume that many determination events will be executed by a human sometime after a captured event but that other determination events will happen in near real-time by a trained machine. Should you require scientificName without also requiring an agent who/that made the assertion? Is this meant to be captured in samplingProtocol and will content there be sufficiently machine-readable so as to differentiate determinations made by a human from those made by a machine?

jdpye commented 3 years ago

A great point. For the great majority of my own experience, the determination was made by the individual doing the tagging, but we are certainly not being as explicit as we could be! samplingProtocol also feels wrong for this. I had envisioned using that field to describe things like the tag attachment method or any specifics of which animals were being selected for study (if your study was only on juvenile males, for example). I am warm to the idea of using identifiedBy; it seems like the exact thing for this purpose. The guidance seems to be to use names, but I see no reason why algorithms, or even a 'machine determined' string, couldn't work.

albenson-usgs commented 3 years ago

The primary way we are differentiating human observations versus machine observations is by using basisOfRecord. See the explanation for basisOfRecord in the wiki: "A single record in the Occurrence Core will designate basisOfRecord:HumanObservation to delineate the point in time that the animal was in hand. Future detections of the animal by stations will be designated basisOfRecord:MachineObservation. This is one of the primary ways (combined with organismID) a user would know that the multiple observations are actually multiple detections of the same animal."
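To illustrate how a data user might apply that convention, here is a minimal Python sketch. It groups occurrence records by organismID and splits them by basisOfRecord, so the in-hand capture and the later station detections of the same animal end up together. The records, the identifiers, and the function name `group_by_organism` are all made up for illustration; only the term names come from Darwin Core.

```python
from collections import defaultdict

# Hypothetical records: one capture (animal in hand) per animal,
# plus later detections of animal-A by receiver stations.
records = [
    {"occurrenceID": "occ-1", "organismID": "animal-A", "basisOfRecord": "HumanObservation"},
    {"occurrenceID": "occ-2", "organismID": "animal-A", "basisOfRecord": "MachineObservation"},
    {"occurrenceID": "occ-3", "organismID": "animal-A", "basisOfRecord": "MachineObservation"},
    {"occurrenceID": "occ-4", "organismID": "animal-B", "basisOfRecord": "HumanObservation"},
]


def group_by_organism(rows):
    """Map organismID -> basisOfRecord -> list of occurrenceIDs."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["organismID"]][row["basisOfRecord"]].append(row["occurrenceID"])
    # Convert the nested defaultdicts to plain dicts for predictable lookups.
    return {org: dict(by_basis) for org, by_basis in grouped.items()}


summary = group_by_organism(records)
for organism, by_basis in summary.items():
    print(organism, by_basis)
```

Note that this only tells a user that a record was a human or a machine observation, not which person or which machine made it, which is the gap the original question points at.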

dshorthouse commented 3 years ago

@jdpye @albenson-usgs Gotcha re: basisOfRecord. Might you also want a more explicit way to state who the human was, or what machine made the later assertions? Admittedly, these machine-based assertions feel like splitting hairs.

Antonarctica commented 3 years ago

And basisOfRecord is also a required field. But I always try to convince people to fill in identifiedBy with the person or algorithm that did the identification.

albenson-usgs commented 3 years ago

@dshorthouse I guess I'm not sure that someone would NEED to know who captured the animal for conducting downstream analyses. I do agree it's nice to have if you can get it but I don't know that not having it prevents future work from happening? I'm not a biologging expert though so hopefully @peggynewman or maybe @sarahcd can confirm. I can see how you might need to know what machine made the observation because it might help with uncertainty (maybe?). But that information might be better laid out in something like sampling protocol.

albenson-usgs commented 3 years ago

Actually after considering this further, I could see the case for making identifiedBy strongly recommended and using it the way @jdpye suggested might help make it even clearer which observations are human ones and which are machine ones.
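As a rough sketch of what "strongly recommended" could look like in a validation step: required terms produce errors, while a missing identifiedBy only produces a warning. The split into `REQUIRED` and `STRONGLY_RECOMMENDED`, and the function `check_record`, are assumptions made for illustration; they are not part of any existing validator or of the published recommendations.

```python
# Terms named as required in this thread.
REQUIRED = ("scientificName", "basisOfRecord")
# Hypothetical status for identifiedBy, as floated above.
STRONGLY_RECOMMENDED = ("identifiedBy",)


def check_record(record):
    """Return missing required terms as errors and missing recommended terms as warnings."""
    def missing(term):
        # Treat absent, None, and whitespace-only values as missing.
        return not str(record.get(term) or "").strip()

    return {
        "errors": [term for term in REQUIRED if missing(term)],
        "warnings": [term for term in STRONGLY_RECOMMENDED if missing(term)],
    }


# Hypothetical record with a name but no stated agent.
report = check_record({
    "scientificName": "Aptenodytes forsteri",
    "basisOfRecord": "MachineObservation",
})
print(report)
```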

peggynewman commented 3 years ago

Interesting. I agree with the approach that recordedBy and identifiedBy in biologging data go alongside the Human Observation record, and that broadly our approach is to group by organismID. We're likely to see these fields used more in repositories thanks to the kind of work that @dshorthouse is doing with Bionomia. For biologging, however, it's the machine that's doing the observing, not recording or identifying, so that information doesn't belong in those fields. We are describing the machine capture mechanisms in the Event and MoF (MeasurementOrFact). An interesting differentiation might be a camera trapping project, which would have a separate process of recording and then identification.

dshorthouse commented 3 years ago

@peggynewman In the context of biologging, it doesn't make much sense to ascribe credit for effort as might be assumed in the spirit of recordedBy or identifiedBy, which is (partly) what Bionomia tries to accomplish. However, the other intent of these terms is a bit more subtle. If we accept that the identity of an occurrence is subject to external slippage in taxon concept, then we need a safeguard to confidently assert alignment with a future concept. A naked scientificName without a corresponding statement of what resource was used (or who/what identified it as such) at a particular time and place will experience an intractable dissociation from future taxon concepts. I assume that biologging data has implications for conservation policy now and long into the future, but taxon concepts themselves are a moving target. In the majority of cases, I'm willing to bet that organisms that are tracked in your projects have relatively stable taxon concepts and there isn't much conflict. What I write here is undoubtedly overkill and immaterial...but this is true only for our present, small window of time.

Antonarctica commented 3 years ago

Taking from other best practices: a scientificName should be linked to a scientificNameID, which is defined as "An identifier for the nomenclatural (not taxonomic) details of a scientific name." This gives some protection against slippage, e.g. if the scientific name changes, the accepted and unaccepted names can be linked. The best practice is to use a globally unique identifier, for instance a Life Sciences Identifier (LSID). For marine species we use the World Register of Marine Species (WoRMS). For instance, Aptenodytes forsteri Gray, 1844 can be found here: http://marinespecies.org/aphia.php?p=taxdetails&id=225773. The id at the end is the AphiaID that WoRMS uses, and it matches this LSID: urn:lsid:marinespecies.org:taxname:225773. Other taxonomic backbones can be used.
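The relationship between the WoRMS URL, the AphiaID, and the LSID in that example can be sketched in a few lines of Python. The function names `aphia_id_from_url` and `aphia_lsid` are made up for illustration, and the sketch assumes only the URL and LSID patterns shown above; it does not call any WoRMS service.

```python
from urllib.parse import parse_qs, urlparse


def aphia_id_from_url(url):
    """Extract the AphiaID from the 'id' query parameter of a WoRMS taxdetails URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["id"][0])


def aphia_lsid(aphia_id):
    """Build the LSID string in the pattern quoted above."""
    return f"urn:lsid:marinespecies.org:taxname:{aphia_id}"


url = "http://marinespecies.org/aphia.php?p=taxdetails&id=225773"
aphia_id = aphia_id_from_url(url)
record = {
    "scientificName": "Aptenodytes forsteri Gray, 1844",
    "scientificNameID": aphia_lsid(aphia_id),
}
print(record["scientificNameID"])  # urn:lsid:marinespecies.org:taxname:225773
```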

jdpye commented 3 years ago

Oh, we've had a couple of 'fun' taxon shakeups with manta rays and Atlantic torpedo/Tetronarce, as well as some ambiguous identifiers with things like sixgill/sevengill sharks. At my institution we run things through marinespecies.org and go back to the researcher with any discrepancies from their field reporting, and we identify the marinespecies.org entries as the authority, as @Antonarctica has detailed. (We also track cases where the researcher is adamant that the taxonomic database has it wrong, though I don't know what to do with this information yet!) This grants us some ability to cross-reference via TSN and AphiaID, now and in the future.

albenson-usgs commented 3 years ago

@jdpye reach out to WoRMS on the cases where the researcher is not in agreement (info at marinespecies.org). They are really responsive and helpful.

jdpye commented 3 years ago

Definitely, they're great at accepting new colloquial names, and I feel like the marine/brackish/fresh distinctions are maybe a little bit my fault because I made them add American alligators once upon a time.

peggynewman commented 3 years ago

I agree, a scientificNameID belongs with scientificName. In the situation where an algorithm has provided the species identification, I've been thinking that is more MoF lines. Is a persistent identifier for an algorithm a DOI on a publication or are there other options?

danstowell commented 3 years ago

Hi all. I'm trying to choose a clear way to indicate which software algorithm provided a taxon ID. I agree with a comment above by @jdpye that "identifiedBy" seems appropriate, though it would need its definition changing to encompass machines (not just people or groups) as the agent. On this question, there's a lot more discussion in the Attribution group here: https://github.com/tdwg/attribution/issues/38

jdpye commented 3 years ago

That is a great discussion, @danstowell , thanks for that link! I think we could potentially 'get away with' a lot because of the 'freetext identifiers separated by pipes' nature of the field in HumanObservations, but I'd love to see the MachineObservation side of things make use of that field. If we did that, you're absolutely right, definitions would need a bit of updating. For algorithms/implementations of identifying software in order to be complete we'd be looking to record a program name/version number, or better, a git URI and commit hash.
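To make the pipe-separated idea concrete, here is a minimal Python sketch that splits an identifiedBy value and picks out entries that look like software agents. The `<git URI>@<commit hash>` notation is purely a hypothetical convention invented for this example, as are the function name and the example URI; nothing here is an agreed Darwin Core or TDWG format.

```python
import re

# Hypothetical convention: a software agent is written as "<http(s) URI>@<commit hash>".
SOFTWARE_AGENT = re.compile(r"^(?P<uri>https?://\S+)@(?P<commit>[0-9a-f]{7,40})$")


def parse_identified_by(value):
    """Split a pipe-separated identifiedBy string into typed agent entries."""
    agents = []
    for part in value.split("|"):
        name = part.strip()
        if not name:
            continue  # skip empty segments such as a trailing pipe
        match = SOFTWARE_AGENT.match(name)
        if match:
            agents.append({"type": "software", "uri": match["uri"], "commit": match["commit"]})
        else:
            agents.append({"type": "person_or_group", "name": name})
    return agents


# Made-up value: one human identifier and one algorithm pinned to a commit.
value = "Jane Doe | https://example.org/hypothetical/classifier.git@3f2a9c1"
agents = parse_identified_by(value)
for agent in agents:
    print(agent)
```

Pinning to a commit hash rather than a bare program name is what would let a later reader work out exactly which version of the algorithm made the determination.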

What I haven't done yet, and can do, is look through some of the other tdwg communities and their conversations to find out what other determinations have been handed down on this specific subject in the past.