tdwg / attribution

Joint TDWG/RDA group on metadata standards for attribution of physical and digital collection stewardship

property:identificationID #39

Open dshorthouse opened 3 years ago

dshorthouse commented 3 years ago

identificationID (property)

Definition: An identifier for the Identification (the body of information associated with the assignment of a scientific name). May be a global unique identifier or an identifier specific to the data set. The intent of this term is to maintain the relational integrity between the concept:identified action and a particular determination/annotation expressed in the Identification History Extension when present.
Existing property: identificationID
Existing namespace: http://rs.tdwg.org/dwc/terms/
Existing property identifier: http://rs.tdwg.org/dwc/terms/identificationID
Format: string
Required: no
Constraints:
Examples: 9992
Notes:

debpaul commented 3 years ago

Hm. Any way to avoid using "identifier" and Identification together in the definition?

Perhaps: an identifier for the Determination ....

dshorthouse commented 3 years ago

@debpaul - agree, it's a brain-numbing tautology and less than helpful for non-English speakers. But, it's the DwC term's definition: https://dwc.tdwg.org/list/#dwc_identificationID

debpaul commented 3 years ago

So, I wonder how hard it would be (dare I say it) to request the definition be changed? @jmacklin

deepreef commented 3 years ago

So...yeah.... I've had this problem with DwC ever since the Identification class was first established. I really, really, really wish that the word "Determination" had been used instead (that's a common word in the Museum community that means pretty-much the same thing). The tautology with the oh-so-fundamental-to-our-domain "Identifier" just begs for even more confusion & miscommunication than we already suffer.

Having said that, I agree with @dshorthouse (or at least, what he seems to suggest above), that it's probably not worth it to change the term. The identifier/Identification confusion is probably the lesser of evils compared to changing the long-established DwC Class term name. The former is annoying, but the latter has practical implications for anyone who has implemented exports using that DwC class.

My 2 cents, anyway...

rukayaj commented 3 years ago

It sounds like identificationID is supposed to be linking to the identification history extension https://rs.gbif.org/extension/dwc/identification.xml ? I think it's great to start breaking the star schema a bit like this! But what about measurementID for the "measured" actions? There's a lot of potential here.

dshorthouse commented 3 years ago

@rukayaj Rukaya - There are indeed other extensions that have terms for agents. The GGBN extensions come to mind. And so, this has the potential to become very, very messy if we were to attempt to fully accommodate all of them as separate columns/terms here to work around the limitations in the star schema.

The Extended Measurements or Facts extension indeed has the term measurementDeterminedBy. We could also add measurementID just as we've added identificationID here. Or, do we need a more generic way to reference another extension, perhaps with two terms extension and extensionID? This assumes of course that there is merely a single term in another extension that holds an agent namestring referred to here in this extension. And, it sets a precedent that effectively builds an artificial, hierarchical relationship among extensions that will most certainly result in illogical joins (i.e. what if a future extension also has referential terms extension and extensionID)?

rukayaj commented 3 years ago

I think a more generic way to reference another extension is a good idea, but it'd need extension, extensionID and extensionAgentTerm (or similar) to work around the problem you've highlighted of there potentially being more than one term that holds an agent namestring. I am guessing GBIF would have to do a lot of work to ingest data in this format, and it does make validating more complicated.

I'm not following why it would result in illogical joins, or why it would be a problem if a future extension had the same referential terms, can you give me an example?

dshorthouse commented 3 years ago

> I'm not following why it would result in illogical joins, or why it would be a problem if a future extension had the same referential terms, can you give me an example?

I knew you were going to ask this :)~

What I'm getting at here is the distinct possibility for intended (or unintended) many:many joins. It struck me that this dimensionality opens up a world of new pain.
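One way to read the many:many concern, as a minimal sketch with invented rows: if agent rows and Identification History rows share only the core key, a join pairs every agent with every determination. The field names other than identificationID and the example value 9992 are assumptions made for illustration.

```python
from itertools import product

# Hypothetical rows; names and values are invented for illustration only.
agent_actions = [
    {"occurrenceID": "occ-1", "name": "Agent A", "action": "identified"},
    {"occurrenceID": "occ-1", "name": "Agent B", "action": "identified"},
]
identifications = [
    {"occurrenceID": "occ-1", "identificationID": "9992", "scientificName": "Name one"},
    {"occurrenceID": "occ-1", "identificationID": "9993", "scientificName": "Name two"},
]


def join_on(left, right, key):
    """Naive inner join of two lists of dicts on a shared key."""
    return [
        {**l, **r}
        for l, r in product(left, right)
        if key in l and key in r and l[key] == r[key]
    ]


# Joining on the core key alone yields every agent/determination pairing.
ambiguous = join_on(agent_actions, identifications, "occurrenceID")

# Carrying identificationID on the agent rows restores a one-to-one pairing.
keyed_actions = [
    {**agent_actions[0], "identificationID": "9992"},
    {**agent_actions[1], "identificationID": "9993"},
]
unambiguous = join_on(keyed_actions, identifications, "identificationID")
```

Here the first join returns four rows for two agents and two determinations, while the keyed join returns two. A generic extension/extensionID pair would need the same kind of row-level key in every extension it points at to avoid the first outcome.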

deepreef commented 3 years ago

I think our entire community has grossly underutilized dwc:ResourceRelationship. It seems to me that all inter-class relationships should be captured and represented through this class and associated properties (and probably the intra-class ones as well). But I guess our community isn't really there yet. Perhaps the limitation is that it assumes that resourceID and relatedResourceID are somehow self-evident in terms of their domain (= extension?). Maybe dwc:ResourceRelationship needs two more properties along the lines of resourceDomain and relatedResourceDomain?

rukayaj commented 3 years ago

@deepreef are you suggesting using dwc:ResourceRelationship as a bridging table to capture the relationship/actions between agents and records in cores/extensions, allowing a many:many relationship? How would resourceDomain and relatedResourceDomain work, like this?:

| resourceRelationshipID | resourceID | relatedResourceID | relationshipOfResource | relationshipAccordingTo | relationshipEstablishedDate | relationshipRemarks | resourceDomain | relatedResourceDomain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 111 | a | 1 | collected by | | 2007-03-01T13:00:00Z/2007-03-02T18:00:00Z | | occurrence | agent |
| 222 | a | 2 | georeferenced by | | 2007-05-01 | | location | agent |
| 333 | a | 3 | quality-checked by | | 2007-05-02 | | location | agent |

It doesn't let you capture things like displayOrder and role though.
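As a rough sketch of how such a bridging table might be read, assuming resourceDomain and relatedResourceDomain exist (they are only proposed in this thread, not existing DwC terms): the domain columns are what disambiguate the local ID "a", which appears both as an occurrence and as a location. The lookup tables and the `describe` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRelationship:
    resourceRelationshipID: str
    resourceID: str
    relatedResourceID: str
    relationshipOfResource: str
    relationshipAccordingTo: str
    relationshipEstablishedDate: str
    relationshipRemarks: str
    resourceDomain: str          # proposed in this thread, not a DwC term
    relatedResourceDomain: str   # proposed in this thread, not a DwC term


# The three example rows, with empty strings for the blank cells.
rows = [
    ResourceRelationship("111", "a", "1", "collected by", "",
                         "2007-03-01T13:00:00Z/2007-03-02T18:00:00Z", "",
                         "occurrence", "agent"),
    ResourceRelationship("222", "a", "2", "georeferenced by", "",
                         "2007-05-01", "", "location", "agent"),
    ResourceRelationship("333", "a", "3", "quality-checked by", "",
                         "2007-05-02", "", "location", "agent"),
]

# Hypothetical lookup tables keyed by (domain, local ID).
records = {("occurrence", "a"): "occurrence a", ("location", "a"): "location a"}
agents = {("agent", "1"): "agent 1", ("agent", "2"): "agent 2", ("agent", "3"): "agent 3"}


def describe(row):
    """Resolve both ends of a relationship using the domain columns."""
    subject = records[(row.resourceDomain, row.resourceID)]
    agent = agents[(row.relatedResourceDomain, row.relatedResourceID)]
    return f"{subject} {row.relationshipOfResource} {agent}"


statements = [describe(row) for row in rows]
```

Without the two domain columns, the key "a" alone could not tell the occurrence from the location.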

So yes, this kind of thing might open up a world of new pain, and perhaps our community is not really there yet. But it's 2020, and we SHOULD be there, tackling these painful problems head on. Anyway, it's intriguing to think about and discuss, and we should keep in mind that sometimes you can't wait for people to be "ready", you just have to kickstart the revolution :)

deepreef commented 3 years ago

@rukayaj - I'm not necessarily advocating it, just suggesting that dwc:ResourceRelationship can serve this function (and many other functions in dwc-space where we fret about many-to-many relationships).

But...yes, that's along the lines of what I was thinking. You're absolutely right that an additional property of what I would call "sequence" (=display order) would be very valuable in this particular use case (and actually in a bunch of other use cases as well, so it may be a general-enough property that it could be justified for addition to this dwc class -- I would certainly support it!)

The Domain thing is tricky, and very much not in the realm of TDWG-2010; but maybe you're right that 2020 is the time to start embracing these things. So, my point about Domain is that resourceID and relatedResourceID need context. If they are LOD-compliant identifiers (i.e., http URIs), then context is embedded within the identifiers themselves. However, as an ardent proponent of disentangling identifiers from their dereferencing mechanisms, I prefer to parse the "identifier" bit from the "dereferencing" bit. The examples you give are certainly one way to at least point to context for the respective identifiers, but it might be better to add two additional properties for resourceClass and relatedResourceClass to capture values like that. There are other viable options for the Domain. For example, if resourceID is a DOI, then resourceDomain could be "http://doi.org". Or, maybe just "DOI". Or "Digital Object Identifier". Or, in the case of a person represented by an ORCID, then "ORCID". It would be better to use URIs for the Domains to make them machine-actionable. But even in 2020 I'm not sure we're there yet. Also, this probably applies to all of the "ID" terms in DwC, so perhaps not particular to this one (except in other uses for dwc:[x]ID terms, the context is usually more self-evident). Lots of messy things to work out yet.... but I think there still may be some "there" there (in utilizing dwc:ResourceRelationship more effectively).
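To make the "identifier bit" versus "dereferencing bit" split concrete, here is a minimal sketch that keeps the bare identifier and the domain separate and only combines them when a resolver is known. The `RESOLVERS` mapping, the `dereference_url` helper, and the choice of the https resolver addresses are assumptions for illustration; the identifier values are placeholders.

```python
# Hypothetical mapping from a resourceDomain label to a resolver base URI.
RESOLVERS = {
    "DOI": "https://doi.org/",
    "ORCID": "https://orcid.org/",
}


def dereference_url(domain, identifier):
    """Return a resolvable URL, or None when the domain has no known resolver."""
    base = RESOLVERS.get(domain)
    return None if base is None else base + identifier


# Placeholder identifiers, stored without any resolver prefix.
person_url = dereference_url("ORCID", "0000-0002-1825-0097")
local_url = dereference_url("occurrence", "a")  # local domain, nothing to resolve
```

A local domain such as "occurrence" simply yields None here, which is the case where the identifier only has meaning inside the data set.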

As for role, I guess I'm still trying to get my head around the difference between that and AgentAction (i.e., as captured in relationshipOfResource). It seems to me like "role" is just a more finely-parsed/granular version of AgentAction, but that's probably due to my incomplete comprehension of the distinction.

During one of the TDWG working sessions last month I asked if anyone was using dwc:ResourceRelationship, and quite a few said "yes". So maybe others have bumped into these issues (and created solutions)?

On a final note, I wrestled with this a lot when developing BioGUID.org -- which is essentially just a robust version of dwc:ResourceRelationship. I'm getting ready to dust that effort off, which is why I've been thinking about these things a lot lately.

deepreef commented 3 years ago

One other issue with dwc:ResourceRelationship -- some relationships are symmetrical, but most are not. Thus, a fair bit of thought needs to be put into directionality (e.g., is an agent always the relatedResource, or might it be the resource?). The brute-force approach is to represent every relationship in both directions, but that can be cumbersome.

rukayaj commented 3 years ago

Adding sequence to dwc:ResourceRelationship makes a lot of sense to me as well.

Ok, I see what you mean with the Domain thing, and yeah it does seem quite complicated. But it does seem that making some kind of start on using dwc:ResourceRelationship, however slowly, would be a good thing. And I think it would be cool if it was used to link up agents and actions with classes in a more normalised data structure. In our case, and I suspect in most cases, most of the data we publish is stored in normalised tables anyway, and it's always seemed a bit weird to denormalise it to publish it (as it results in bigger files).

Yes, I'm also a bit uncertain about the line between agentAction and Role. It'd be good to explore that a bit more.

Re the directionality of relationships in dwc:ResourceRelationship - yes it would be a drag to do it both ways. I'm not totally following what you mean by most relationships not being symmetrical though, surely if a specimen is collected by a collector, it follows that the collector collected the specimen? I mean, it can be traversed both ways. Is that what you mean by symmetry?

nielsklazenga commented 3 years ago

A symmetrical relationship is where the relationship and the inverse relationship are the same. Relationships between resources of different types cannot be symmetrical. In this case, the inverse relationship of 'recordedBy' would be 'recorded' and that of 'identifiedBy' 'identified' etc.
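A small sketch of the distinction, assuming relationships are stored in one direction only and the other direction is derived from a table of inverse labels. The 'recordedBy' and 'identifiedBy' pairs come from the comment above; the 'duplicate of' entry, the `invert` helper, and the record IDs are hypothetical.

```python
# Inverse labels from the examples above, plus one hypothetical symmetrical
# relationship ("duplicate of") that is its own inverse.
INVERSE = {
    "recordedBy": "recorded",
    "identifiedBy": "identified",
    "duplicate of": "duplicate of",
}
# Make the table usable in both directions.
INVERSE.update({inverse: label for label, inverse in list(INVERSE.items())})


def invert(relationship):
    """Swap resource and related resource and replace the label by its inverse."""
    resource, label, related = relationship
    return (related, INVERSE[label], resource)


def is_symmetrical(label):
    """A relationship is symmetrical when it is its own inverse."""
    return INVERSE.get(label) == label


stored = ("specimen-1", "recordedBy", "agent-1")
derived = invert(stored)
```

Deriving the inverse on demand avoids storing every relationship twice, at the cost of maintaining the table of inverse labels.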

The bigger problem is that ResourceRelationships where the related resource is of a different type than the resource will break the Darwin Core Archive star schema, if you want to have both types of resources in the same archive.

deepreef commented 3 years ago

@nielsklazenga : agreed on breaking the star schema, unless the relatedResourceID is a URI that is itself actionable.

nielsklazenga commented 3 years ago

@deepreef Yes, that's why I added the 'if you want to have both types of resources in the same archive'. (I wrote 'schema' accidentally initially)

rukayaj commented 3 years ago

What are the implications of breaking the star schema? identificationID as proposed above is kind of breaking the star schema already, isn't it?

dagendresen commented 3 years ago

[It is also possible to imagine a world (that does not exist yet) where an aggregator could simply lookup the properties it is interested in from the dereference endpoint...]

infinite-dao commented 9 months ago

Is there a clear example where identifier and identificationID are both used properly within a single data set? I don't quite understand what “An identifier for the identification” is supposed to point to, i.e. which side of the identification: source or resource?

When I do name matching and find a new identifier, I have a source side and a resource side. Which side the terms identifier and identificationID are intended for, design-wise, is not yet clear to me.

For example, I have an occurrenceID (source), and after name matching I also have a WikiDataID (= resource, or other IDs). Which side are identifier and/or identificationID intended for? Thank you for the clarification.

matdillen commented 9 months ago

identificationID is included in this extension solely for the case of the agent performing the action of identified, and the non-agent data of this "identification event" (e.g. the taxon name assigned to the specimen) being listed in the Identification History extension to Darwin Core. This extension uses identificationID as its primary key and hence a link can be made between extensions to avoid ambiguity.

In short, identifier is a PID for the agent performing the action, whereas identificationID is a locally unique ID for the identification event which the agent performed (if action is identified).
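The distinction can be sketched as follows, assuming agent rows carry name, identifier, action and identificationID fields, and that Identification History rows are keyed by identificationID. All rows, the field names other than identifier and identificationID, and the `determination_for` helper are hypothetical.

```python
# Hypothetical Identification History rows, indexed by their primary key.
identification_history = {
    "9992": {"scientificName": "Example name", "dateIdentified": "2020-01-01"},
}

# Hypothetical agent rows: `identifier` is the agent's PID, while
# `identificationID` is only filled in when the action is "identified".
agent_actions = [
    {"name": "Example Agent", "identifier": "https://orcid.org/0000-0002-1825-0097",
     "action": "identified", "identificationID": "9992"},
    {"name": "Example Agent", "identifier": "https://orcid.org/0000-0002-1825-0097",
     "action": "collected", "identificationID": ""},
]


def determination_for(row):
    """Return the linked determination, but only for 'identified' actions."""
    if row["action"] != "identified" or not row["identificationID"]:
        return None
    return identification_history.get(row["identificationID"])


linked = [determination_for(row) for row in agent_actions]
```

The same agent PID appears on both rows, but only the "identified" row resolves to a determination.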

timrobertson100 commented 9 months ago

Edited: this was based on a misunderstanding of the above. Please ignore

> whereas identificationID is a locally unique ID for the identification event which the agent performed (if action is identified).

~Thanks Matt. With this clarification, I think the opening definition might be tightened up a bit. As currently written, it could be interpreted as an identifier for the result of the identification activity, not an identifier for the activity (event) itself:~

An identifier for the Identification (the body of information associated with the assignment of a scientific name)...

~Perhaps something along the lines of this (someone more eloquent than me can surely improve it, but I hope you get the gist):~

An identifier for the activity (Event) that results in a scientific name or other taxonomic unit being applied to the...

~(taxonomic unit added for molecular-based identification, such as to some species hypothesis)~

nielsklazenga commented 9 months ago

Isn't it just:

An identifier for the dwc:Identification

?

I do not think there is any need to be cleverer than that. This is one of the very few instances where it actually might be good to use the namespace alias in the definition. Darwin Core does not distinguish between the action and the result and why would it? Just confuses the hell out of people.

timrobertson100 commented 8 months ago

Thanks, @nielsklazenga - that is what I understood from the opening description, and your proposal is nice and clear. It was the reference to "whereas identificationID is a locally unique ID for the identification event" that threw me.

matdillen commented 8 months ago

The confusion did not arise from action/event vs result, but from the ambiguity between identification as in dwc:Identification, i.e. a taxonomic determination, and identification as in adding a PID (identifier) to a person name (such as the person who did the identification, i.e. dwc:identifiedBy).

I didn't intend to propose a new definition, just clarify the distinction. Maybe we can just replace the first sentence in the current definition (including the part in parentheses) with Niels's proposal? But keep the additional clarification explaining why the term is in this extension, as it is a bit weird (cf. discussion a few years ago). Ergo:

An identifier for the dwc:Identification. May be a global unique identifier or an identifier specific to the data set. The intent of this term is to maintain the relational integrity between the concept:identified action and a particular determination/annotation expressed in the Identification History Extension when present.

timrobertson100 commented 8 months ago

Thanks @matdillen - I think I just misunderstood your intent then.

infinite-dao commented 8 months ago

> The confusion did not arise from action/event vs result, but from the ambiguity between identification as in dwc:Identification, i.e. a taxonomic determination, and identification as in adding a PID (identifier) to a person name (such as the person who did the identification, i.e. dwc:identifiedBy).

Yes, exactly. Thinking in everyday language (to identify something, identifier, etc.) was the source of my misunderstanding. Now I understand that identificationID only applies in the strict sense of action = identified. Thanks, sometimes it helps to read out the full term "identification identifier" and not swallow the "ID" into nothing ;-)