tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

New Term - parentMaterialSampleID #344

Open deepreef opened 3 years ago

deepreef commented 3 years ago

New term

Proposed attributes of the new term:

Term originally proposed a year ago by @thomasstjerne on the GBIF GitHub. See the discussion around changes to MaterialSample in DwC (#314) and GBIF issue #37. This new term has direct relevance to dwc:preparations, in cases where multiple different preparations are derived from the same whole specimen.

dagendresen commented 3 years ago

Excellent! Maybe the example format could be: urn:uuid:6e43b33d-88ce-4a37-ad94-74d6c99b9e25

deepreef commented 3 years ago

Thanks! RE: the example, I was following the template of other similar terms in DwC (e.g., materialSampleID). Also, I generally try to minimize the inclusion of dereferencing metadata from identifiers; but that's more of a personal preference.

dagendresen commented 3 years ago

Do you consider urn:uuid: as dereferencing metadata?

dagendresen commented 3 years ago

Example of a preserved specimen of bluethroat with a blood sample extracted for DNA in support of the need for the proposed parentMaterialSampleID term.

Preserved specimen (mounted)

basisOfRecord = PreservedSpecimen
catalogNumber = NHMO-BI-104452/2-P
occurrenceID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3
materialSampleID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3
organismID = urn:uuid:246afd01-f734-5da9-874b-4a09f26030f8

Blood sample

basisOfRecord = MaterialSample
catalogNumber = NHMO-BI-104452/1-B
occurrenceID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c
materialSampleID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c
organismID = urn:uuid:246afd01-f734-5da9-874b-4a09f26030f8
parentMaterialSampleID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3

DNA sample (not yet here, but available for other)

parentMaterialSampleID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c

...apropos, this raises the question of whether reusing the occurrenceID UUID as the materialSampleID UUID is correct at all. However, a value for occurrenceID is mandatory for the records to be published in GBIF.
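To illustrate how a consumer might use the proposed term, here is a minimal sketch that walks the parentMaterialSampleID links from the example above. The `records` dictionary and the `lineage` helper are hypothetical names for illustration only, and the DNA sample's own materialSampleID is not given in the thread, so a placeholder key stands in for it.

```python
# Records transcribed from the bluethroat example; keyed by materialSampleID.
SPECIMEN = "urn:uuid:11142195-4865-4b52-baed-1b76a39613a3"
BLOOD = "urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c"
DNA = "dna-sample-placeholder"  # hypothetical ID, not from the thread

records = {
    SPECIMEN: {
        "basisOfRecord": "PreservedSpecimen",
        "catalogNumber": "NHMO-BI-104452/2-P",
        "parentMaterialSampleID": None,
    },
    BLOOD: {
        "basisOfRecord": "MaterialSample",
        "catalogNumber": "NHMO-BI-104452/1-B",
        "parentMaterialSampleID": SPECIMEN,
    },
    DNA: {
        "basisOfRecord": "MaterialSample",
        "catalogNumber": None,
        "parentMaterialSampleID": BLOOD,
    },
}


def lineage(sample_id, records):
    """Return the chain of IDs from sample_id up to its root parent."""
    chain = []
    seen = set()
    current = sample_id
    while current is not None and current in records:
        if current in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = records[current]["parentMaterialSampleID"]
    return chain


for step in lineage(DNA, records):
    print(step, records[step]["basisOfRecord"])
```

Without parentMaterialSampleID, the three records share only organismID, so the fact that the DNA came from the blood sample (rather than directly from the mounted specimen) could not be recovered.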

dagendresen commented 3 years ago

Here is an example of a bluethroat (Luscinia svecica subsp. svecica) from which 7 MaterialSamples were extracted, in support of the need for the proposed parentMaterialSampleID term. For many of these bluethroats we lack parentMaterialSampleID to describe the hierarchy between material samples and their sub-samples for DNA (that is, to describe whether the DNA sample was sub-sampled from the blood sample, the tissue sample, the sperm sample, etc., each preserved as a separate biobank MaterialSample).

organismID = urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f

deepreef commented 3 years ago

@dagendresen : MANY thanks for the great example!

Do you consider urn:uuid: as dereferencing metadata?

Well... I guess it's technically not "dereferencing" metadata (like http://dx.doi.org/ or https://doi.org/); but it is still metadata, which basically translates to "What follows should be interpreted as a Uniform Resource Name, of the type Universally unique identifier". The actual "identifier" itself is the stuff that comes after the second : (in the same way that the stuff that comes after the third / in https://doi.org/10.3897/zookeys.641.11500 is the actual identifier).

I don't want to hijack this thread, but just to make a point... this is the closest representation of the actual identifier for your organismID in the post above, that can be rendered in textual form: 11100101100100111000001110001010111101111010100101011110111100101010000001001010001010111111110001111100100100000111011100011111 (i.e., 128 consecutive bits, represented here as 1s and 0s)

A less cumbersome way to display this value to human eyeballs would be in hexadecimal form: e593838af7a95ef2a04a2bfc7c90771f (that reduces it to 32 characters, instead of 128)

It could also be represented as a decimal number: 305159146678742414161168577211252373279 (but that increases the number of characters to 39)

The most text-economical way to represent it is in base64: 5ZODivepXvKgSiv8fJB3Hw (22 characters; but with a bonus: "Dive" is in there! Cool! It must be a sign...)

Of course, the most common way to represent it (and the way most people provide them to GBIF) is in the so-called canonical textual representation: e593838a-f7a9-5ef2-a04a-2bfc7c90771f (36 characters). This form is already embellished with an additional four characters (hyphens) that are not actually part of the 128 bits of the identifier itself. They're added for the benefit of human eyeballs, presumably because breaking it up into an 8-4-4-4-12 template is less scary to humans (there are other technical reasons, but not important ones, in my opinion).

Microsoft unhelpfully represents them sometimes using upper-case letters: E593838A-F7A9-5EF2-A04A-2BFC7C90771F (also the form I regretfully chose for rendering as ZooBank LSIDs). Or, even worse, with curly brackets: {e593838a-f7a9-5ef2-a04a-2bfc7c90771f}

I get why it's useful in the context of RFC 4122 to pre-pend them with the aforementioned metadata (urn:uuid:), as you advocate: urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f And honestly, other than the canonical text form, I could be most easily persuaded to embrace this form (it's certainly better than pre-pending LSID metadata, as in something like urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC)
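For anyone who wants to reproduce the different renderings listed above, here is a small sketch using only the Python standard library (`uuid` and `base64`). The variable names are mine, and the base64 form assumes the 16 bytes are taken in big-endian order with the trailing "=" padding dropped.

```python
import base64
import uuid

u = uuid.UUID("e593838a-f7a9-5ef2-a04a-2bfc7c90771f")

binary = format(u.int, "0128b")   # 128 characters of 1s and 0s
hex32 = u.hex                     # hexadecimal, no hyphens
decimal = str(u.int)              # plain decimal number
# Base64 of the 16 raw bytes (big-endian), with "=" padding removed.
b64 = base64.b64encode(u.bytes).decode("ascii").rstrip("=")
canonical = str(u)                # 8-4-4-4-12, lower case
upper = canonical.upper()         # the upper-case variant
braced = "{" + canonical + "}"    # the curly-bracket variant
urn = u.urn                       # canonical form with the urn:uuid: prefix

for label, value in [
    ("binary", binary),
    ("hex", hex32),
    ("decimal", decimal),
    ("base64", b64),
    ("canonical", canonical),
    ("upper", upper),
    ("braced", braced),
    ("urn", urn),
]:
    print(f"{label:10} {len(value):4d}  {value}")
```

All eight strings parse back to the same 128-bit value; for example, `uuid.UUID(braced) == u` and `uuid.UUID(urn) == u` both hold, because Python's `uuid.UUID` constructor tolerates the braces, the hyphens, and the prefix.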

But here's my point: the actual identifier is 128 consecutive 1s and 0s -- which is how most database systems actually store them on disk, in the form of 16-byte numbers. However, they're almost always presented (and consumed) as text strings -- usually UTF text strings, which at 16 bits per character (as in UTF-16) make them a whopping 576 bits in canonical form. So basically, we're consuming 4.5x as many bytes as the actual identifier, just to make them a little bit more human-friendly.
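As a quick check of the storage arithmetic, the sketch below counts bits for the canonical string under two encodings. It assumes the 576-bit figure refers to a 2-bytes-per-character encoding such as UTF-16; in UTF-8 the same string takes half that. The variable names are illustrative only.

```python
canonical = "e593838a-f7a9-5ef2-a04a-2bfc7c90771f"
prefix = "urn:uuid:"

raw_bits = 128  # the identifier itself, stored as a 16-byte number

# Encode without a byte-order mark so only the characters are counted.
utf16_bits = len(canonical.encode("utf-16-le")) * 8
utf8_bits = len(canonical.encode("utf-8")) * 8
prefix_utf16_bits = len(prefix.encode("utf-16-le")) * 8

print("UTF-16 canonical:", utf16_bits, "bits,", utf16_bits / raw_bits, "x raw")
print("UTF-8 canonical: ", utf8_bits, "bits,", utf8_bits / raw_bits, "x raw")
print("UTF-16 prefix:   ", prefix_utf16_bits, "bits")
```

Under the UTF-16 assumption, the canonical form comes to 576 bits (4.5 times the raw 128), and the nine-character urn:uuid: prefix alone adds 144 bits.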

You could argue that the form urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f makes them even more computer-friendly (at the cost of only an additional 144 bits... more than the actual identifier itself, BTW), but I would argue "not really". While I know the intent of the RFC 4122 system was to allow computers to automate things, I'm not sure how much it's caught fire broadly among people who process this information (and write code to process this information). I bet the first thing that a lot of developers (most?) would do with urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f is strip off the first 9 characters, then write the rest of the code based on matching the canonical text form. And even without the prefix, it's not too hard to incorporate a regular expression to identify a UUID within any text string (including those from PLAZI, which typically lack the hyphens).
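To make the "strip the prefix, then match the canonical form" point concrete, here is a hedged sketch of the kind of regular expression a developer might write. `UUID_RE` and `find_uuids` are hypothetical names; the pattern accepts upper or lower case, with or without hyphens, and ignores whatever surrounds the match (a urn:uuid: prefix, an LSID prefix, curly brackets).

```python
import re
import uuid

HEX = "[0-9a-fA-F]"
UUID_RE = re.compile(
    rf"(?<!{HEX})"  # not preceded by another hex digit
    rf"{HEX}{{8}}-?{HEX}{{4}}-?{HEX}{{4}}-?{HEX}{{4}}-?{HEX}{{12}}"
    rf"(?!{HEX})"   # not followed by another hex digit
)


def find_uuids(text):
    """Return every UUID found in text, normalized to canonical lower case."""
    return [str(uuid.UUID(m.group(0))) for m in UUID_RE.finditer(text)]


samples = [
    "urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f",
    "{E593838A-F7A9-5EF2-A04A-2BFC7C90771F}",
    "e593838af7a95ef2a04a2bfc7c90771f",
    "urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC",
]
for s in samples:
    print(find_uuids(s))
```

The first three inputs all normalize to the same canonical string, which is the sense in which the prefix adds little for code that matches on the UUID pattern itself.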

OK... like I said, I don't want to hijack this thread with a diatribe about identifiers, but it appears that ship has already left the barn (or something like that).

tucotuco commented 3 years ago

@deepreef This looks like a solid proposal. I took pause at first at the "and potentially other MaterialSamples were derived, or which they collectively comprise" in the definition. It seemed odd to refer to other entities than required to define the concept, but these additions really do help to nail down more broadly how to use the term in practice, and they do nothing to obscure the immediate concept, so I end up quite liking it. Is that example a real identifier for a MaterialSample somewhere? I try to make sure the examples are real. If it isn't, can we use that provided by @dagendresen? Thanks Dag for the great illustration of usage.

deepreef commented 3 years ago

Thanks, @tucotuco

I took pause at first at the "and potentially other MaterialSamples were derived, or which they collectively comprise" in the definition. It seemed odd to refer to other entities than required to define the concept, but these additions really do help to nail down more broadly how to use the term in practice, and they do nothing to obscure the immediate concept, so I end up quite liking it.

Yeah, that's the part of the proposal I was most queasy about. I modelled the definition after the existing definition for parentEventID: "An identifier for the broader Event that groups this and potentially other Events."

I originally had it as:

"An identifier for the broader MaterialSample from which this and potentially other MaterialSamples were derived."

But that seemed incomplete, so I added the extra ", or which they collectively comprise" (to avoid people nit-picking the definition of "derived")

Is that example a real identifier for a MaterialSample somewhere?

Yup! And not chosen at random either (here's a hint: search for occurrenceID 4fed2b94-7fb1-4a49-9315-0810171fc507). I was kinda disappointed that there didn't seem to be any way to search GBIF on materialSampleID (doesn't even seem to show up in the full data record). I wanted to find other real-world examples of what values people are presenting under that term, so I could have more than just the UUID example. I even downloaded ~2M GBIF records (Hawaii records -- I need them for another project anyway) so I could get a sampling of other real-world values for materialSampleID; but I was having trouble importing the download into a database, so I gave up and just entered the UUID. I figured that's the only example given for materialSampleID anyway, so might as well be consistent. Except I chose a different UUID for the example, for entirely narcissistic reasons (in my defense, if I were a true narcissist, I would have gone with 65fea8a6-c595-4f5b-adda-d1d176f40e7c - I'll make you wait until GBIF adds support for searching on materialSampleID to see what that one is).

In any case... I've added the example from @dagendresen as a second one (even though I'm queasy on the urn:uuid: thing...)

debpaul commented 3 years ago

Haha @deepreef wrote:

OK... like I said, I don't want to hijack this thread with a diatribe about identifiers, but it appears that ship has already left the barn (or something like that).

Me either. But here goes. Many moons ago Greg and I asked, do we need the prefix? (answer no). Who or what really needs the "urn:uuid" declaration? A machine can figure out it's a UUID. A human can see it? The field itself comes with expectations of what to find in it. The prefix is redundant, no?

tucotuco commented 3 years ago

Related issues are Issue #1, Issue #3, Issue #24 (reopened because of renewed interest), Issue #314, Issue #332, Issue #345, Issue #346, and Issue #347.