Status: Open. emiliom opened 7 months ago
Follow-ups, in the order they were sent on Slack, each starting with the person who sent it:
@ymgan: I feel you Emilio … I have had very long IDs too. This reminds me of the discussions on the thread at https://github.com/tdwg/dwc/issues/491. Sorry that I don’t have any solutions for this.
@emiliom: Thanks for pointing to that discussion, @ymgan It looks really helpful. I'm not expecting solutions here; just pearls of wisdom and a sense of what's been found to be most useful and practical.
@jdpye: My strong preference for meaningful, data-derived IDs comes from a few places: the history of practice at OTN for making meaningful ID fields, the tendency of researchers to use very generic internal IDs for the components of their studies, and my need to find and amend records throughout the pipeline when source data changes.
Using UUIDs for the sake of guaranteeing uniqueness feels like we are avoiding the work of defining the set of things that makes our record unique. There's no performance penalty for having long ID fields, and they save us far more often than they would ever hinder us as human operators. So it's true, I've never been convinced by the UUID advice.
https://github.com/tdwg/dwc/issues/491#issuecomment-1680804270 this guy knows what's up.
@albenson-usgs: Yeah no need for me to rehash what I already said in that DwC thread but just to say that I think this is a topic that is still very fraught and unresolved. My preference at this particular time is for human-resolvable IDs but I know that's not everyone's preference.
@timvdstap: For what it's worth, I'm on the same page as Abby and Jon!
@ymgan: +1 from me! If they have an occurrence table in their database and adding a UUID field is easy, then we go for that. However, there were times when a data provider did not have an Occurrence table in their database, but rather constructed the occurrence view table by joining multiple tables. I couldn’t find a way to track this with UUIDs each time they update the dataset. In this case, we asked our data provider to use the columns that are least likely to change (NOT institutionCode, because an institute could be renamed; NOT triplets) to create a composite identifier for occurrenceID. Not everyone’s preference either …
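A minimal sketch of the composite-identifier idea described above, assuming the provider's occurrence view exposes a few columns that rarely change. The column names (`cruise`, `station`, `taxon_code`), the example values, and the `:` separator are all hypothetical; they are not taken from any provider's schema.

```python
def composite_occurrence_id(record, stable_fields, sep=":"):
    """Join the values of stable columns into a single occurrenceID string."""
    parts = []
    for name in stable_fields:
        raw = record.get(name)
        value = "" if raw is None else str(raw).strip()
        if not value:
            # A missing component could silently produce colliding IDs.
            raise ValueError(f"missing value for {name!r}")
        # Replace runs of whitespace with "_" so the ID contains no spaces.
        parts.append("_".join(value.split()))
    return sep.join(parts)


# Hypothetical row from an occurrence view built by joining several tables.
row = {"cruise": "AB12", "station": "S 03", "taxon_code": "CALFIN", "count": 7}
occurrence_id = composite_occurrence_id(row, ["cruise", "station", "taxon_code"])
print(occurrence_id)
```

Because the ID is derived only from the listed columns, rebuilding the view after a dataset update yields the same occurrenceID as long as those column values are unchanged.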
My follow up after the input received.
Thanks again for everyone's input! I've been trying to digest the input here and the discussions in tdwg/dwc #491. There are just too many relevant topics that come to mind, so I'll stop trying to compile "all" relevant threads and considerations, and will list what I have:
occurrenceID, in this paragraph. Though I'll reemphasize that I'm not interested solely in occurrenceID, but also in eventID. I didn't bother to look up what GBIF or OBIS say about eventID.
occurrenceID relevant, too.
Alright, enough on IDs! I already have work to do to lay out how my data-alignment code will need to change so that the IDs I generate for the first version submitted to OBIS are reused in future data-update versions.
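One possible way to make regenerated IDs come out identical in every data-update version, sketched here as an assumption and not as the approach actually adopted in this thread, is to derive a name-based UUID (version 5) from a stable natural key using Python's standard `uuid` module. The namespace URL and the example key below are hypothetical.

```python
import uuid

# Hypothetical per-dataset namespace. Any fixed UUID works, provided it is
# never changed after the first submission.
DATASET_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.org/datasets/my-dataset"
)


def stable_id(natural_key):
    """Return the same UUID string every time for the same natural key."""
    return str(uuid.uuid5(DATASET_NAMESPACE, natural_key))


# The same key yields the same ID on the first submission and on later updates.
first_version = stable_id("AB12:S_03:CALFIN")
later_version = stable_id("AB12:S_03:CALFIN")
print(first_version == later_version)  # True
```

The trade-off is that the ID is only as stable as the natural key: if any component of the key is edited in the source data, a new ID results. Persisting a lookup table of key-to-ID pairs alongside the dataset is an alternative when keys are expected to change.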
From @albenson-usgs: Great summary and resources Emilio! I would definitely advocate for putting this in the issues so the conversation can be found later. I will say we did discuss some of this at the ESIP Biological Data Standards cluster in making the primer. Also, the ESIP Physical Samples cluster is very keen on IGSNs but the global TDWG community doesn't seem to be and I'm not entirely sure why. Alex Hardisty in the EU really wants the Digital Extended Specimens concept instead of IGSN and I haven't spent the time needed to figure out why (or at least that was my takeaway from some meeting I was in where I was asking Alex about IGSN).
No surprise here, but there was already an issue on this topic in this repo, from 2021: #80
Memory lanes upon memory lanes. Most of my advice from back then stands, and I'm particularly fond of my field-by-field explainer in https://github.com/ioos/bio_data_guide/issues/80#issuecomment-967231859
We had missed this resource from the OBIS Manual on "Constructing and using identifier codes"! https://manual.obis.org/identifiers.html
Thanks to the SMBD team at today's meeting for unearthing it. It looks pretty useful.
This thread started on the Standardizing Marine Biological Data Slack on March 20, 2024. As it's of general interest, I'm moving it here so it's accessible to others more openly.
I'm curious to hear what heuristics or rules of thumb others are using to create IDs for the aligned data. I've settled on using UUIDs for occurrences and semi-intelligible IDs for events. But even for events it gets a bit crazy because I'm using a hierarchical set of event types (cruise > station visit > sample) and have tried to include some of that hierarchy in the IDs of the first two types, so IDs get long; for sampling events, the data generator uses unique sample IDs, so I've reused those. I have also used a dataset prefix for event IDs in a probably silly attempt to have the IDs be kind of globally unique or at least easily recognized as belonging to the same dataset. But that also leads to long IDs, and I'm not sure if it's worth it. Thoughts? I know @jdpye had thoughts on this because we exchanged a couple of messages on this Slack (now hidden) ...
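For concreteness, the event scheme described above might look roughly like the following sketch. The function name, the `:` separator, and the example codes are hypothetical, and applying the dataset prefix to the reused sample IDs is an assumption; only `eventID` and `parentEventID` are Darwin Core terms.

```python
def build_event_rows(dataset_prefix, cruise, station_visit, sample_ids):
    """Build eventID/parentEventID pairs for cruise > station visit > sample."""
    # Cruise and station-visit IDs encode the hierarchy, so they grow longer.
    cruise_id = f"{dataset_prefix}:{cruise}"
    visit_id = f"{cruise_id}:{station_visit}"
    rows = [
        {"eventID": cruise_id, "parentEventID": None},
        {"eventID": visit_id, "parentEventID": cruise_id},
    ]
    # Sample events reuse the data generator's own unique sample IDs
    # (prefixed here by assumption) and point to the station visit as parent.
    for sample_id in sample_ids:
        rows.append(
            {"eventID": f"{dataset_prefix}:{sample_id}", "parentEventID": visit_id}
        )
    return rows


rows = build_event_rows("DS1", "CR2019", "ST04", ["S-001", "S-002"])
for row in rows:
    print(row["eventID"], "<-", row["parentEventID"])
```

Keeping the hierarchy in `parentEventID` means the sample-level IDs can stay short, while the prefix still marks every event as belonging to the same dataset.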