tdwg / tag

Technical Architecture Group
https://tag.tdwg.org/

Early 2023 meeting to organize an identifiers task group #36

Open baskaufs opened 1 year ago

baskaufs commented 1 year ago

At the 2022-11-07 TAG working session, it was agreed to close the three existing issues relating to identifiers (https://github.com/tdwg/tag/issues/14, https://github.com/tdwg/tag/issues/2, and https://github.com/tdwg/tag/issues/9) and replace them with this one. Please refer to those closed issues for background, suggested participants, and discussion about how the issue of LSID should be handled.

Specifically:

baskaufs commented 1 year ago

Please note that on 2017-10-18, the Executive Committee recorded a decision in their issue tracker (https://github.com/tdwg/exec/issues/90, not publicly viewable): "The XC has agreed (XC1709) to deprecate the GUID AS." However, no notation was made in the standards documents either that the umbrella standard (http://www.tdwg.org/standards/150) was moved into the retired standards category or that the LSID Applicability Statement document (http://rs.tdwg.org/guid/doc/lsidas/, one of two documents in the standard) was marked as deprecated. So this deprecation should probably be considered unimplemented.

Note also that on multiple occasions at TAG meetings in 2022 members stated that even if TDWG no longer recommends the adoption of LSID as a technology for new systems, they are in wide use. Therefore it is questionable whether it is appropriate to deprecate the LSID AS.

debpaul commented 1 year ago

@baskaufs Let's say that a particular database no longer mints LSIDs. However, many LSIDs were published from this resource and have been used and published. Is it true that

Note added: my hope was to help this particular group understand the need to store these old LSIDs (in perpetuity) even if they won't resolve ever again. The fact they are cited in literature, means that when the related object is referenced, it has that LSID as one particular "identifier" associated with it, that remains useful. This group was just wanting not to import their LSIDs into their new database, but just forget about them.

baskaufs commented 1 year ago

@debpaul: @rdmpage could say better, but my understanding is that LSIDs are still being minted in existing projects (like Zoobank and others) where their generation was established as part of the standard workflow years ago. So even if they don't resolve, they still are basically string identifiers for those objects. So one would treat them as any other authoritative string identifier and keep them.

I think the main point of clarification is that, for people setting up new identifier systems, TDWG no longer recommends them as a preferred identifier.

But I defer to the actual experts on the subject.

rdmpage commented 1 year ago

@debpaul @baskaufs

OK, we need to stop thinking that LSIDs are no longer being used! The recent review below lists pretty much every identifier type used in biodiversity informatics, and LSIDs feature several times. Databases such as ZooBank, IPNI, Index Fungorum, WoRMS, SpeciesFile, etc. use them. There are literally millions of LSIDs in the wild, and more are minted each day as new taxonomic names are published.

What has changed since, say, 2005, is that many (but not all) LSIDs are no longer resolvable using the original LSID protocol. Hence web proxies such as https://lsid.io, which aim to make LSIDs still work.

One way to think about the current situation is if DOIs stopped resolving, but people still minted them and included them in papers, etc. This is where we are with LSIDs. If TDWG is going to kill LSIDs, then IMHO it should

I guess it seems bizarre that TDWG's one significant foray into persistent identifiers crashed and burnt shortly before everyone got religion about persistent identifiers...

Agosti D, Benichou L, Addink W, Arvanitidis C, Catapano T, Cochrane G, Dillen M, Döring M, Georgiev T, Gérard I, Groom Q, Kishor P, Kroh A, Kvaček J, Mergen P, Mietchen D, Pauperio J, Sautter G, Penev L (2022) Recommendations for use of annotations and persistent identifiers in taxonomy and biodiversity publishing. Research Ideas and Outcomes 8: e97374. https://doi.org/10.3897/rio.8.e97374

baskaufs commented 1 year ago

@rdmpage Please note that I did NOT say that LSIDs were no longer used. I think it is true that TDWG no longer says that they are the recommended identifier (as it once did).

ianengelbrecht commented 1 year ago

With all the healthy discussion about identifiers happening in various places at the moment (here, TDWG Slack, literature, etc), can anyone recommend a nice (free) online knowledge management system where we could keep this all together? Something like a wiki but also allowing for discussion, debate, comments, differing opinions, points of view, and so on in the knowledge creation process? It would be nice to draw this all together in something a bit more stable/permanent and openly available in anticipation of the proposed meeting. Suggestions welcome.

(Just throwing in my two cents worth, perhaps the TDWG decision to deprecate LSIDs needs to be reconsidered, things may have changed since then. Personally I like LSIDs, they're user friendly, compared to UUIDs for example. That may be useful if our goal is community adoption of identifiers for things, ala ORCIDs for people for example, rather than only as pointers to database records).

tucotuco commented 1 year ago

@ianengelbrecht What is missing using GitHub for the knowledge management system you are talking about?

ianengelbrecht commented 1 year ago

It's an early idea, but this issue is an example - it's a task to arrange a meeting, and we have important information on the meeting topic being added here already. So later on, someone has to refer back to this issue, and various other places, to find and synthesise all the information being offered. It'll be less discoverable when the meeting is had and this issue is closed too. Also, I'm not sure that GitHub issues are the ideal means of eliciting inputs from a group of people this early on in a discussion. We tried to do something like this with the tdwg/apis group recently. We had a fruitful and vibrant meeting about how APIs should work. The core questions and topics were identified, taken across to GitHub and one issue created for each. But it kinda died after that. I'm a fan of collaborating on Google Docs, where good discussion on a topic is possible with comments, and the document updated accordingly, but a Google Doc becomes unwieldy when there's lots of information, lots of people, lots of discussion and lots of editing. I see Notion are touting their platform as a wiki option, with commenting functionality. I'm sure there are other possibilities too. All just thoughts at this stage.

debpaul commented 1 year ago

@rdmpage @tucotuco @ianengelbrecht @baskaufs please note my comment/question about LSIDs above was raised by me, in a meeting this morning with taxonomists.

Scenario: what to do with LSIDs that were minted in a given database (and published) in the past, but will no longer be resolving.

I never meant to imply that they would not be used by everyone / anyone ever again ... (I'm editing the above to clarify).

debpaul commented 1 year ago

@ianengelbrecht I know exactly what you mean. Anyone entering this world now (See above about folks who were new to the "identifier" vs "resolution" ideas) has a difficult time stepping in, when they read a GitHub thread like this one. Darwin Core Hour ... wiki might be one model solution. Another could be the GitHub "discussion" page where we could move these longer conversations and annotate them. ...

baskaufs commented 1 year ago

I realize that the GitHub system has its issues of people not knowing how to use it and things being somewhat fragmented. But it has the huge advantage that work does not get lost, which soon becomes a problem when a project grows. Also, it is the repository of record for making public and preserving the work of TDWG groups. So if you use another system, you'll have to figure out later how to move the significant content to GitHub for archiving.

One thing that I think works relatively well is to do editing and hashing out in Google docs, then exporting them as PDFs and uploading to GitHub when they are no longer being worked on. GitHub will render PDFs fine and with an organized directory structure, things stay findable.

Some other system might be better, but somebody has to set it up and depending on how complicated it is, it may be no easier for people to use than GitHub. Another thing about the Issues Tracker feature in GitHub is that if you make good use of the tags and milestone features, it is reasonably easy to keep track of what's going on.

I haven't used the other features @debpaul mentioned (wiki and discussion). They would have the advantage of archiving the work automatically. One thing I've observed about the GitHub wiki is that because they haven't been used much in TDWG, people don't think of looking at them. So if the wiki is used, one would want to say prominently on the repository landing page that people should look at the wiki to see the content.

rdmpage commented 1 year ago

There's a bunch of things to unpack, for example:

Perhaps we could tease these apart and offer guidance on each?

For example, I had a good discussion with @mdoering during TDWG 2022 about identifiers.org which provides standardised ways to refer to PIDs independent of particular ways to resolve them.

We should provide a summary of the main contenders for PIDs (pros and cons), especially in terms of what work would be involved in each case (e.g., HTTP URIs are free, but you have to ensure they persist, DOIs use indirection so you can change URLs at will so long as you update DOI, etc.). Maybe cover DOI, Handle, ARK, HTTP URI, and LSID. Give actual examples with pros and cons.
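To make the indirection point concrete, here is a minimal Python sketch of the idea only. It does not represent the DOI or Handle infrastructure: `IndirectResolver` and the `example` URLs are hypothetical, and the registry is just an in-memory dictionary. The identifier stays fixed while the location it points to is updated.

```python
from typing import Dict


class IndirectResolver:
    """Toy registry mapping a persistent identifier to its current URL.

    Hypothetical illustration of indirection; real resolvers add
    governance, authentication, and durable storage.
    """

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, pid: str, url: str) -> None:
        # Registering the same pid again replaces the target URL;
        # the identifier that people cite never changes.
        self._targets[pid] = url

    def resolve(self, pid: str) -> str:
        try:
            return self._targets[pid]
        except KeyError:
            raise LookupError(f"unknown identifier: {pid!r}") from None


if __name__ == "__main__":
    resolver = IndirectResolver()
    doi = "10.3897/rio.8.e97374"
    # Both URLs below are made-up placeholders.
    resolver.register(doi, "https://old-host.example/articles/e97374")
    resolver.register(doi, "https://new-host.example/articles/e97374")
    print(resolver.resolve(doi))
```

A plain HTTP URI has no such layer: the identifier is the location, so persistence depends on whoever controls the domain keeping that exact URL alive.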

Discovery matters, otherwise people aren't likely to use other people's PIDs (which means we don't get any real benefits from PIDs). In many ways this is analogous to geocoding - going from a locality description to (lat,lon) coordinates.

Resolution matters if anyone wants to build something on top of PIDs, it's also the best way to see if a PID actually means anything. If you don't make them resolvable you have no skin in the game, which implies the PID has no value to you (so why would anyone else care?).

Perhaps where this is heading is a (hopefully) short document that sets out all the questions the TDWG community should be asking (basically the four above), coupled with a set of possible answers from which they could make a choice (or at least use to start the decision making process).

ianengelbrecht commented 1 year ago

This is a summary of the discussions held on TDWG TAG Slack channel to date, so we have it here for posterity:

Rob Sanderson [4:18 PM] My position statement, if you will, on identifiers: Just use HTTPS IRIs. All URN based mechanisms have devolved down to HTTP as a protocol for resolution and delivery, but have failed because there has to be someone to look after the resolution service unless it's a core part of the global infrastructure ... e.g. DNS. Some examples:

info URIs: https://en.wikipedia.org/wiki/Info_URI_scheme Floated about 2000, standardized via NISO and opened in 2003, closed in 2010, and now just gone.

PURL: OCLC ran the resolver for a long time, with sometimes extended downtimes, and with some software updates that floundered circa 2010. 2016 OCLC gave up on it and the Internet Archive now supports it.

Handle system: CNRI operated it for ~20 years, and then handed it over to someone else.

All are based on the same principle: have some domain specific rules about how to format a string, and some bespoke service that can interpret that format which accepts requests via HTTP. They're no different to bit.ly, tinyurl, or any other link redirection service. There is only one pseudo-problem that they solve: That the institution which mints the identifier might not want to (or be able to) continue to support the DNS entry for their domain. This happens a lot in academic publishing, and hence the success of DOI, as journals move around, small publishers go out of business or are eaten by larger fish, and so on. But the problem it does not solve is persistence. Persistence is a social and financial problem - the institution must be willing (social) and able (financial) to keep a system running.

[4:22] Which is basically John Kunze's critique, echoed here: https://en.wikipedia.org/wiki/Archival_Resource_Key#History

[4:23] Of course ARK (IMO) falls into the same category, it just explicitly recognizes the weakness through the "promise of stewardship" requirement

Rob Sanderson [4:30 PM] DIDs (https://www.w3.org/TR/did-core/#did-resolution) are very similar with a subtle difference - they carry enough metadata to be verifiable such that immutability is possible, rather than relying on the good will of the resolver to not redirect the user somewhere undesirable. But the infrastructure costs are much higher, making the persistence less likely outside of commercial application. Good for credentials and micro-credentials, not good for profit-less natural history specimens (or indeed anything in the heritage sector, really)

[4:33] (My background is the digital library domain, from whence most of the above came from, and was part of the DID WG in the W3C)

Roderic Page [6:13 PM] It’s tempting to rehash a decade or more’s worth of arguments about identifiers, and I for one would like to avoid doing that. I don’t think the argument is “which one?” it’s more “what are you trying to achieve?” I also don’t buy that persistence is simply a social and financial question because that implies that technical choices don’t matter. Arguably one of the reasons DOIs have worked well for publishers is that (a) they are based on indirection and (b) carry minimal branding, which means there is a degree of resilience baked into the system. So I guess I’d argue that discussing identifiers only makes sense in the context of the broader ecosystem (the “what are you trying to achieve?” question). I’ve a bit of a rant about this here: https://iphylo.blogspot.com/2020/07/persistent-identifiers-demo-and-rant.html

Rob Sanderson [9:56 PM] Definitely agree on the "what are the requirements" starting place.

Deb Paul [11:55 PM] @Roderic Page +1 as we can also see there won't be "just one" identifier anyway.

Tuesday, June 7th

Roderic Page [1:58 PM] Apologies for those interested in GUIDs (or PIDs if you prefer). I’ve not had much chance to make progress on this topic, and I’m away for the next month. I’ve attached the two existing TDWG documents, one on identifier applicability in general, the other focussing on LSIDs specifically. These documents are quite detailed, and as a result quite out of date. At a minimum we could aim to update them as appropriate, but I also think there’s an opportunity to provide some more general guidance about what PIDs can and can’t do, or perhaps more precisely, to emphasise that many of the expected benefits of PIDs do not appear by magic simply by having PIDs. I’d welcome any thoughts on what would be the most useful focus (i.e., what would be the most useful output of these discussions?)

[sic] The applicability statements are available at https://www.tdwg.org/standards/guid-as/

3 replies Last reply 5 months ago Replies to above: Steve Baskauf [5 months ago] I happened to be reviewing issues in the TAG tracker and the issue related to revising the GUID A.S. has some names of people who were interested in participating in a Task Group (at least in 2017). So it might be worthwhile to review that list of names for people who still may be interested.

Steve Baskauf [5 months ago] https://github.com/tdwg/tag/issues/14 (Convene an Identifiers Task Group · Issue #14 · tdwg/tag)

Jonathan A Rees [5 months ago] My opinion: the top priority for such a group is to identify stakeholders (people to whom any work product would make a real difference) and their particular needs and use cases. Without this (a) we'd be burdened with a nagging sense that any effort of this group would (again) be for naught, (b) we'd waste our own time worrying about questions that no one has.

Sunday, October 16th

Roderic Page [11:34 PM] Hi everyone, in part inspired by TDWG 2022, and after some nudging by Roger Hyam, I’ve released a simple LSID resolver https://lsid.io/ It supports the original LSID protocol (which some taxonomic databases still support, such as WoRMS and the World Spider Catalog), and for other databases that serve data using the RDF vocabularies (such as IPNI, Index Fungorum, Index to Organism Names, and ZooBank) I “hard code” the resolution step.

[11:37] To resolve a LSID just append it to https://lsid.io/, e.g. https://lsid.io/urn:lsid:ipni.org:names:77209281-1 Roger was keen for the resolver to behave like doi.org so if possible the resolver sends you to the original site (e.g., IPNI’s page for the taxonomic name with the LSID urn:lsid:ipni.org:names:77209281-1). If you add a “+” to the LSID you get a simple view of the LSID data. For richer data there may be LSID links, and you can navigate through those by clicking.

[11:39] I know LSIDs are deprecated, but in the context of our discussions on GUIDs it baffles me that we overlook the millions of LSIDs we have already minted and which are displayed in taxonomic web sites and in published papers.

Monday, October 17th

David Shorthouse [1:52 AM] Nice work! Instead of a “+” (or in addition to), would this be better served via content negotiation, just like doi.org?

Roderic Page [5:02 PM] @David Shorthouse Problem is the content is HTML in both cases (redirect to source, pretty view of metadata). I use content-negotiation for HTML versus XML. This was all a bit cleaner until Roger Hyam - bless his cotton socks - said he wanted the default behaviour to be redirection to source, whereas initially I always displayed the “pretty” metadata view. The model for “+” is ARK, where if you append “?” you get metadata, “??” you get metadata plus more info.

[5:03] [@David Shorthouse] My resolver is a bit more like http://hdl.handle.net/ where you have the option to go to the thing, or get details on the identifier for the thing.

[5:05] [@David Shorthouse] So arguably most people will either use https://lsid.io/ to go to the source (if I know what that is or can figure it out - I’ve not fixed this for all the LSIDs), or use content-negotiation to get the XML. The + is really for me because I like the pretty metadata.
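The three access paths described in this exchange (redirect to source, the “+” metadata view, and content negotiation for XML) can be sketched in Python as below. This is a rough illustration, not lsid.io's documented API: the helper names are made up, and the `Accept` value `application/xml` is an assumption, since the thread only says "XML". Nothing is fetched; the code only builds the URL and request.

```python
from urllib.request import Request

# Base URL of the resolver, as given in the thread.
RESOLVER_BASE = "https://lsid.io/"


def resolver_url(lsid: str, metadata_view: bool = False) -> str:
    """Append the LSID to the resolver base; add '+' for the metadata view."""
    url = RESOLVER_BASE + lsid
    return url + "+" if metadata_view else url


def xml_request(lsid: str) -> Request:
    """Build (but do not send) a request that asks for XML via the Accept header.

    The media type is an assumption; the resolver may expect a different one.
    """
    return Request(resolver_url(lsid), headers={"Accept": "application/xml"})


if __name__ == "__main__":
    lsid = "urn:lsid:ipni.org:names:77209281-1"
    print(resolver_url(lsid))                      # default: redirect to source
    print(resolver_url(lsid, metadata_view=True))  # "+" form: metadata view
    print(xml_request(lsid).get_header("Accept"))
```

Passing the `Request` to `urllib.request.urlopen` would perform the actual lookup; that step is left out here so the sketch runs without network access.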

ianengelbrecht commented 1 year ago

The meeting document and brief notes from the meeting held on 21 March 2023 are available here.

marc-portier commented 1 year ago

Reading up on the (not) reached conclusions in the shared minutes ...

I am looking forward to how the TAG is going to propose any next steps in this story.

My take:

The current immobility in this is not helping anybody, and only leads to unresolvable discussions and nonnegotiable positions of well intended people trying to be "right". While some "not entirely wrong but at least helpful" could end up being a practical version of "right enough"?

To overcome this we could accept the reality of where lsid and lsid-as have landed over time, and stop trying to change or fight that, but instead go into transparent "legacy management" and keep something of a managed list of valid criticisms, and for those (where possible) try to gently advise (or simply document) how people are practically dealing with them?

Just 3 off the bat examples to get us started:

In fact dealing with this in "legacy management mode" might be the thing that liberates the TAG to formulate some alternative that can be "right" and self-motivating towards practically replacing what we have now?

rdmpage commented 1 year ago

Given "strong community memories in relation to the failed life sciences identifier (LSID) scheme" ([A choice of persistent identifier schemes for the Distributed System of Scientific Collections (DiSSCo)](https://doi.org/10.3897/rio.7.e67379)), yes, legacy mode makes sense.

URN registration doesn't seem to affect anyone. Apart from NBNs I hardly see URNs being actively used (although there are a few made up ones in GBIF).

Fixing R29 and R31 seems irrelevant if LSIDs are being actively deprecated.

This just leaves maintaining a resolver. We need to be clear about the scope of this. Does it only support destinations that still have (semi-)functioning LSID support, or does it offer resolution even if LSID support no longer exists (which is what https://lsid.io does)? Is TDWG going to commit to support this (and what would that look like)?

Perhaps do the following:

baskaufs commented 12 months ago

Here is a technical note on why the GUID and Life Sciences Identifiers Applicability Statements page includes both of the applicability statements.

For historical reasons unknown to me, it was decided to ratify both applicability statements as two documents that were part of a single standard. I think it is because it was originally envisioned that there would be many applicability statements (one for each technology: HTTP IRIs, DOI, etc.) to go with the umbrella GUID AS. That did not happen.

After adoption of the Standards Documentation Specification, it was implemented using this model for relationships among standards components and with the IRI design patterns listed on that page. The SDS decreed that each standard must have a landing page to which the "permanent URLs" for the standards dereferenced. It was decided that the standards pages on the TDWG website would be the landing pages (vs. a GitHub repo README). So if you dereference http://www.tdwg.org/standards/150 it will take you to the standards page on the TDWG website that we are talking about.

The SDS says that a standards landing page must clearly state the parts of a standard. In this case, there are two parts: the two AS documents. They have been assigned permanent IRIs of http://rs.tdwg.org/guid/doc/guidas/ and http://rs.tdwg.org/guid/doc/lsidas/, which dereference to the actual PDFs in GitHub.

So with the current status of the standard, it's not possible to mess with the structure of that page because it's required by the SDS to contain the current information about what's included in the standard. Administratively, we could create a new standard with a new permanent URL and then move one or the other AS documents to that new standard. The question is: which one would be most disruptive to move? If people have been citing the permanent URL of the standard (as they should be), then one or the other of the AS documents would not be found via the standard landing page (although one could put a note there saying that it has been moved).

It seems to me that if the GUID Task Group gets off the ground, it should create a new standard with a new permanent URL and maybe a different name. It would then have its own standard landing page separate from the page for http://www.tdwg.org/standards/150 (the one that lists the LSID AS). The old GUID AS could then be deprecated and removed from the list of docs at http://www.tdwg.org/standards/150 with a note on that page saying that the old GUID AS has been replaced by the new doc (whatever it's called). The header section of the new doc would have an entry with a link to the replaced GUID AS so that it would be easy to find the previous version.

marc-portier commented 12 months ago

thx for these responses @rdmpage and @baskaufs, does feel a bit like the gist of my suggestion got lost in translation?

With "legacy mode" I am hinting at slightly more than a formal deprecation and replacement, I also see the need for some acceptance of the legacy that has been created, and making sense and minimizing cost of that towards those that invested into it?

From that angle, I should maybe rephrase my questions / take up some responses / make myself more clear:

1) not doing the URN registration...

ok, but should be considered against the cost (or chance) of "What if somebody else hijacks the lsid urn at IANA to start meaning something different" -- One could argue that it would not only hinder uses that are slow in switching away from a deprecated standard, but also damage the trust of any future proclaimed "persistent identifier" coming from tdwg?

also: the fact that "they are no longer / not often used" is ignoring the fact that these lsid recommendations have introduced some legacy use. And imho "legacy mode" is about dealing with that fall-out elegantly and responsibly?

2) R29 and R31 similarly....

This has been out there and in use for some time... how to try and deal with that?

So yeah, given the deprecation status it does seem logical to guide people towards not using either anymore. In a transition scenario, though, one could argue that redoing R29 to use the proxy form for subjects too makes sense, to let them keep working with all external refs that used R31 (even if the proxy form is not a canonical form).

3) and about maintaining the resolver too...

That would translate to: sure, replace the homepage, mention deprecation, better alternatives and "legacy mode guides"

but combine that with some statement that the persistence of the proxy form links is guaranteed (owning your legacy when it is about persistence is important imho)

rdmpage commented 11 months ago

Just for fun, elsewhere on TDWG there is a discussion about the DwC field scientificNameID https://github.com/gbif/pipelines/issues/217 and it's full of LSIDs ;) Yet here we are discussing sunsetting LSIDs...

mdoering commented 11 months ago

> Just for fun, elsewhere on TDWG there is a discussion about the DwC field scientificNameID gbif/pipelines#217 and it's full of LSIDs ;) Yet here we are discussing sunsetting LSIDs...

yes, but no one is actually resolving them. They are just unique strings defined in some datasets elsewhere for resolution. In fact the "real" identifier first has to be extracted from the LSID URN before it can be looked up. Simple CURIEs would have done the job too.
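As a rough illustration of that extraction step, here is a minimal Python sketch. It assumes the usual `urn:lsid:authority:namespace:object[:revision]` layout; `parse_lsid` and `to_curie` are hypothetical helpers, and the `ipni` CURIE prefix is a made-up choice rather than a registered one.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID URN into its colon-separated components."""
    parts = text.strip().split(":")
    # Expect "urn", "lsid", authority, namespace, object, and optionally a revision.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


def to_curie(lsid: Lsid, prefix: str) -> str:
    """Render the local object identifier as prefix:id (prefix chosen by the caller)."""
    return f"{prefix}:{lsid.object_id}"


if __name__ == "__main__":
    parsed = parse_lsid("urn:lsid:ipni.org:names:77209281-1")
    print(parsed.object_id)          # the local identifier a database lookup would use
    print(to_curie(parsed, "ipni"))  # hypothetical CURIE prefix
```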

rdmpage commented 11 months ago

> yes, but no one is actually resolving them. They are just unique strings defined in some datasets elsewhere for resolution. In fact the "real" identifier first has to be extracted from the LSID URN before it can be looked up. Simple CURIEs would have done the job too.

But if they were resolvable then we wouldn't need to "extract" an identifier, we'd have the identifier already (the LSID). Plus we'd have the ability to check that it was correct (does it resolve to the thing we're talking about?) and potentially learn more (e.g., if the LSID was linked to other information).

Of course, we blew this opportunity by having an identifier that is tricky to implement properly, and we created a cargo cult where it is OK to make up things that look like identifiers ("urn:xxx") but which have none of their properties (e.g., resolvability, machine readability, etc.).

Insert something here about "spilt milk", etc. ...