tdwg / infrastructure

TDWG infrastructure

Decide about fate of lsid.tdwg.org #60

Closed mdoering closed 7 years ago

mdoering commented 8 years ago

TDWG runs a currently non-functional LSID resolver and has an active virtual host definition for http://lsid.tdwg.org: https://github.com/tdwg/infrastructure/blob/master/vhosts/lsid.tdwg.org.conf

Can we simply remove that virtual host definition?

sblum commented 8 years ago

Unfortunately, I don't have the skill to resurrect the resolver. To the larger point: "what about LSIDs?" I would like to propose that TDWG's commitment to and support of LSIDs should be phased out. LSIDs are not the way forward, and we should clearly state this recommendation and signal that whatever support we are providing will be ending by some date.

peterdesmet commented 8 years ago

Probably good to escalate this decision to @tdwg/exec as a two part decision:

Stan, can you add this to the agenda?

gkampmeier commented 8 years ago

Note that the second paragraph under "Basic Standards Recommendations" on tdwg.org states:

The TDWG community's priority is the deployment of Life Science Identifiers (LSID), the preferred Globally Unique Identifier technology and transitioning to RDF encoded metadata as defined by a set of simple vocabularies. All new projects should address the need for tagging their data with LSIDs and consider the use or development of appropriate vocabularies.

If we are going to abandon LSID technology, we will need to devote some thought into stating why, and point people in another direction. I was using this service :(

dkoureas commented 8 years ago

> If we are going to abandon LSID technology, we will need to devote some thought into stating why, and point people in another direction. I was using this service :(

DOIs?

mdoering commented 8 years ago

I have archived the current LSID resolver setup from OWL in this repo: 8d2bfef

I would like to shut down http://lsid.tdwg.org as soon as possible to decommission the owl server, see #66

MattBlissett commented 8 years ago

Continuing from #67 ...

because for a key class of data (taxonomic names) just about everybody uses them to serve RDF in a consistent format

Do they? The two example LSIDs on your tester don't resolve. I think we (Markus or I) should find the logs for the TDWG resolver and see if anyone was actually using it, and not just for curiosity's sake, before anyone puts any effort into resurrecting the resolver.

Broken LSIDs

Working LSIDs

mdoering commented 8 years ago

I was under the assumption that the TDWG LSID resolver had not worked for years. But yes, let's check the logs.

mdoering commented 8 years ago

The current installation is archived here: https://github.com/tdwg/infrastructure/tree/master/lsid.tdwg.org It is using IBM Perl code from 2008, probably this: http://cpansearch.perl.org/src/EKAWAS/lsid-perl-1.1.7/

rdmpage commented 8 years ago

I could provide a working PHP (don't laugh) resolver if needed. One option is to set up a server "in the cloud" and have lsid.tdwg.org point to that. I guess it depends on whether TDWG wants to be responsible for maintaining the service, or just wants there to be a service for those luddites like myself who nurse a lingering affection for LSIDs.

mdoering commented 8 years ago

Rod, did you use lsid.tdwg.org or www.lsid.info? Would be nice to not have a dedicated domain for this at least

rdmpage commented 8 years ago

@mdoering Not sure what you mean by "use". I've not had any involvement in running the TDWG LSID server.

nickynicolson commented 8 years ago

If the TDWG LSID resolver is resurrected, please have it use the HTTP protocol correctly. We disabled access to IPNI resources from this source because, rather than issuing an HTTP redirect (so that we see the true originator of the request), the resolver grabbed the data and returned it to the requester with an HTTP status 200. This means that we were never able to see who was actually using the data; we just got a slew of requests from lsid.tdwg.org. I wrote an outline of this problem and a suggested fix before we had to disable access. It will be in the TDWG TAG mailing list archives for April 2009.

timrobertson100 commented 8 years ago

I would suggest TDWG do not continue to promote LSIDs, nor maintain a resolver.

As an identifier resolution mechanism it simply did not catch on, and the services of the few early adopters are largely broken now (see all the examples above). Given this position, I don't see how the organisation can continue to promote and recommend it. That the resolver can go offline for days without anyone noticing highlights that it is not really used. Looking through Google search results for LSID is pretty indicative of how little current movement there is on LSIDs.

I would recommend that lsid.tdwg.org redirect to a page explaining this position, and if someone wished to maintain a resolver, TDWG could link to that site on the page.

I don't think TDWG should take a half way position on this and have lsid.tdwg.org still operate, by some willing person (and it is kind of you @rdmpage to offer) as that implies TDWG support.

I think @peterdesmet is also correct that this needs exec discussion and decision @csparr.

If TDWG do decide to maintain it, we should deploy something we understand and have code for in GitHub. @rdmpage's offer of code is probably more sensible than the old Perl version.

rdmpage commented 8 years ago

I agree that LSIDs are moribund. My worry is that we have nothing to replace them with, in the sense of discoverable, resolvable identifiers that return machine readable data. Obviously there are other ways to achieve this (DOIs, HTTP URIs with content negotiation, etc.) but once we kill LSIDs I suspect we will struggle to go through the whole process of providing the same functionality (assuming that discoverable, resolvable identifiers that return machine readable data are something we want). That @nickynicolson mentions IPNI had problems with the service years ago, and that these were not addressed at the time is a tad disappointing.

I think it would be nice if TDWG supported a functioning LSID resolver, perhaps accompanied by a message saying LSIDs are deprecated and the service is provided as a convenience only. In any event it would be sensible for TDWG to make a decision one way or another.

MattBlissett commented 8 years ago

I've had a look at the oldest logfile on the server, which is from the first week of February.

The vast majority of hits to the server are search engine crawlers, which I have excluded. 330 potentially genuine LSID lookups remain.

Of the 330 queries, only 71 are on domains that have a resolver (Ohio State + Marine Species). I haven't tested if the 71 LSIDs actually return anything.
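A filter of this kind could be sketched roughly as follows. This is an illustration only, not the script that was actually run: it assumes Apache "combined" log lines, requests of the form `/urn:lsid:...`, and a hypothetical list of crawler markers. The sample log lines are invented for the example.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Hypothetical substrings that mark a crawler's User-Agent; extend as needed.
CRAWLER_MARKERS = ("bot", "crawl", "spider", "slurp")

# Matches the request path and the final quoted field (the User-Agent)
# of an Apache "combined" log line.
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*".*"(?P<agent>[^"]*)"$')
LSID_RE = re.compile(r"urn:lsid:(?P<authority>[^:/?#]+):", re.IGNORECASE)


def count_lsid_lookups(lines):
    """Count non-crawler LSID lookups per authority domain."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue
        agent = m.group("agent").lower()
        if any(marker in agent for marker in CRAWLER_MARKERS):
            continue  # skip search engine crawlers
        lsid = LSID_RE.search(unquote(m.group("path")))
        if lsid:
            counts[lsid.group("authority").lower()] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [05/Feb/2016:10:00:00 +0000] "GET /urn:lsid:marinespecies.org:taxname:127160 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.2 - - [05/Feb/2016:10:00:01 +0000] "GET /urn:lsid:ipni.org:names:783030-1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.3 - - [05/Feb/2016:10:00:02 +0000] "GET /favicon.ico HTTP/1.1" 404 0 "-" "curl/7.47.0"',
]

if __name__ == "__main__":
    for authority, n in count_lsid_lookups(SAMPLE).most_common():
        print(authority, n)
```

Grouping by authority makes it easy to compare the lookups against the list of domains that still run a resolver.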

If anyone is still using the LSID resolution mechanism, Rod's tool will help them debug the process — that's more useful for them than what TDWG was providing anyway. An explanatory page on lsid.tdwg.org and a link to Rod's site seem like a good option.

rdmpage commented 8 years ago

That would certainly be one solution that avoids a 404 and is human readable. But it also means LSIDs are dead, and I guess I agree with @timrobertson100 and @peterdesmet that TDWG should probably make an official decision to either support or sunset LSIDs.

stanblum commented 8 years ago

Before we decide, we need to inquire with publishers (e.g., Pensoft, Magnolia) and Zoobank about their use of LSIDs. If they're printed, we may need to provide resolution, no?

ckmillerjr commented 8 years ago

My thoughts: 1) TDWG is a standards body, not a services provider. 2) LSID was recommended by an Applicability Statement from a working group. Another working group should be organized to update the applicability statement, which is a current TDWG standard. 3) GBIF is a services provider but so far uninvolved in ID resolving services.

Chuck

stanblum commented 8 years ago

This recent workshop on getting traction with technology for identifiers is very relevant to our issue of replacing LSIDs. (Some TDWG people were there, but not from Exec.) https://github.com/identifier-services/phoibos2/wiki Practical Hacking On Identifiers at BiOSphere2 (PHOIBOS2), Feb. 17-19, 2016, Oracle, Arizona, USA

CynthiaParr-USDA commented 8 years ago

This is no longer a blocker. We have decided to shut down the server with the resolver on it, and will be distributing messages related to that on the website and via TDWG-content mailing list.

CynthiaParr-USDA commented 8 years ago

Messaging now a task in #69

@rdmpage I don't think we want to be in the business of maintaining an LSID resolver. We haven't declared LSIDs dead, we'll actually want to turn attention to that applicability statement now. It certainly seems likely. I think we've come to agree with @timrobertson100 's idea that we redirect to a page with a message. If there is an alternate resolver someone else is willing to maintain that would be great and we could refer users there.

I am wondering about the reference to the dead sourceforge site referred to by https://www.lsid.info

CynthiaParr-USDA commented 8 years ago
ckmillerjr commented 8 years ago

The lsid.info domain belongs to TDWG with Stan as the contact.

The DNS servers listed for the lsid.info domain are: mailserver.nhm.ac.uk and ns.tdwg.gbif.org

ns.tdwg.gbif.org doesn't ping.

Chuck

mdoering commented 8 years ago

Chuck, the DNS is now managed in Amazon, not GBIF, and we have not moved the lsid.info domain, see #68. If we want to keep the domain, let us know where it should point (e.g. http://bitbucket.matt.blissett.me.uk/www.lsid.info/)

MattBlissett commented 8 years ago

Let's not point it to my Raspberry Pi ;-). It can go on the TDWG server.

I've added an additional DNS zone to our Amazon Route 53 for lsid.info. @stanblum, could you change the delegation at the registrar?

ns-1130.awsdns-13.org.
ns-541.awsdns-03.net.
ns-233.awsdns-29.com.
ns-1827.awsdns-36.co.uk.

stanblum commented 8 years ago

@MattBlissett, I understood that to mean I should use the DNS entries you listed there, so (copied from what is now entered with our registrar, Network Solutions):

lsid.info currently points to Domain Name Server (DNS):
ns-1130.awsdns-13.org
ns-541.awsdns-03.net
ns-233.awsdns-29.com
ns-1827.awsdns-36.co.uk

I note, these are different from the entries now used for tdwg.org

tdwg.org currently points to Domain Name Server (DNS):
NS-1521.AWSDNS-62.ORG
NS-2002.AWSDNS-58.CO.UK
NS-112.AWSDNS-14.COM
NS-954.AWSDNS-55.NET

If that's correct, we're done.

MattBlissett commented 8 years ago

That's exactly right, Stan. Amazon give different name servers for different domains. It's resolving; all that's needed now is for either Markus or me to make an Apache virtual host for lsid.info, and either copy over the single page from the old site, or redirect to the page which #69 should provide.

MattBlissett commented 8 years ago

I have restored the homepage of http://www.lsid.info/. It's the page as it was some time ago, but with the broken links and broken "news" section commented out.

stanblum commented 8 years ago

Thanks, Matt. I think that takes care of the lsid.info portion of this.

TDWG still needs to decide what to do about the resolver (lsid.tdwg.org): planning and implementing a more graceful deprecation of LSIDs. That's a big issue that will spawn several tasks.

CynthiaParr-USDA commented 7 years ago

FYI I just ran into LSIDs in the wild (a Pensoft publication) and, curious, I tried to follow the link to see what would happen.
http://jhr.pensoft.net/articles.php?id=4989
[Screenshot, 2016-08-04]

We still need to finish dealing with this.

rdmpage commented 7 years ago

And so we're back to should TDWG support a LSID resolver...

I think it sends a bad message to simply abandon resolving LSIDs, especially given that people are still using them.

The old TDWG resolver used some Perl scripts. I have working PHP code that could be used to provide a working resolver. We could be cleverer and add a simple triple store backend to cache the resolved metadata (if we'd done this at the start we'd have a triple store by now with a big chunk of fundamental nomenclatural data). Maybe an elegant solution would be to "dockerise" a service.

Hosting is an issue. I could run a resolver on a hosting platform but don't necessarily want to fund that long term.

mdoering commented 7 years ago

Hosting on the TDWG machine at GBIF should be fine if we have working code that is easy to deploy. We could run it next to the TYPO3 website if it's also PHP. That's very simple, but a Docker container would be nicer in the long term.

rdmpage commented 7 years ago

Note that LSIDs also occur in PubMed Central content, e.g. urn:lsid:zoobank.org:act:469B5A6C-6773-4E35-8F0D-5CF4EE85D658 appears in http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3677382/ (http://dx.doi.org/10.3897/zookeys.280.3906 ). TDWG saying "we won't support LSIDs" is a little like CrossRef saying "oh well, DOIs were fun, but we've decided not to support them anymore...". OK, I exaggerate a little, but to promote an approach, see it adopted, then casually abandon it speaks volumes about our commitment to identifiers for the things we care about.

[Screenshot, 2016-08-08]

rdmpage commented 7 years ago

OK, I've put together a simple LSID resolver that attempts to resolve an LSID and display the resulting RDF in JSON-LD, n-triples, and XML. Live demo is here: http://bionames.org/~rpage/lsid-resolver-php-o

I need to add support for content negotiation so it can be used as an API to resolve LSIDs, and clean up the code a bit. If this is useful I could look at packaging it up and making it available to be deployed if, say, TDWG wants to support LSIDs (whether it continues to promote them, or deprecates them but provides support for the protocol as long as projects still use them).
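Content negotiation for the three serializations could look something like the sketch below. It is written in Python for illustration and is not the PHP resolver's code. `choose_format` and the default of RDF/XML are assumptions; the media types are the registered ones for RDF/XML, JSON-LD and N-Triples. Falling back to the default when nothing matches (rather than answering 406) is a design choice made here for leniency.

```python
# Media types for the three serializations the resolver displays.
SUPPORTED = ("application/rdf+xml", "application/ld+json", "application/n-triples")


def choose_format(accept_header, default="application/rdf+xml"):
    """Pick the supported media type with the highest q-value in an Accept header."""
    best_type, best_q = None, 0.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # q defaults to 1 when the client gives no weight
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        # Treat wildcards as a request for the default serialization.
        if media_type in ("*/*", "application/*"):
            media_type = default
        if media_type in SUPPORTED and q > best_q:
            best_type, best_q = media_type, q
    # Nothing acceptable was listed: fall back to the default.
    return best_type or default


if __name__ == "__main__":
    print(choose_format("text/html, application/n-triples;q=0.8"))
```

A client that sends `Accept: application/ld+json` would get JSON-LD, while a browser sending `text/html, */*;q=0.8` would get the default.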

As an aside, it looks like several taxonomic name projects fully support LSIDs (e.g., those based on the "species file" platform), and others have broken the DNS-lookup part of the protocol but still serve RDF, so we can mimic LSID resolution with a few minor hacks.

gkampmeier commented 7 years ago

We have LSIDs that we had used TDWG's resolver for. I've chosen a couple but it seems to get stuck at "Resolving..." Does this mean that it is broken or in need of a further hack to work or is it that the URL is not updating when I hit go (although replacing the URL with the correct urn:lsid didn't help)? Am using Chrome on a Mac 10.10.5.

Thank you for working to support legacy content!

rdmpage commented 7 years ago

Hi @gkampmeier, can you post the LSIDs here and I'll take a look and figure out what the problem is?

gkampmeier commented 7 years ago

urn:lsid:taxonomy.org.au:TherevidaeMandala:MEI024058

Should return Neodialineura trichidion Winterton, 2009

AUSTRALIA, Western Australia, Leeuwin Naturalist National Park, Cape Naturaliste, Yallingup, Yallingup Caves, [-33.65, 115.033], 23.XII.1979, hand netted, R. M. Bohart

rdmpage commented 7 years ago

@gkampmeier I've tweaked the code to show error messages a bit more clearly. The LSID urn:lsid:taxonomy.org.au:TherevidaeMandala:MEI024058 fails because the resolver expects the domain taxonomy.org.au to have a SRV record telling it where to find the location of the LSID server for that domain.
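The first step of that lookup can be sketched as below. This is a Python illustration, not the resolver's own code. `parse_lsid` and `srv_name` are hypothetical helper names, and the `_lsid._tcp` label follows my understanding of the LSID resolution specification. The DNS query itself needs a DNS library and is not shown.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # The authority is a DNS name, so it is compared case-insensitively.
    return Lsid(parts[2].lower(), parts[3], parts[4], revision)


def srv_name(lsid):
    """DNS name whose SRV record should point at the authority's LSID server."""
    return f"_lsid._tcp.{lsid.authority}"


if __name__ == "__main__":
    lsid = parse_lsid("urn:lsid:taxonomy.org.au:TherevidaeMandala:MEI024058")
    print(srv_name(lsid))
```

If the SRV query for that name returns nothing, resolution stops there, which is what happens for taxonomy.org.au.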

It looks like taxonomy.org.au is no longer active (http://taxonomy.org.au resolves to a site full of ads), so any LSID associated with that domain is likely dead :( This means that the LSIDs associated with http://wwx.inhs.illinois.edu/research/mandala/therevidwebmandala/ and the paper "Revision of the stiletto fly genus Neodialineura Mann (Diptera: Therevidae): an empirical example of cybertaxonomy" http://www.mapress.com/zootaxa/2009/f/zt02157p033.pdf are broken (sigh).

gkampmeier commented 7 years ago

Sigh is right. I'll pass on the news to Shaun Winterton. Thanks!

rdmpage commented 7 years ago

More examples of LSIDs in the published literature:

urn:lsid:zoobank.org:act:540D306B-7EAC-4AD9-B072-842AC26F91F7 http://dx.doi.org/10.1371/journal.pone.0152454

urn:lsid:zoobank.org:pub:CBAD704B-64F6-421B-BC71-74DF4620DB4E http://africaninvertebrates.org/ojs/index.php/AI/article/view/395

urn:lsid:zoobank.org:pub:FCD51F6F-A5D6-4466-ADB6-0ADC9F560F66 http://dx.doi.org/10.5852/ejt.2014.105

urn:lsid:zoobank.org:act:CB2FA2C6-0F86-4677-964A-AAF85C6D960A http://verlag.nhm-wien.ac.at/pdfs/117A_095100_Neubauer.pdf

urn:lsid:catalogueoflife.org:d782a602-29c1-102b-9a4a-00304854f820:col2012acv16 http://dx.doi.org/10.1186/2041-1480-5-40

MattBlissett commented 7 years ago

Another option is a resolver done entirely in Javascript: http://codepen.io/MattBlissett/pen/dXagbB

rdmpage commented 7 years ago

Now that's just showing off ;) Very cool, how stable are https://crossorigin.me and http://dig.jsondns.org likely to be...?

mdoering commented 7 years ago

Here is the TAG message from Nicky in 2009 about using redirects: http://lists.tdwg.org/pipermail/tdwg-tag/2009-April/000358.html

I am a bit lost as to whether there is a consensus to run a resolver at TDWG now and what that resolver should do. Personally I think it would be good to host one and be responsible for what was done only some years ago. If many LSIDs are not resolving because of broken domains and services, TDWG at least tried its best. That the lsid.tdwg.org resolver's HTTP URL was used in published LSIDs instead of the pure LSID URN is unfortunate, but also shows that people prefer clickable and well-known links.

rdmpage commented 7 years ago

Here's the comment from @nickynicolson formatted in a readable way:

Hi,

Further to my last design question re LSID HTTP proxies (thanks for the responses), I wanted to raise the issue of HTTP LSID proxies and crawlers, in particular the crawl delay part of the robots exclusion protocol.

I'll outline a situation we had recently:

The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in the pages. These are presented in their proxied form using the TDWG LSID resolver (eg http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1). Using the TDWG resolver to access the data for an IPNI LSID does not issue any kind of HTTP redirect, instead the web resolver uses the LSID resolution steps to get the data and presents it in its own response (ie returning a HTTP 200 OK response).

The problem happens when one of these sites that includes proxied IPNI LSIDs is crawled by a search engine. The proxied links appear to belong to tdwg.org, so whatever crawl delay is agreed between TDWG and the crawler in question is used. The crawler has no knowledge that behind the scenes the TDWG resolver is hitting ipni.org. We (ipni.org) have agreed our own crawl limits with Google and the other major search engines using directives in robots.txt and directly agreed limits with Google (who don't use the robots.txt directly).

On a couple of occasions in the past we have had to deny access to the TDWG LSID resolver as it has been responsible for far more traffic than we can support (up to 10 times the crawl limits we have agreed with search engine bots) - this due to the pages on the GBIF portal and / or zipcodezoo being crawled by a search engine, which in turn triggers a high volume of requests from TDWG to IPNI. The crawler itself has no knowledge that it is in effect accessing data held at ipni.org rather than tdwg.org as the HTTP response is HTTP 200.

One of Rod's emails recently mentioned that we need a resolver to act like a tinyurl or bit.ly. I have pasted below the HTTP headers for an HTTP request to the TDWG LSID resolver, and to tinyurl / bit.ly. To the end user it looks as though tdwg.org is the true location of the LSID resource, whereas tinyurl and bit.ly both just redirect traffic.

I'm just posting this for discussion really - if we are to mandate use of a web based HTTP resolver/proxies, it should really issue 30* redirects so that established crawl delays between producer and consumer will be used. The alternative would be for the HTTP resolver to read and process the directives in robots.txt, but this would be difficult to implement as it is not in itself a crawler, just a gateway.

I'm sure that if proxied forms of LSIDs become more prevalent this problem will become more widespread, so now - with the on-going attempt to define what services a GUID resolver should provide - might be a good time to plan how to fix this.

cheers, Nicky

rdmpage commented 7 years ago

If I read @nickynicolson correctly the issue is volume of traffic caused by LSID resolution. One solution is to cache the result (for example, cache the RDF as a file, some other persistent storage, or put in a triple store). The other is to respect the Crawl-delay: directive in some way, although that may hamper the use of LSIDs. For example, http://ipni.org/robots.txt looks like this:

User-agent: *
Crawl-delay: 30

User-agent: Slurp
Crawl-delay: 30

User-agent: Teoma
Crawl-delay: 30

User-agent: msnbot
Crawl-delay: 30

User-agent: Gigabot
Crawl-delay: 30

User-agent: ConveraCrawler
Crawl-delay: 30

Respecting this limit (one visit every 30 seconds) would render LSIDs barely usable. Perhaps some increased (but not excessive) usage would work (combined with caching).
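The two ideas (read the Crawl-delay, and cache so that repeat lookups cost the provider nothing) can be combined. Below is a minimal Python sketch under stated assumptions: `ThrottledCache` and its `fetch` callback are hypothetical, the cache is a plain in-memory dict rather than a file or triple store, and the robots.txt text is an excerpt of the one quoted above. `urllib.robotparser` is part of the Python standard library.

```python
import time
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30

User-agent: msnbot
Crawl-delay: 30
"""


def crawl_delay(robots_txt, user_agent):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


class ThrottledCache:
    """Cache fetched metadata and space out requests to the upstream server."""

    def __init__(self, fetch, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch              # hypothetical callable: lsid -> RDF text
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_fetch = None

    def get(self, lsid):
        if lsid in self._cache:
            return self._cache[lsid]     # served locally, no upstream traffic
        if self._last_fetch is not None:
            wait = self._min_interval - (self._clock() - self._last_fetch)
            if wait > 0:
                self._sleep(wait)        # honour the minimum interval
        self._cache[lsid] = self._fetch(lsid)
        self._last_fetch = self._clock()
        return self._cache[lsid]
```

A shorter `min_interval` than the published Crawl-delay would be the "increased but not excessive" usage suggested above; only cache misses ever reach the provider.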

rdmpage commented 7 years ago

Re @mdoering comment on TDWG views on LSIDs, I'm happy to contribute time to set a resolver up.

MattBlissett commented 7 years ago

If we do provide a resolver, I think it should do the minimum necessary: look up the LSID's authority, query to find the server, and send an HTTP redirect to the data.

A redirect makes selectively blocking problem users possible, which I think is useful for IPNI, and a web spider would query for /robots.txt before following the redirect.
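The last step, answering with a redirect instead of proxying, might look like this in Python. It is a sketch only: `redirect_response` is a hypothetical helper, `lsid.example.org` is a placeholder host, and the `/authority/metadata/?lsid=` path is an assumed URL layout. A real resolver would take the host and port from the DNS lookup and the endpoint from what the authority advertises.

```python
from urllib.parse import urlencode


def redirect_response(lsid, host, port=80):
    """Build a 302 response pointing the client at the authority's metadata URL."""
    # Assumed URL layout; a real authority advertises its own endpoints.
    netloc = host if port == 80 else f"{host}:{port}"
    location = f"http://{netloc}/authority/metadata/?{urlencode({'lsid': lsid})}"
    return 302, {"Location": location}


if __name__ == "__main__":
    status, headers = redirect_response(
        "urn:lsid:ipni.org:names:783030-1", "lsid.example.org", 8080
    )
    print(status, headers["Location"])
```

Because the client follows the `Location` header itself, the data provider sees the client's own address and user agent rather than the resolver's.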

mdoering commented 7 years ago

A redirect would also allow IPNI and others to know who is issuing the original request. As I understood @nickynicolson, this is even more important than the throttling. I can definitely see this being very useful information.

Having not dealt with LSIDs for a while, the redirect goes to the metadata, not data, right? I recall there ain't no data in our domain apart from images ;)

rdmpage commented 7 years ago

The down side is that there's no web-friendly interface that displays the RDF. The failure to do this is, I think, one reason LSIDs failed. Outside the small group who follow these sorts of discussions, hardly anybody knows what RDF is, or what to do with it. I've always thought that if we had a richer client, for example one that could "learn" from the metadata it is resolving and add some value, we wouldn't have ended up with orphaned identifiers in the first place.

rdmpage commented 7 years ago

Perhaps I'm conflating two issues here: (1) the need to support LSIDs as they are in the wild, and (2) understanding why LSIDs failed to take off. But I'd argue that if we fail to learn the lessons then we'll end up repeating this whole mess. DOIs started at about the same time as LSIDs; one is now ubiquitous, one is moribund. Why? The answer isn't just that there was money behind DOIs...but I digress.