inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Record persistent identifiers #981

Open kaplun opened 8 years ago

kaplun commented 8 years ago

Problem

As part of the migration to Invenio3 (#816), and in parallel with the open discussion on new URLs to reach records (#937), I'd like to open up the discussion on which official ID we are going to expose for our records, given the ongoing work on JSONRef (#958) and relations.

So far in Invenio we have always advertised the recid, i.e. a simple integer that is unique across all records (be they a Paper, a Person, a Job...).

When two records are merged, one record is declared as deleted, and accessing its recid by visiting http://inspirehep.net/record/123 will result in an HTTP redirect to the final merged record.

So far:

However, these IDs can only be searched for; they are not actively used to directly access a given record, and they are not advertised as official canonical IDs (e.g. following the detailed record link in search results brings you to http://inspirehep.net/record/123).

Proposal

A persistent ID is defined for each type of record in INSPIRE, e.g.:

The new official URLs would become:

A recid would still be issued for all these document types and such record ID would still be usable to access the given record as in http://inspirehep.net/record/12345.

Additionally all the above mentioned different IDs are different enough to allow us to uniquely address a record via directly using them (e.g. accessing a person via ORCID or BAI or INSPIRE ID).

References across records would be made using the proposed canonical URLs. This would imply that, in the situation where a referred record is merged into another, the referring one doesn't need to be updated, since the reference would still transparently resolve.
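
To illustrate the point about merges, here is a minimal sketch of how a reference could keep resolving after the referred record is merged. It assumes a hypothetical in-memory store in which a merged record carries a `merged_into` pointer; `RECORDS`, `merged_into` and `resolve` are illustrative names, not INSPIRE or Invenio APIs.

```python
# Hypothetical in-memory store: recid -> record. Not an Invenio API.
RECORDS = {
    123: {"deleted": True, "merged_into": 456},
    456: {"deleted": True, "merged_into": 789},
    789: {"title": "Final merged record"},
}


def resolve(recid, records=RECORDS):
    """Follow merge pointers until a record that was not merged is reached."""
    seen = set()
    while True:
        if recid in seen:
            raise ValueError("merge cycle detected at recid %d" % recid)
        seen.add(recid)
        record = records[recid]  # raises KeyError if the recid was never issued
        target = record.get("merged_into")
        if target is None:
            return recid
        recid = target
```

With this kind of indirection, a record that refers to 123 never has to be rewritten: the resolver (or an HTTP redirect playing the same role) lands on 789.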

kaplun commented 8 years ago

cc: @inspirehep/inspire-dir

ksachs commented 8 years ago

From the experience of texkeys: don't rely on metadata unless it gives an advantage. There are records where we don't have a year, the year given during harvesting might be wrong and will be changed, ... all sorts of problems. And using date-added might be misleading.

For an all-numeric identifier I don't see a reason to include the year. For texkeys many people remember: a paper by Abbott from 2016

I'm not sure I want to keep CNUMs for linking conferences. It happens too often that a conference is shifted after announcement. Currently we change the CNUM (and corresponding HEP records). So it is only really stable after the event.

Only for Jobs the year makes sense to me. And here I would take date-added (i.e. announced).

Example person: (s)he changes names (due to marriage or whatever) and has to keep the old identifier?

Bottom line: for an identifier whose main purpose is to guarantee stable links, I prefer a random number or alphanumeric ID.

aw-bib commented 8 years ago

As for recid: if this is the identifier(tm) I strongly suggest to advertise it as this. Say, as permalink for this entry. However I'd probably decorate it as inspire:<recid> to avoid confusion with other bare numbers. Also a query for this id should resolve right to the record. (Thus the : syntax is probably in line with the search engine syntax.)

It goes without saying that this would have to include the commitment that all future versions have such a thing and resolve it properly. For services linking to inspire this would be very important.

I agree with @ksachs that humanly interpretable texts in ids (dates, "real" names etc.) tend to produce conflicts. I'd avoid them. Names tend to be short lived and non-unique. They are probably nice as an alias, but I would not use them as primary id.

For a naming scheme, at join² we use numeric ids with some prefix. Probably this would be suitable for inspire as well? We use <some code> to define what is referred to, (<agency>) as issuer of the id (several agencies may mint an id for the same thing; usually we use ISIL) and then alphanumerics.

etc.

bing13 commented 8 years ago

Excellent synthesis, Sam. I'd also endorse the suggestion that human readable text in identifiers leads to confusion. I've seen it numerous times in how publishers try to generate "meaningful" DOIs. Inscrutable serial numbers are safer (if less emotionally reassuring for some people).

aw-bib commented 8 years ago

A suggestion that came up was to use something like base(30) to shorten lengthy numbers. I.e. allow all digits and 7-bit characters except those that humans mix up (O, 0, l, 1, etc.).
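
A rough sketch of what such a base-30 encoding could look like. The exact alphabet is an assumption made for illustration: digits 2-9 plus uppercase letters, leaving out 0, 1, I, L, O and S, which gives exactly 30 symbols. Nothing here is an agreed INSPIRE scheme.

```python
# Assumed alphabet: digits and uppercase letters minus easily confused ones
# (0/O, 1/I/L, 5/S). The choice of excluded characters is illustrative only.
ALPHABET = "23456789ABCDEFGHJKMNPQRTUVWXYZ"
BASE = len(ALPHABET)  # 30


def encode(number):
    """Encode a non-negative integer as a base-30 string."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(text):
    """Decode a base-30 string back to an integer (case-insensitive)."""
    number = 0
    for char in text.upper():
        # str.index raises ValueError for characters outside the alphabet
        number = number * BASE + ALPHABET.index(char)
    return number
```

Since the alphabet avoids pure-digit strings only by chance, a prefix would still be needed if the result must never look like a bare integer.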

lnielsen commented 8 years ago

My five cents from building https://github.com/inveniosoftware/idutils:

The only identifiers that IDUtils can't properly detect automatically are PubMed IDs and RecIDs (because they are just integers). So whatever you decide to go with, I would highly recommend to avoid pure integers :-)
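
The ambiguity can be shown with a small pattern-based detector. This is a simplified sketch, not the IDUtils implementation: the regular expressions are rough approximations, and the `inspire:` prefix is the decoration suggested earlier in this thread, assumed here for illustration.

```python
import re

# Simplified, illustrative patterns; real-world validation is stricter.
SCHEMES = [
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("orcid", re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")),
    ("inspire", re.compile(r"^inspire:\d+$")),  # hypothetical prefixed recid
    ("pmid", re.compile(r"^\d+$")),
    ("recid", re.compile(r"^\d+$")),
]


def detect(value):
    """Return the names of all schemes whose pattern matches the value."""
    return [name for name, pattern in SCHEMES if pattern.match(value)]
```

A bare integer such as `"12345"` matches both `pmid` and `recid`, so the scheme cannot be inferred from the value alone, whereas a prefixed form matches exactly one pattern.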

hoc3426 commented 8 years ago

In JOBS, 50% of them are added by us using bibedit, rather than the submission form, so they don't get the JOBSUBMIT-JOB-2016-201 number.

aw-bib commented 8 years ago

I think minting of a pid should indeed not depend on the ingestion method.

kaplun commented 8 years ago

Yes, @hoc3426, we can generate PIDs independently of any submission method. I took the current JOB-XXX as an example of an established INSPIRE PID.

tsgit commented 8 years ago

Semantic overloading of identifiers almost always leads to problems. Some examples were already given by others, e.g. corrections to "year", spelling errors in "author", or name changes for other reasons are better dealt with when there is a level of abstraction. Jobs will be deleted (or hidden) once filled or expired. What kind of persistence is useful in this case? OAI-PMH describes 3 different levels: http://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords

Alexander asks about merged records in OAI. If merging is a special form of aggregation then the provenance container addresses that for OAI-PMH http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm#Provenance

Instead of a proliferation of semantically overloaded identifiers I vote for the oai identifiers already used by Inspire today. The namespace ensures global uniqueness and the scheme conveys how to resolve it. http://www.openarchives.org/OAI/2.0/guidelines-oai-identifier.htm

T.
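
For readers unfamiliar with the format, the OAI identifier guidelines linked above define the shape `oai:<namespace-identifier>:<local-identifier>`. A minimal parsing sketch follows; the example value `oai:inspirehep.net:123` is illustrative, and `parse_oai_identifier` is a hypothetical helper, not part of any existing library.

```python
def parse_oai_identifier(identifier):
    """Split an oai-identifier into (namespace, local identifier)."""
    # The local identifier may itself contain colons, so split at most twice.
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai-identifier: %r" % identifier)
    _scheme, namespace, local_id = parts
    return namespace, local_id
```

The namespace part is what makes the identifier globally unique, and the `oai:` scheme is what distinguishes it from a bare integer.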

aw-bib commented 8 years ago

@tsgit oai-id is fine with me as well.

Just keep in mind that the question arose in a non-OAI context. It's really about allowing some external service to link exactly to records in inspire. (Sometimes called deep links.) As the external service usually does not know about the internals of inspire, it will need some sort of handle to do this. Also, the goal is not to provide a search in inspire (thus linking via doi would not solve the issue, as in general I do not know if inspire holds the record at all) but a known-item reference.

E.g. we (putting on the join² hat) import data from inspire to local systems, add further details required (say funding info) and now I want to link back from our system to inspire. For pubmed, e.g., we store the pmid together with source info (to avoid the very valid issue pointed out by @lnielsen). Thus display formats, exports etc. can use the info to link back. See e.g. https://bib-pubdb1.desy.de/record/276466. Unfortunately, the much better link https://bib-pubdb1.desy.de/search?p=id:%22PUBDB-2015-04674%22 does not yet resolve to the record as it is just a search. This would be addressed by the "new official urls" proposed by @kaplun (.../id/PUBDB-2015-04674)