openaire / guidelines-cris-managers

OpenAIRE Guidelines for CRIS Managers based on CERIF-XML
https://openaire-guidelines-for-cris-managers.readthedocs.io/

OAI Identifier uses outdated validation for domain name #126

Closed abollini closed 2 years ago

abollini commented 2 years ago

OAI identifiers are usually generated using the system's domain name as the repository identifier, which leads to identifiers such as oai:dspace-cris.4science.cloud:e9ed438e-c7f7-4a18-95e5-3f635ea65fee

Unfortunately, the oai-identifier.xsd (http://www.openarchives.org/OAI/2.0/oai-identifier.xsd), cached by the guidelines at https://github.com/openaire/guidelines-cris-managers/blob/master/schemas/cached/oai-identifier.xsd#L36-L42, does not allow a digit as the first character of a domain name component. This makes the identifier above invalid, although the domain dspace-cris.4science.cloud is of course perfectly valid.
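To make the problem concrete, here is a minimal Python sketch that splits the example identifier into its scheme, namespace, and local parts and lists the domain labels that start with a digit. The function name `labels_starting_with_digit` is hypothetical and not part of any OAI or OpenAIRE tooling; it only illustrates which label trips the schema.

```python
def labels_starting_with_digit(oai_identifier: str) -> list[str]:
    """Return the namespace labels whose first character is a digit."""
    # Assumed shape: oai:<namespace-identifier>:<local-identifier>.
    # The local identifier may itself contain ":", so split at most twice.
    scheme, namespace, _local = oai_identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {oai_identifier!r}")
    # "".isdigit() is False, so empty labels are simply not reported here.
    return [label for label in namespace.split(".") if label[:1].isdigit()]


example = "oai:dspace-cris.4science.cloud:e9ed438e-c7f7-4a18-95e5-3f635ea65fee"
print(labels_starting_with_digit(example))  # ['4science']
```

Only the `4science` label is reported; the other two labels start with a letter.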

hvdsomp commented 2 years ago

(I am cc-ing @zimeon and @phonedude)

Well, indeed, it looks like something went a tad wrong with the definition of oai-identifier and, AFAIK, this is the first time the problem was brought up.

My interpretation:

The constructs related to domain names in the oai-identifier syntax definition of Section 2.1 of the OAI Identifier Format guideline build upon Section 3.2.2 of RFC2396, specifically these construction rules:

      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

which show that digits are allowed in every position of a domain name component, with the exception of the toplabel (TLD, e.g. org, com), whose first character must be alphabetic. But by using the following constructs:

  namespace-identifier = domainname-word "." domainname
  domainname = domainname-word [ "." domainname ]
  domainname-word = alpha *( alphanum | "-" )

the OAI Identifier Format guideline forbids digits in the first position of every component of a domain name. I can't imagine this was the intention, and, if it was, I do not recall what the motivation could have been.
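The difference between the two sets of rules can be sketched with regular expressions. The snippet below is a hand translation of the ABNF quoted above into Python `re` patterns, not the pattern used in oai-identifier.xsd; the names `RFC2396_HOSTNAME` and `OAI_NAMESPACE` are chosen here for illustration.

```python
import re

# RFC 2396, Section 3.2.2 (hand-translated from the ABNF)
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# hostname = *( domainlabel "." ) toplabel [ "." ]
RFC2396_HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")

# OAI Identifier Format guideline, Section 2.1 (hand-translated)
# domainname-word = alpha *( alphanum | "-" )
WORD = r"[A-Za-z][A-Za-z0-9-]*"
# namespace-identifier = domainname-word "." domainname
OAI_NAMESPACE = re.compile(rf"{WORD}(?:\.{WORD})+")


def is_rfc2396_hostname(name: str) -> bool:
    return RFC2396_HOSTNAME.fullmatch(name) is not None


def is_oai_namespace(name: str) -> bool:
    return OAI_NAMESPACE.fullmatch(name) is not None


for name in ("example.org", "dspace-cris.4science.cloud", "example.4u"):
    print(name, is_rfc2396_hostname(name), is_oai_namespace(name))
```

Under this translation, `dspace-cris.4science.cloud` is accepted by the RFC 2396 rule but rejected by the guideline's rule, while `example.4u` is rejected by both because its top label starts with a digit.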

Possible solutions:

phonedude commented 2 years ago

Yes, I have no memory of this being by design. So I can only assume it was an oversight.

zimeon commented 2 years ago

Looks like an error to me. I also have no memory of an intention here.

I agree that quietly adjusting the schema is relatively easy. However, the schema would then no longer match the guideline, which is weird, so maybe we should edit the 20-year-old guideline too?

ACz-UniBi commented 2 years ago

Dear all,

I believe this reflects the evolution of the Internet. RFC2396, published in August 1998, has since been updated by later releases such as RFC3986 (January 2005); see its Section 3.2.

hvdsomp commented 2 years ago

@zimeon, I think you’re right that the guideline should be updated too. I convinced myself when noticing that there’s a Document History section in which details of changes can be conveyed. I’m thinking that the only changes needed are in the construction rules (and document version, of course). I would prefer sticking to the reference to RFC2396 instead of more recent RFCs re URI syntax, because, after all, 2396 was the law of the land when oai-identifier was spec-ed and we’re merely correcting the spec, not creating a new one.

jdvorak001 commented 2 years ago

This is done, thanks @ACz-UniBi!