openaire / guidelines-cris-managers

OpenAIRE Guidelines for CRIS Managers based on CERIF-XML
https://openaire-guidelines-for-cris-managers.readthedocs.io/

OAIPMHID - the OAIPMH identifier twice for the same record #85

Closed ACz-UniBi closed 3 years ago

ACz-UniBi commented 4 years ago

The same CRIS exposes its records through different endpoints (context views), so the same unique record in the CRIS could end up with two identifiers:

Example:

This needs more investigation.

jdvorak001 commented 4 years ago

The Publications/ bit is necessary to distinguish the IDs of publications from other entities that may be exported from a CRIS (such as projects, orgunits, persons, ...).

In the case where there are two records from the same source describing the same item, coming through the CRIS and the Literature Repo interfaces, they should not be difficult to de-duplicate.

jdvorak001 commented 4 years ago

@jdvorak001, see if there is a real case for that.

abollini commented 4 years ago

I support removing the mandatory syntax for the identifier. If a CRIS needs to prefix the ID with the entity type because it has no better identifier, that's fine, but we should not mandate such a specific format. The harvester should be able to determine the entity type from other information, such as the detailed XML representation or the set in which the record is included.

To support this specific identifier we had to extend dspace-cris in a way that slows indexing performance, because the data to be exposed over OAI-PMH (regardless of the OAI context) must be in a SOLR core where the document ID is the OAI identifier. The solution was therefore to index all documents twice, once with the "normal" identifier and once with the special "cris" identifier, and to exclude one or the other document from each context. Again, I don't see any extra value in mandating a specific identifier format for the CRIS system; instead, it makes it more difficult to implement support for the guidelines in real systems.

olli-gold commented 4 years ago

I second @abollini. It should not be mandatory to use the ID to indicate the type of a record, as this can be done in the other metadata with a special type field. Just my opinion as a technically focussed repository maintainer.

jdvorak001 commented 4 years ago

Yes, as long as we live within Publications alone. If we extend to Patents and Products (most notably research datasets and research software), then there is potential for clashes.

And even more likely are clashes with Projects, Funding, Persons, OrgUnits, Events and Equipment. To my knowledge many CRISs still use IDs generated from sequences. Life would be much easier if everyone was using UUIDs.
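To make the clash scenario concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the host `cris.example.org`, the sequence values, and the helper names are hypothetical and not taken from any real CRIS. It shows how IDs drawn from independent per-entity sequences collide once the type prefix is dropped, and how either a type prefix or UUIDs avoids that.

```python
from uuid import uuid4

# Hypothetical local IDs from two independent database sequences.
# Both sequences start at 1, which is the root of the clash.
publication_ids = [1, 2, 3]
project_ids = [1, 2]

HOST = "cris.example.org"  # placeholder host, an assumption for this sketch


def bare_oai_id(local_id):
    """OAI identifier without any entity type information."""
    return f"oai:{HOST}:{local_id}"


def prefixed_oai_id(entity_type, local_id):
    """OAI identifier carrying the entity type as a prefix of the local part."""
    return f"oai:{HOST}:{entity_type}/{local_id}"


bare = [bare_oai_id(i) for i in publication_ids + project_ids]
prefixed = ([prefixed_oai_id("Publications", i) for i in publication_ids]
            + [prefixed_oai_id("Projects", i) for i in project_ids])
uuid_based = [bare_oai_id(uuid4()) for _ in publication_ids + project_ids]

# Number of identifiers that are not unique in each scheme.
bare_clashes = len(bare) - len(set(bare))
prefixed_clashes = len(prefixed) - len(set(prefixed))
uuid_clashes = len(uuid_based) - len(set(uuid_based))
```

With these sample values the bare scheme produces clashes while the prefixed one does not; a system whose local IDs are already unique across entities (for example UUIDs) needs no prefix.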

olli-gold commented 4 years ago

Of course everyone should use UUIDs, but I don't see why the structure of the UUID needs to be enforced. DSpace CRIS does not need such enforcement; as Andrea explained, it's exactly the opposite, and that enforcement is causing a lot of trouble because all publications need to be indexed twice. The old identifiers need to be kept, as harvesters like BASE might already have used them, and changing them would cause BASE to create a duplicate unless they notice that the old record no longer exists.

Can you explain why you think there can be clashes somewhere? We can still use sets or fields in the metadata to indicate the type of something. And if a system needs to use the type to build a UUID, it's still free to do so; there is no need to force everyone to do it the same way. Or am I missing something?
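As a rough illustration of reading the type from the set or the metadata instead of the identifier, the following Python sketch parses a hand-written OAI-PMH record. The sample record, its identifier, the set name `openaire_cris_publications`, and the CERIF namespace URI are assumptions made for this example and should be checked against the Guidelines; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hand-written sample record; identifier, set name and payload namespace
# are assumptions for illustration only.
RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:cris.example.org:12345</identifier>
    <datestamp>2020-01-01</datestamp>
    <setSpec>openaire_cris_publications</setSpec>
  </header>
  <metadata>
    <Publication xmlns="https://www.openaire.eu/cerif-profile/1.1/" id="12345"/>
  </metadata>
</record>"""


def entity_type_from_payload(record_xml):
    """Return the local name of the first element inside <metadata>."""
    record = ET.fromstring(record_xml)
    payload = record.find(f"{{{OAI_NS}}}metadata")[0]
    # Tags look like "{namespace}LocalName"; keep only the local name.
    return payload.tag.rsplit("}", 1)[-1]


def set_specs(record_xml):
    """Return all setSpec values from the record header."""
    record = ET.fromstring(record_xml)
    path = f"{{{OAI_NS}}}header/{{{OAI_NS}}}setSpec"
    return [element.text for element in record.findall(path)]


record_type = entity_type_from_payload(RECORD)
record_sets = set_specs(RECORD)
```

Here the harvester learns that the record is a publication without inspecting the identifier at all, which is the point made above.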

jdvorak001 commented 4 years ago

After a discussion in the task group: including type information in the identifier is one possible strategy for handling internal identifier ambiguity in some systems, but it indeed does not need to be mandated for all systems. Some systems use their own ways of maintaining identifiers that are unique across all entities.

We will update the text of the Guidelines with a note on the need to guarantee uniqueness of the OAI identifiers across all entities. This change is backwards compatible, so existing implementations do not need any modification. The change can go in any micro revision of the Guidelines. The constraint shall then no longer be enforced by validators.
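A check for the remaining uniqueness requirement could look roughly like the Python sketch below. It is not the logic of the actual validator; the function name, the input shape (pairs of OAI identifier and entity type), and the sample data are hypothetical.

```python
from collections import defaultdict


def find_cross_entity_clashes(records):
    """Map each OAI identifier used by more than one entity type to those types.

    `records` is an iterable of (oai_identifier, entity_type) pairs.
    """
    types_by_identifier = defaultdict(set)
    for identifier, entity_type in records:
        types_by_identifier[identifier].add(entity_type)
    return {
        identifier: sorted(types)
        for identifier, types in types_by_identifier.items()
        if len(types) > 1
    }


# Hypothetical harvest result: ID 7 is reused by a project and an equipment item.
harvested = [
    ("oai:cris.example.org:7", "Project"),
    ("oai:cris.example.org:8", "Publication"),
    ("oai:cris.example.org:7", "Equipment"),
]
clashes = find_cross_entity_clashes(harvested)
```

An empty result means the identifiers are unique across entities, whatever format the CRIS chose for them.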

jdvorak001 commented 4 years ago

The CERIF XML examples shall be made consistent with this change: the OAI identifiers shall drop the type prefixes (since that's easier than adding the type prefixes to all id attributes within the CERIF XML markup).

abollini commented 3 years ago

@jdvorak001 I have prepared the PR above, can you check?

jdvorak001 commented 3 years ago

Now #90 is merged, so CRIS implementations will be able to choose the OAI identifiers however they like. Thanks @ACz-UniBi for pointing this out and @olli-gold for bringing in the additional context.

jdvorak001 commented 3 years ago

I'm pushing a modification of the examples where the unprefixed form of the OAI identifier is used for Equipment. It illustrates this approach and is also useful for validating https://github.com/jdvorak001/openaire-cris-validator/pull/7.