inveniosoftware / invenio-app-rdm

Turn-key research data management platform.
https://inveniordm.docs.cern.ch
MIT License

OAI-PMH: Use scheme for identifiers #1976

Closed SotosTsepe closed 1 year ago

SotosTsepe commented 1 year ago

Package version (if known): v10

Describe the bug

An entry for an identifier looks like this: <dc:identifier>10.81088/56s5n-a5j67</dc:identifier>. A correct identifier should either use the https:// or doi: scheme. Example: <dc:identifier>https://doi.org/10.81088/56s5n-a5j67</dc:identifier> or <dc:identifier>doi:10.81088/56s5n-a5j67</dc:identifier>

github-actions[bot] commented 1 year ago

This issue was automatically marked as stale.

max-moser commented 1 year ago

The background for this issue is that we want BASE to harvest our metadata via the OAI-PMH endpoint, similar to OpenAIRE. However, this currently doesn't work because InvenioRDM provides the DOI in a different form from what BASE expects.

max-moser commented 1 year ago

Note: Adding the https://doi.org/ prefix to idutils.normalize_doi() will result in issues in InvenioRDM, e.g. the deposit form:

Image

This happened after:

max-moser commented 1 year ago

Some information that I've dug up:

The OAI-PMH metadata formats are configured in the configuration variable OAISERVER_METADATA_FORMATS. In a default InvenioRDM installation, all configured formats reside in invenio_rdm_records.oai. These serializers are called on the records' dumped JSON (either service.read(...).to_dict(), or effectively the _source key for each hit in search results).

Potential spots where the identifier scheme could be added to the identifier include:

max-moser commented 1 year ago

Update: It might be that this issue only really applies to the Dublin Core export (oai_dc), because the other metadata formats (datacite and oai_datacite) have a kind of identifierType attribute on the identifiers' elements. @slint mentioned that other repositories do this for the DC export as well (e.g. see the result of the DC export at the bottom of the page for this dataset: Bottom > Exports > DC). We could update our own DC exporter to create such attributes as well. This would also resolve the questions regarding identifiers that have neither a URL form nor a scheme prefix (e.g. EAN13). @SotosTsepe has sent an email to BASE, asking for more details and whether that approach would be fine with them.

According to the Crossref DOI display guidelines, a DOI should always be displayed as a full URL (https://doi.org/10.xxxx/xxxx). This also suits some parts of the landing page UI very well, but in the small blue DOI badge in the boxes on the right it looks weird. This could also be achieved more selectively by using idutils.to_url() on the DOIs on the landing page only where we want it.

max-moser commented 1 year ago

Update: We got a response from the BASE team which says that:

Further, I had a look into the Dublin Core specs 1.1 as well as the newer DCMI metadata terms. Neither document mentions any kind of identifierType attribute.

Also, I couldn't quickly identify any standard regarding the expected shape of the JSON version of the Dublin Core export. Thus, it wasn't clear to me whether or not any changes to the export structure would break existing software, and I figured the best way to go ahead would be to keep the changes as minimal as possible.

https://github.com/inveniosoftware/invenio-rdm-records/pull/1257 only touches the DC serializer itself, and keeps the export structure the same as before.

This doesn't touch the (normalized) identifiers in the basic record schema dump (record_item.to_dict()) and thus keeps the urlerization :tm: of the identifiers opt-in. If we want to display full URLs for the DOIs on the landing page (as recommended in the previously linked Crossref DOI display guidelines), we have more control over when & where to do so.

slint commented 1 year ago

I think I got carried away by what I saw on the FigShare export on their landing page (see example download DC format).

On their OAI-PMH though for the same record (see XSLT formatted and also the page source), there is no identifierType attribute, and the DOI is not prefixed or formatted as a URL...

Harvard Dataverse also doesn't have the attribute (see example), but they do format the DOI as a URL.

On Zenodo currently, we're rendering <dc:identifier> without any prefix, and alternate identifiers end up in <dc:relation>, with some fancy info:eu-repo format (no idea where this comes from, have to ask the Zenodo elders). I think we're a bit in a pickle... For Zenodo we don't want to break compatibility of the format, since it's being harvested a lot via OAI-PMH, and that would break existing clients.

That means that we're looking for a subset of these options:

max-moser commented 1 year ago

Given that the new PR only touches the DublinCoreSchema (in invenio_rdm_records.resources.serializers.dublincore.schema), option 1 shouldn't be too much of a problem.

The logic in invenio-oaiserver checks if the configured metadata format serializer is a tuple and if so, uses the second entry as keyword arguments to the imported function. This is done in the default configuration of Invenio-OAIServer, and could be done in the default Invenio-App-RDM configuration as well.

I'm thinking about passing the second part of the configuration as schema_context to the serializer (and further to the DublinCoreSchema), which could be used in the schema to determine whether or not to dump the prefixes. What do you think of this?
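
To make the idea concrete, here is a minimal, self-contained sketch of the tuple mechanism described above. It is not the actual invenio-oaiserver or invenio-rdm-records code: `dublincore_serializer`, `resolve_serializer`, `METADATA_FORMATS` and the simplified record layout are hypothetical stand-ins, and the real configuration refers to the serializer by import string rather than by function reference.

```python
from functools import partial


def dublincore_serializer(record, schema_context=None):
    """Hypothetical stand-in for a DC serializer (not the real InvenioRDM one)."""
    ctx = schema_context or {}
    # Assumed, simplified record layout for illustration only.
    doi = record["pids"]["doi"]["identifier"]
    if ctx.get("prefix_identifier_schemes"):
        doi = f"doi:{doi}"
    return f"<dc:identifier>{doi}</dc:identifier>"


# Tuple form: (serializer, keyword arguments). A direct function reference is
# used here only to keep the sketch self-contained.
METADATA_FORMATS = {
    "oai_dc": {
        "serializer": (
            dublincore_serializer,
            {"schema_context": {"prefix_identifier_schemes": True}},
        ),
    },
}


def resolve_serializer(fmt_config):
    """If the serializer is a tuple, bind its second entry as keyword arguments."""
    serializer = fmt_config["serializer"]
    if isinstance(serializer, tuple):
        func, kwargs = serializer
        return partial(func, **kwargs)
    return serializer


record = {"pids": {"doi": {"identifier": "10.81088/56s5n-a5j67"}}}
serialize = resolve_serializer(METADATA_FORMATS["oai_dc"])
print(serialize(record))
```

With this shape, an instance that wants the old output only needs to change the keyword arguments in its configuration; the serializer itself stays untouched.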

slint commented 1 year ago

I agree, if we can easily pass the schema_context in the REST API of the resources, I think we're good then.

There are some similar things we might have to do about passing context/config in https://github.com/inveniosoftware/invenio-rdm-records/issues/1231, so maybe check-in with whoever is on it to see that we don't clash.

max-moser commented 1 year ago

Alright, updated the PR and tested it; looking good. Other than the reordering of the identifier entries (PIDs first now), the system can be made to behave exactly as before with a line or two of config.


Comparing https://127.0.0.1:5000/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:my-site.com:p8308-48f88...

"prefix_identifier_schemes": True, "urlize_identifiers": True:

<dc:relation>https://doi.org/10.9999/rdm.9999988</dc:relation>
<dc:relation>ean13:9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

"prefix_identifier_schemes": False, "urlize_identifiers": True:

<dc:relation>https://doi.org/10.9999/rdm.9999988</dc:relation>
<dc:relation>9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

"prefix_identifier_schemes": True, "urlize_identifiers": False:

<dc:relation>doi:10.9999/rdm.9999988</dc:relation>
<dc:relation>ean13:9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

"prefix_identifier_schemes": False, "urlize_identifiers": False:

<dc:relation>10.9999/rdm.9999988</dc:relation>
<dc:relation>9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>