Closed by SotosTsepe 1 year ago
This issue was automatically marked as stale.
The background for this issue is that we want BASE to harvest our metadata via the OAI-PMH endpoint, similar to OpenAIRE. However, this doesn't currently work because InvenioRDM provides the DOI in a different form from what BASE expects.
Note: Adding the `https://doi.org/` prefix to `idutils.normalize_doi()` will result in issues in InvenioRDM, e.g. the deposit form:
This happened after:
Some information that I've dug up:
The OAI-PMH metadata formats are configured in the configuration variable `OAISERVER_METADATA_FORMATS`.
In a default InvenioRDM installation, all configured formats reside in `invenio_rdm_records.oai`.
These serializers are called on the records' dumped JSON (either `service.read(...).to_dict()`, or effectively the `_source` key for each hit in search results).
Potential spots where the identifier scheme could be added to the identifier include:
- `idutils`
- `schema.dump()` / `record_item.to_dict()` (that's what was being done in PR #1245 (closed))

Update:
It might be that this issue only really applies to the Dublin Core export (`oai_dc`), because the other metadata formats (`datacite` and `oai_datacite`) have a kind of `identifierType` attribute on the identifiers' elements.
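To make the difference concrete, here is a minimal sketch using only the standard library. It does not call the real InvenioRDM serializers; it just builds the two kinds of identifier element by hand, and the DataCite namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

doi = "10.81088/56s5n-a5j67"

# Dublin Core: the element carries only text, so the scheme has to be
# part of the value itself (or it is lost).
dc_identifier = ET.Element(f"{{{DC_NS}}}identifier")
dc_identifier.text = doi

# DataCite-style: the scheme travels in the identifierType attribute.
datacite_identifier = ET.Element("identifier", {"identifierType": "DOI"})
datacite_identifier.text = doi

print(ET.tostring(dc_identifier, encoding="unicode"))
print(ET.tostring(datacite_identifier, encoding="unicode"))
```

A harvester reading the second element can tell that the value is a DOI; one reading the first has to guess from the value alone.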
@slint mentioned that other repositories do this for the DC export as well (e.g. see the result of the DC export at the bottom of the page for this dataset: Bottom > Exports > DC).
We could update our own DC exporter to create such attributes as well.
This would also solve the questions regarding identifiers that neither have a URL form nor a scheme prefix (e.g. `EAN13`).
@SotosTsepe has sent an email to BASE, asking for more details and if that approach would be fine with them.
According to the Crossref DOI display guidelines, a DOI should always be displayed as a full URL (`https://doi.org/10.xxxx/xxxx`).
This also suits some parts of the landing page UI very well, but in the small blue DOI badge in the boxes on the right it looks weird.
This could also be achieved more selectively by using `idutils.to_url()` on the DOIs on the landing page only where we want it.
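As a rough illustration of "only where we want it", the sketch below keeps the bare DOI for the badge and uses the URL form elsewhere. `doi_to_url` is a hand-written stand-in so the example runs without dependencies; in InvenioRDM one would presumably call `idutils.to_url()` instead. `render_doi` and the context names `"badge"` and `"citation"` are hypothetical.

```python
def doi_to_url(doi):
    """Stand-in for an idutils-based conversion of a DOI to its URL form."""
    doi = doi.strip()
    # Drop a scheme or resolver prefix if one is already present.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"https://doi.org/{doi}"


def render_doi(doi, context):
    """Return the bare DOI for the badge, the full URL everywhere else."""
    return doi if context == "badge" else doi_to_url(doi)


print(render_doi("10.9999/rdm.9999988", "badge"))
print(render_doi("10.9999/rdm.9999988", "citation"))
```

The stored, normalized identifier stays untouched; only the rendering step decides which form to show.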
Update: We got a response from the BASE team which says that:
`oai_dc` metadata schema
`identifierType` attribute would be ignored)

Further, I had a look into the Dublin Core specs 1.1 as well as the newer DCMI metadata terms.
Neither document mentions any kind of `identifierType` attribute.
Also, I couldn't quickly identify any standard regarding the expected shape of the JSON version of the Dublin Core export. Thus, it wasn't clear to me whether or not any changes to the export structure would break existing software, and I figured the best way to go ahead would be to keep the changes as minimal as possible.
https://github.com/inveniosoftware/invenio-rdm-records/pull/1257 only touches the DC serializer itself, and keeps the export structure the same as before.
This doesn't touch the (normalized) identifiers in the basic record schema dump (`record_item.to_dict()`) and thus keeps the urlerization :tm: of the identifiers opt-in.
If we want to display full URLs for the DOIs on the landing page (as recommended in the previously linked Crossref DOI display guidelines), we have more control over when & where to do so.
I think I got carried away by what I saw on the FigShare export on their landing page (see example download DC format).
On their OAI-PMH though for the same record (see XSLT formatted and also the page source), there is no `identifierType` attribute, and the DOI is not prefixed or formatted as a URL...
Harvard Dataverse also doesn't have the attribute (see example), but they do format the DOI as a URL.
On Zenodo currently, we're rendering `<dc:identifier>` without any prefix, and alternate identifiers end up in `<dc:relation>`, with some fancy `info:eu-repo` format (no idea where this comes from, have to ask the Zenodo elders). I think we're a bit in a pickle... For Zenodo we don't want to break compatibility of the format, since it's being harvested a lot via OAI-PMH, and that would break existing clients.
That means that we're looking for a subset of these options:
Given that the new PR only touches the `DublinCoreSchema` (in `invenio_rdm_records.resources.serializers.dublincore.schema`), option 1 shouldn't be too much of a problem.
The logic in `invenio-oaiserver` checks if the configured metadata format serializer is a tuple and if so, uses the second entry as keyword arguments to the imported function.
This is done in the default configuration of Invenio-OAIServer, and could be done in the default Invenio-App-RDM configuration as well.
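The tuple handling described above can be sketched roughly as follows. This is not the actual `invenio-oaiserver` code, just an illustration of the idea: `resolve_serializer` and `METADATA_FORMATS` are hypothetical names, and `json.dumps` stands in for a real serializer so that the example runs on the standard library alone.

```python
from functools import partial
from importlib import import_module


def resolve_serializer(spec):
    """Turn a serializer spec into a callable.

    `spec` is either an import string ("package.module:function") or a
    tuple of (import string, keyword arguments).
    """
    if isinstance(spec, tuple):
        import_path, kwargs = spec
    else:
        import_path, kwargs = spec, {}

    module_name, _, attr = import_path.partition(":")
    func = getattr(import_module(module_name), attr)
    # Bind the configured keyword arguments, if there are any.
    return partial(func, **kwargs) if kwargs else func


# Stand-in configuration: json.dumps plays the role of the serializer.
METADATA_FORMATS = {
    "plain": {"serializer": "json:dumps"},
    "sorted": {"serializer": ("json:dumps", {"sort_keys": True})},
}

record = {"b": 1, "a": 2}
plain = resolve_serializer(METADATA_FORMATS["plain"]["serializer"])
sorted_keys = resolve_serializer(METADATA_FORMATS["sorted"]["serializer"])
print(plain(record))
print(sorted_keys(record))
```

With this shape, options for the Dublin Core export could live in the second tuple entry without changing the serializer's import path.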
I'm thinking about passing the second part of the configuration as `schema_context` to the serializer (and further to the `DublinCoreSchema`), which could be used in the schema to determine whether or not to dump the prefixes.
What do you think of this?
I agree, if we can easily pass the `schema_context` in the REST API of the resources, I think we're good then.
There are some similar things we might have to do about passing context/config in https://github.com/inveniosoftware/invenio-rdm-records/issues/1231, so maybe check in with whoever is on it to see that we don't clash.
Alright, updated the PR and tested it; looking good.
Other than the reordering of the `identifier` entries (PIDs first now), the system can be made to behave exactly as before with a line or two of config.
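How the two options in the comparison that follows interact can be sketched roughly like this. It is a simplified stand-in, not the code from the PR: `format_identifier` and `URL_TEMPLATES` are hypothetical, and the real schema would rely on `idutils` rather than a hand-written table.

```python
# Hypothetical table of schemes that have a URL form.
URL_TEMPLATES = {"doi": "https://doi.org/{value}"}


def format_identifier(scheme, value, prefix_identifier_schemes, urlize_identifiers):
    """Format one identifier according to the two config flags."""
    # A URL form, where one exists and is requested, takes precedence.
    if urlize_identifiers and scheme in URL_TEMPLATES:
        return URL_TEMPLATES[scheme].format(value=value)
    # Some normalized values already carry their scheme (e.g. ARKs).
    if value.startswith(f"{scheme}:"):
        return value
    if prefix_identifier_schemes:
        return f"{scheme}:{value}"
    return value


identifiers = [
    ("doi", "10.9999/rdm.9999988"),
    ("ean13", "9780521425575"),
    ("ark", "ark:/123/456"),
]

for prefix in (True, False):
    for urlize in (True, False):
        print(prefix, urlize, [format_identifier(s, v, prefix, urlize) for s, v in identifiers])
```

Setting both flags to `False` gives back the bare normalized values, which corresponds to the previous behaviour.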
Comparing https://127.0.0.1:5000/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:my-site.com:p8308-48f88...

`"prefix_identifier_schemes": True, "urlize_identifiers": True,`:
<dc:relation>https://doi.org/10.9999/rdm.9999988</dc:relation>
<dc:relation>ean13:9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

`"prefix_identifier_schemes": False, "urlize_identifiers": True,`:
<dc:relation>https://doi.org/10.9999/rdm.9999988</dc:relation>
<dc:relation>9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

`"prefix_identifier_schemes": True, "urlize_identifiers": False,`:
<dc:relation>doi:10.9999/rdm.9999988</dc:relation>
<dc:relation>ean13:9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>

`"prefix_identifier_schemes": False, "urlize_identifiers": False,`:
<dc:relation>10.9999/rdm.9999988</dc:relation>
<dc:relation>9780521425575</dc:relation>
<dc:relation>ark:/123/456</dc:relation>
Package version (if known): v10
Describe the bug
An entry for an identifier looks like this: `<dc:identifier>10.81088/56s5n-a5j67</dc:identifier>`. A correct identifier should either use the `https://` or `doi:` scheme. Example: `<dc:identifier>https://doi.org/10.81088/56s5n-a5j67</dc:identifier>` or `<dc:identifier>doi:10.81088/56s5n-a5j67</dc:identifier>`