icatproject / icat.server

The ICAT server offering both SOAP and "RESTlike" interfaces to a metadata catalog.

Add attributes to data publication related classes #297

Closed RKrahl closed 1 year ago

RKrahl commented 2 years ago

Add some more optional attributes to data publication related classes.

In detail:

Close #295 and close #296.

kevinphippsstfc commented 2 years ago

I notice that the CI is failing. I didn't have time to investigate that so could you take a look.

RKrahl commented 2 years ago

The RelatedItem in DataCite 4.4 is an extended and generalized version of RelatedIdentifier. It is extended in that it has additional subproperties, such as relatedItemType, Creator, Title, PublicationYear, … The purpose of these properties is mostly to allow generating a citation line referencing the related resource. They could in principle be fetched from the metadata of the related identifier, at least if that identifier is a DataCite DOI. It is generalized in the sense that the relatedItemIdentifier is optional, so it is also possible to reference items that do not have an identifier.

To give a concrete example: if a dataset is published with a DataCite DOI and the data has been collected from an instrument, that instrument could be referenced in the DataCite metadata using RelatedIdentifier as:

<relatedIdentifier relatedIdentifierType="DOI" relationType="IsCompiledBy">10.5442/NI000001</relatedIdentifier>

or using RelatedItem as:

<relatedItem relatedItemType="Other" relationType="IsCompiledBy">
  <relatedItemIdentifier relatedItemIdentifierType="DOI">10.5442/NI000001</relatedItemIdentifier>
  <titles>
    <title>E2 - Flat-Cone Diffractometer</title>
  </titles>
</relatedItem>

or using both redundantly. The RelatedItem provides additional information: the type of the related resource[^1] and the name of the instrument, included as title. The same information could be fetched from the metadata of the referenced instrument DOI. Note that, as this example shows, RelatedItem has a subproperty relatedItemIdentifier, so it is not only meant for related resources that have no identifier.

Adding the additional information directly in the DataCite metadata may be needed, for instance if the metadata is to be harvested by B2FIND. B2FIND needs to map the incoming metadata onto the EUDAT Core Metadata Schema. EUDAT Core has a text property for Instrument, so it would be possible to map the title from the RelatedItem onto that.[^2] But (as I learned last week) the B2FIND mapper does not support resolving external identifiers and fetching metadata from the related DOIs. So if the instrument were linked only using RelatedIdentifier, it would not be possible to have the Instrument property set in B2FIND.

In order to be able to use RelatedItem, we would need at least relatedItemType and title, as proposed in this PR, because these subproperties are mandatory.

[^1]: That type is Other here, because there is no Instrument in the controlled list of terms for resource type in the current DataCite version. But hopefully it will be added in the future.

[^2]: The mapper does support individual per repository configuration. So it would be possible to add a rule: if a dataset relates a resource with IsCompiledBy, then that resource should be taken as an instrument.

RKrahl commented 1 year ago

… continued:

> So a RelatedItem does not have an identifier, but we have identifier as a mandatory field in RelatedIdentifier. Also, in RelatedItem title is a mandatory field but in RelatedIdentifier title is optional.

As explained above, in DataCite, RelatedItem may have an identifier, but as opposed to RelatedIdentifier, it is not mandatory. And I hope I made clear that there are good reasons to use RelatedItem also for resources that do have an identifier.

I'm not sure there is any use case for linking to resources without an identifier in the context of a data publication from a PaN facility. So I don't believe the fact that identifier is mandatory in RelatedIdentifier in the ICAT schema would be an issue in practice.

> Which makes me feel that we really should have an additional entity for RelatedItem with the mandatory and optional fields set correctly rather than trying to use RelatedIdentifier for both purposes.

I believe having separate classes in the ICAT schema for the two cases would make things overly complicated. In practice, one might want to add both properties for the same related resource in the DataCite metadata: RelatedItem, because it provides the additional subproperties needed, for instance, for B2FIND; and RelatedIdentifier for backward compatibility, because RelatedItem is relatively new and might not be understood by all consumers of the metadata.

In the end, it will be a site-specific script that generates the DataCite metadata out of ICAT: either the script that generates the data publication landing pages or the XSLT file that generates the metadata in icat.oaipmh. The schema as proposed in this PR allows for all options one might want to implement. It is for instance relatively easy to code into the XSLT something like:
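The "emit both" option can also be illustrated outside XSLT. The following Python is only a sketch under assumptions, not the stylesheet from icat.oaipmh: the `RelatedItem` dataclass is a hypothetical stand-in for a record fetched from ICAT, the function names are invented for illustration, and `id_type` defaults to `DOI` because the ICAT class as proposed carries no identifier type. It always produces a `relatedIdentifier` and additionally a `relatedItem` when both `relatedItemType` and `title` are set.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RelatedItem:
    """Hypothetical stand-in for an ICAT record; field names mirror the ICAT attributes."""
    identifier: str
    relationType: str
    relatedItemType: Optional[str] = None
    title: Optional[str] = None


def related_identifier(item: RelatedItem, id_type: str = "DOI") -> ET.Element:
    """Build a DataCite relatedIdentifier element (id_type is an assumption)."""
    elem = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": id_type, "relationType": item.relationType},
    )
    elem.text = item.identifier
    return elem


def related_item(item: RelatedItem, id_type: str = "DOI") -> Optional[ET.Element]:
    """Build a DataCite relatedItem element, or None if mandatory parts are missing."""
    if not (item.relatedItemType and item.title):
        return None
    elem = ET.Element(
        "relatedItem",
        {"relatedItemType": item.relatedItemType, "relationType": item.relationType},
    )
    ident = ET.SubElement(
        elem, "relatedItemIdentifier", {"relatedItemIdentifierType": id_type}
    )
    ident.text = item.identifier
    titles = ET.SubElement(elem, "titles")
    ET.SubElement(titles, "title").text = item.title
    return elem


def datacite_relations(item: RelatedItem, id_type: str = "DOI") -> List[ET.Element]:
    """Return relatedIdentifier always, plus relatedItem when it can be built."""
    elems = [related_identifier(item, id_type)]
    extra = related_item(item, id_type)
    if extra is not None:
        elems.append(extra)
    return elems
```

The caller would still have to place the returned elements into the appropriate container elements of the DataCite record.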

Or, if you want to avoid the redundancy, you might put into your code:
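A non-redundant variant can be sketched in Python as well. This is an illustration under assumptions rather than actual XSLT: `RelatedItem` is a hypothetical stand-in for the ICAT record, `datacite_relation` is an invented name, and the identifier type defaults to `DOI` because ICAT does not store one. It emits a `relatedItem` when `relatedItemType` and `title` are both set and falls back to a plain `relatedIdentifier` otherwise.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelatedItem:
    """Hypothetical stand-in for an ICAT record; field names mirror the ICAT attributes."""
    identifier: str
    relationType: str
    relatedItemType: Optional[str] = None
    title: Optional[str] = None


def datacite_relation(item: RelatedItem, id_type: str = "DOI") -> ET.Element:
    """Return exactly one element per related resource (id_type is an assumption)."""
    if item.relatedItemType and item.title:
        # Enough information for a relatedItem: use it exclusively.
        elem = ET.Element(
            "relatedItem",
            {"relatedItemType": item.relatedItemType,
             "relationType": item.relationType},
        )
        ident = ET.SubElement(
            elem, "relatedItemIdentifier", {"relatedItemIdentifierType": id_type}
        )
        ident.text = item.identifier
        titles = ET.SubElement(elem, "titles")
        ET.SubElement(titles, "title").text = item.title
        return elem
    # Otherwise fall back to the older relatedIdentifier property.
    elem = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": id_type, "relationType": item.relationType},
    )
    elem.text = item.identifier
    return elem
```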

A site that doesn't care about RelatedItem may just ignore relatedItemType and title.

RKrahl commented 1 year ago

> I notice that the CI is failing. I didn't have time to investigate that so could you take a look.

The CI always fails for anything I submit. I guess that is a permission issue. I also don't get to see any diagnostic messages, so I can't tell what is going wrong.

RKrahl commented 1 year ago

As discussed with @kevinphippsstfc today, we decided to rename the entity class RelatedIdentifier to RelatedItem with this PR in order to make it clearer what this is supposed to be. For the same reason, I expanded some of the comment strings to provide additional hints on what should be put into the attributes.

So the new class now looks like:


RelatedItem

A reference to an external resource or item that is related to a data publication, such as a scientific article that is based on the data or the instrument that has been used to collect the data

Uniqueness constraint: publication, identifier

Relationships:

| Card | Class | Field |
|------|-------|-------|
| 1,1 | DataPublication | publication |

Other fields:

| Field | Type | Description |
|-------|------|-------------|
| identifier | String [255] NOT NULL | The identifier of the related resource |
| relationType | String [255] NOT NULL | Description of the relationship with the related resource, see DataCite property relationType for suggested values |
| fullReference | String [1023] | The full reference for the related resource as it should be displayed on the landing page |
| relatedItemType | String [255] | The type of the related resource, see DataCite property resourceTypeGeneral for suggested values |
| title | String [255] | Title or name of the related resource |

Obviously, the corresponding one-to-many relation in DataPublication has also been renamed accordingly.