inveniosoftware / invenio-app-rdm

Turn-key research data management platform.
https://inveniordm.docs.cern.ch
MIT License

Custom fields #2019

Closed kpsherva closed 1 year ago

kpsherva commented 1 year ago

Analysis

Implementation tasks

### Tasks
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1186
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1191
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1211
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1215
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1217
- [x] https://github.com/inveniosoftware/invenio-rdm-records/issues/1219
- [ ] Analyse custom fields naming ([see](https://github.com/zenodo/zenodo-rdm/pull/203#discussion_r1119683722))
- [x] Analyse custom fields import in invenio.cfg (zenodo-rdm) ([see](https://github.com/zenodo/zenodo-rdm/pull/196#pullrequestreview-1311235440))
- [ ] Analyse export formats (e.g. DataCite) for custom fields ([see](https://github.com/inveniosoftware/invenio-rdm-records/issues/1191#issuecomment-1441699517))
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1228
- [ ] https://github.com/inveniosoftware/docs-invenio-rdm/issues/503
- [ ] https://github.com/inveniosoftware/docs-invenio-rdm/issues/502
- [ ] https://github.com/inveniosoftware/invenio-rdm-records/issues/1230
- [ ] https://github.com/inveniosoftware/docs-invenio-rdm/issues/516

alejandromumo commented 1 year ago

LORY community analysis

Issue: https://github.com/zenodo/zenodo-rdm/issues/179

In the linked issue, every record that belongs to a LORY community (~10k records) was analysed. The following fields are in use but are not migrated in https://github.com/zenodo/zenodo-rdm/issues/102:

Note that some fields are actually objects, e.g. journal has nested fields like journal.title, journal.pages etc.

alejandromumo commented 1 year ago

DataCite fields analysis

Source: DataCite metadata schema v4.4

All fields

Legend

M - mandatory 
R - recommended
O - optional 

List


# Mandatory fields

1 Identifier (with mandatory type sub-property) M
2 Creator (with optional name identifier and affiliation sub-properties) M
3 Title (with optional type sub-properties) M
4 Publisher M
5 PublicationYear M
10 ResourceType (with mandatory general type description sub-property) M

# Recommended + Optional fields

6 Subject (with scheme sub-property) R
7 Contributor (with type, name identifier, and affiliation sub-properties) R
8 Date (with type sub-property) R
9 Language O
11 AlternateIdentifier (with type sub-property) O
12 RelatedIdentifier (with type and relation type sub-properties) R
13 Size O
14 Format O
15 Version O
16 Rights O
17 Description (with type sub-property) R
18 GeoLocation (with point, box and polygon sub-properties) R
19 FundingReference (with name, identifier, and award related sub-properties) O
20 RelatedItem (with identifier, creator, title, publication year, volume, issue, number, page, publisher, edition, and contributor sub-properties) O

Missing optional/recommended fields

Mandatory fields only

| DataCite field | Invenio field | Comments |
|---|---|---|
| Identifier | pids.doi.identifier | |
| Identifier.IdentifierType | pids.doi.provider | |
| Creator | metadata.creators | |
| Creator.creatorName | metadata.creators.person_or_org.name | |
| Title | metadata.title | |
| Publisher | metadata.publisher | |
| PublicationYear | metadata.publication_date | They ask solely for the year |
| ResourceType | metadata.resource_type | |
| ResourceType.resourceTypeGeneral | metadata.resource_type.id | They ask for "The general type of a resource." from a list of values, e.g. "Audio" |
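
As a rough illustration of the mapping in the table above, the sketch below checks whether a record dictionary carries a value at each of the listed paths. The `MANDATORY_FIELD_PATHS` dictionary, the `resolve` and `missing_mandatory` helpers, and the example record are hypothetical and exist only for this illustration; this is not how invenio-rdm-records validates or serializes records.

```python
# DataCite property -> dotted path in the record, copied from the table above.
MANDATORY_FIELD_PATHS = {
    "Identifier": "pids.doi.identifier",
    "Identifier.IdentifierType": "pids.doi.provider",
    "Creator.creatorName": "metadata.creators.person_or_org.name",
    "Title": "metadata.title",
    "Publisher": "metadata.publisher",
    "PublicationYear": "metadata.publication_date",
    "ResourceType.resourceTypeGeneral": "metadata.resource_type.id",
}


def resolve(node, path):
    """Follow a dotted path; lists are traversed element-wise. None if absent."""
    for key in path.split("."):
        if isinstance(node, list):
            node = [item.get(key) for item in node if isinstance(item, dict)]
        elif isinstance(node, dict):
            node = node.get(key)
        else:
            return None
    return node


def missing_mandatory(record):
    """Return the DataCite properties for which the record has no value."""
    missing = []
    for field, path in MANDATORY_FIELD_PATHS.items():
        value = resolve(record, path)
        if value in (None, "", []):
            missing.append(field)
    return missing


# Hypothetical record fragment, shaped after the paths in the table.
example = {
    "pids": {"doi": {"identifier": "10.1234/example", "provider": "datacite"}},
    "metadata": {
        "creators": [{"person_or_org": {"name": "Doe, Jane"}}],
        "title": "Example dataset",
        "publisher": "Example Publisher",
        "publication_date": "2023-02-28",
        "resource_type": {"id": "dataset"},
    },
}

print(missing_mandatory(example))
```

Keeping the paths in one dictionary means the check stays in step with the table if a row changes.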

See an example of a record in the DataCite API

Conclusion

We are not missing any of the mandatory fields from DataCite. Some fields (e.g. PublicationYear) differ in definition from what we have.

EDIT: thanks @tmorrell for pointing out that AlternateIdentifier and GeoLocation are already in Invenio's metadata.

tmorrell commented 1 year ago

Couple of comments on the DataCite analysis:

In the serializer we transform the publication date field to just the year, so from the DataCite perspective that field is fully compliant.
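
A minimal sketch of that year-only reduction, assuming the publication date is a string such as `2023-02-28`, `2023-02`, `2023`, or an interval like `2020-01/2021-12`. The `publication_year` helper is hypothetical, and taking the start year of an interval is an assumption; the actual serializer in invenio-rdm-records may handle dates differently.

```python
import re


def publication_year(publication_date):
    """Return the leading four-digit year of a date or date-interval string."""
    # The year must be followed by '-', '/', or the end of the string.
    match = re.match(r"(\d{4})(?:-|/|$)", publication_date.strip())
    if match is None:
        raise ValueError(f"no leading year in {publication_date!r}")
    return match.group(1)


print(publication_year("2023-02-28"))
```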

AlternateIdentifier is present: https://inveniordm.docs.cern.ch/reference/metadata/#alternate-identifiers-0-n

GeoLocation is present in the metadata (https://inveniordm.docs.cern.ch/reference/metadata/#locations-0-n) but not in the deposit form. Box and Polygon will hopefully be available in the DataCite serialized output soon: https://github.com/inveniosoftware/invenio-rdm-records/pull/1144

alejandromumo commented 1 year ago

Software Heritage

These fields are used for software-specific records.

Source: https://codemeta.github.io/terms/

List of fields

| Property | Type | Description | Source |
|---|---|---|---|
| codeRepository | URL | Link to the repository where the un-compiled, human readable code and related code is located (SVN, GitHub, CodePlex, institutional GitLab instance, etc.). | schema.org |
| programmingLanguage | ComputerLanguage or Text | The computer programming language. | schema.org |
| runtimePlatform | Text | Runtime platform or script interpreter dependencies (Example - Java v1, Python2.3, .Net Framework 3.0). Supersedes runtime. | schema.org |
| operatingSystem | Text | Operating systems supported (Windows 7, OSX 10.6, Android 1.6). | schema.org |
| developmentStatus | Text | Description of development status, e.g. Active, inactive, suspended. See repostatus.org | codemeta |

Missing fields

To be discussed: these fields are related to software resources. Should they be implemented as custom fields?
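
To make the question more concrete, here is a rough sketch of how the software fields listed above could look as namespaced entries in a record's custom fields, together with a small validity check. The `code` prefix, the `SOFTWARE_FIELDS` table, and the `validate_software_fields` helper are all hypothetical; nothing in this thread settles the naming, and this is not the InvenioRDM custom fields API.

```python
from urllib.parse import urlparse

# Hypothetical namespace prefix; the thread does not decide on any naming.
NAMESPACE = "code"

# Field name -> expected kind, taken from the list of fields above.
# programmingLanguage ("ComputerLanguage or Text") is treated as plain text here.
SOFTWARE_FIELDS = {
    "codeRepository": "url",
    "programmingLanguage": "text",
    "runtimePlatform": "text",
    "operatingSystem": "text",
    "developmentStatus": "text",
}


def validate_software_fields(custom_fields):
    """Return a list of error messages for the software-related entries."""
    errors = []
    for key, value in custom_fields.items():
        prefix, _, name = key.partition(":")
        if prefix != NAMESPACE:
            continue  # other namespaces are out of scope for this sketch
        kind = SOFTWARE_FIELDS.get(name)
        if kind is None:
            errors.append(f"{key}: unknown field")
        elif not isinstance(value, str) or not value:
            errors.append(f"{key}: expected non-empty text")
        elif kind == "url" and urlparse(value).scheme not in ("http", "https"):
            errors.append(f"{key}: expected an http(s) URL")
    return errors


# Hypothetical custom fields of a software record.
example_fields = {
    "code:codeRepository": "https://github.com/inveniosoftware/invenio-app-rdm",
    "code:programmingLanguage": "Python",
    "code:developmentStatus": "Active",
}

print(validate_software_fields(example_fields))
```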

alejandromumo commented 1 year ago

A note on the DataCite RelatedItem field and journal-related fields (e.g. journal).

After speaking with @slint, we realised that RelatedItem can be used as an extension of RelatedIdentifier. DataCite recommends its usage when the related item does not have an identifier. However, it can be used even when the item does have an identifier.

Therefore, fields such as journal might not need to be implemented as custom fields. Instead, we can extend the RelatedIdentifier field and later serialize it to RelatedItem for DataCite. To be further discussed and analysed.
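
As a sketch of what that serialization step could look like, the function below turns a journal entry into a RelatedItem-style dictionary. The input shape (`title`, `volume`, `issue`, `pages`), the `journal_to_related_item` helper, and the choice of `IsPublishedIn` with an ISSN identifier are assumptions made for illustration. The output key names follow my reading of the DataCite 4.4 JSON representation and should be checked against the schema before use.

```python
from typing import Optional


def journal_to_related_item(journal: dict, issn: Optional[str] = None) -> dict:
    """Build a RelatedItem-style dict from a hypothetical journal entry."""
    item = {
        "relatedItemType": "Journal",
        "relationType": "IsPublishedIn",
        "titles": [{"title": journal["title"]}],
    }
    # The identifier is optional: RelatedItem can be used with or without one.
    if issn:
        item["relatedItemIdentifier"] = {
            "relatedItemIdentifier": issn,
            "relatedItemIdentifierType": "ISSN",
        }
    for key in ("volume", "issue"):
        if journal.get(key):
            item[key] = journal[key]
    # Split a page range such as "12-34" into first and last page.
    pages = journal.get("pages")
    if pages:
        first, _, last = pages.partition("-")
        item["firstPage"] = first.strip()
        if last.strip():
            item["lastPage"] = last.strip()
    return item


print(journal_to_related_item(
    {"title": "Journal of Examples", "volume": "5", "pages": "12-34"}
))
```

Keys that have no value in the input are left out of the output instead of being emitted as empty strings.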