Improve DataCite creator metadata by migrating to a newer version of the schema

only1chunts commented 4 years ago

User Story

> As a site administrator
> I want to upgrade to the latest version of the DataCite schema
> So that the metadata we make available through DataCite is more detailed

Acceptance Criteria

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for the dataset contains the author information as separate elements within the XML schema:

<contributors>
<contributor>
<contributorName>Starr, Joan</contributorName>
<givenName>Joan</givenName>
<familyName>Starr</familyName>
<nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0002-7285-027X</nameIdentifier>
<affiliation affiliationIdentifier="https://ror.org/03yrm5c26" affiliationIdentifierScheme="ROR">California Digital Library</affiliation>
</contributor>
</contributors>
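
A minimal sketch of how the element above could be assembled from separately stored name parts. It uses Python's standard `xml.etree.ElementTree`; the function name `build_contributor` and its parameters are hypothetical and not taken from the GigaDB codebase. The full schema expects more than this sketch emits (for example a `contributorType` attribute), which is omitted here to mirror the example.

```python
import xml.etree.ElementTree as ET


def build_contributor(given_name, family_name, orcid=None,
                      affiliation=None, ror_url=None):
    """Return a <contributor> element that keeps the name parts separate.

    Hypothetical helper: the argument names are assumptions, not GigaDB fields.
    """
    contributor = ET.Element("contributor")
    # The combined name follows the "Family, Given" order used in the example.
    ET.SubElement(contributor, "contributorName").text = f"{family_name}, {given_name}"
    ET.SubElement(contributor, "givenName").text = given_name
    ET.SubElement(contributor, "familyName").text = family_name
    if orcid:
        ident = ET.SubElement(contributor, "nameIdentifier", {
            "schemeURI": "http://orcid.org/",
            "nameIdentifierScheme": "ORCID",
        })
        ident.text = orcid
    if affiliation:
        attrs = {}
        if ror_url:
            attrs = {
                "affiliationIdentifier": ror_url,
                "affiliationIdentifierScheme": "ROR",
            }
        ET.SubElement(contributor, "affiliation", attrs).text = affiliation
    return contributor


contributors = ET.Element("contributors")
contributors.append(build_contributor(
    "Joan", "Starr",
    orcid="0000-0002-7285-027X",
    affiliation="California Digital Library",
    ror_url="https://ror.org/03yrm5c26",
))
xml_text = ET.tostring(contributors, encoding="unicode")
print(xml_text)
```

Because the given and family names are passed as separate arguments, the order cannot be inferred wrongly downstream.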
> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for any external links of type "Additional Information" in that dataset are included in the DataCite metadata, e.g.

<relatedIdentifiers>
<relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementedBy" resourceTypeGeneral="Dataset">http://doi.org/10.5281/zenodo.3251033</relatedIdentifier>
<relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementedBy" resourceTypeGeneral="Dataset">http://cmso.science/MIACME/</relatedIdentifier>
</relatedIdentifiers>

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for any external links of type "GitHub Link" in that dataset are included in the DataCite metadata, e.g.

```
https://github.com/mcapuccini/MaRe
```

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for any external links of type "BioProject Link" in that dataset are included in the DataCite metadata

Example:

```
https://www.ebi.ac.uk/ena/data/view/PRJEB29136
```

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for the associated manuscript(s) in that dataset are included in the DataCite metadata

Example:

```
10.12345/gigaXYZ
GigaScience|GigaByte
Manuscript Title here
2023
```

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for the related dataset(s) in that dataset are included in the DataCite metadata

```
http://doi.org/10.5524/100344
```

NB - currently this functionality only adds the 100344 part; it needs to be corrected to include the full DOI URL as above.

> Given I have a new dataset
> When I mint a DOI for that dataset
> Then the metadata for the Funding in that dataset are included in the DataCite metadata

Example:

```
National Institute for Health Research
http://dx.doi.org/10.13039/501100000272
CRSII5_189921/1
```

NB - the value is stored in our funder table as URI

## Additional Infos

Link to the latest specs: https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf

**Is your feature request related to a problem? Please describe.**

The metadata we export to DataCite currently uses their schema 4.0. They are now up to 4.3, which includes some additions that we could make use of, e.g. resourceType of related identifiers, and contributor types.
In particular:

1 - We need to fix the usage of the creator name fields. At present we are using the generic `[LastName] [FirstName]`. The schema actually allows for the use of first and last name fields to ensure these are used the correct way around, as well as the addition of affiliation if we start collecting that info:

```
Starr, Joan
Joan
Starr
0000-0002-7285-027X
California Digital Library
```

3 - Other relatedIdentifiers should be added to the XML, using dataset [100747](http://dx.doi.org/10.5524/100747) as an example. This dataset has links to GitHub, OSF, Zenodo and SRA BioProject, all of which could be included in the DataCite metadata under the relatedIdentifiers section, e.g.

**Zenodo & OSF (by DOI):**

```
http://doi.org/10.5281/zenodo.3251033
http://doi.org/10.17605/osf.io/zjv86
```

Additionally, we could include protocols.io links as DOIs:

```
http://doi.org/doi-of-relevant-protocols.io
```

**GitHub:**

```
https://github.com/dib-lab/ONT_Illumina_genome_assembly
```

**BioProject links**

```
https://www.ebi.ac.uk/ena/data/view/PRJEB29136
```

Additionally, all DatasetLinks (primary ones only) can be linked in the same way as the BioProject example above, using the Links table to get the relevant paths.

**Describe the solution you'd like**

The mint DOI function in the admin pages needs to be updated to create the DataCite XML as described above.
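
A rough sketch of the two pieces of logic the related-identifier criteria call for: expanding a bare dataset number such as 100344 into a full DOI URL, and deciding whether a link should be exported as a `DOI` or a `URL`. The function names are hypothetical, the prefix `10.5524` is taken from the GigaDB DOIs quoted in this issue, and using `IsSupplementedBy` for every link is an assumption copied from the acceptance-criteria example rather than a decided mapping.

```python
GIGADB_PREFIX = "10.5524"  # prefix seen in the GigaDB DOIs quoted in this issue

DOI_RESOLVERS = (
    "http://doi.org/", "https://doi.org/",
    "http://dx.doi.org/", "https://dx.doi.org/",
)


def full_doi_url(identifier, prefix=GIGADB_PREFIX):
    """Expand '100344' or '10.5524/100344' to a full DOI URL.

    Values that are already URLs are returned unchanged.
    """
    identifier = identifier.strip()
    if identifier.startswith(("http://", "https://")):
        return identifier
    if "/" not in identifier:
        # Bare dataset number: add the prefix (assumed to be a GigaDB dataset).
        identifier = f"{prefix}/{identifier}"
    return f"http://doi.org/{identifier}"


def related_identifier(url, relation_type="IsSupplementedBy"):
    """Return the attributes for one relatedIdentifier entry.

    The default relation type is an assumption taken from the example XML.
    """
    is_doi = url.startswith(DOI_RESOLVERS)
    return {
        "relatedIdentifier": url,
        "relatedIdentifierType": "DOI" if is_doi else "URL",
        "relationType": relation_type,
    }


links = [
    full_doi_url("100344"),
    "http://doi.org/10.5281/zenodo.3251033",
    "https://github.com/mcapuccini/MaRe",
]
entries = [related_identifier(link) for link in links]
for entry in entries:
    print(entry["relatedIdentifierType"], entry["relatedIdentifier"])
```

The same dictionaries could feed either the XML writer or a JSON payload, so the DOI/URL decision lives in one place.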
## Product Backlog Item Ready Checklist

* [ ] Business value is clearly articulated
* [ ] Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
* [ ] Dependencies are identified and no external dependencies would block this item from being completed
* [ ] At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
* [ ] This item is estimated and small enough to comfortably be completed in one sprint
* [ ] Acceptance criteria are clear and testable
* [ ] Performance criteria, if any, are defined and testable
* [ ] The Scrum team understands how to demonstrate this item at the sprint review

## Product Backlog Item Done Checklist

* [ ] Code is complete
* [ ] Automated tests related to the changes are implemented and passing
* [ ] All automated test suites are passing locally
* [ ] Code is refactored to best practices and coding standards
* [ ] Documentation is updated as needed
* [ ] A Pull Request has been created and review requested
* [ ] Pull Request is reviewed and approved
* [ ] The item has been merged to the develop branch
* [ ] All automated test suites are passing on the continuous integration pipeline and item is ready to release

only1chunts commented 4 years ago

When this has been done we also need a script to run over the database and update things in DataCite to the new XML so that we capture as many links as possible in DataCite. Perhaps not ALL datasets, maybe limit it to those with a release date of 2019 or newer.
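
The selection step for such a script could look roughly like this. It is only a sketch: the dictionary keys `doi` and `release_date` and the placeholder DOIs are made up for illustration and are not the GigaDB column names or real datasets.

```python
from datetime import date


def select_for_backfill(datasets, cutoff=date(2019, 1, 1)):
    """Return the DOIs of datasets released on or after the cutoff date."""
    return [d["doi"] for d in datasets if d["release_date"] >= cutoff]


# Placeholder records, not real GigaDB datasets.
sample = [
    {"doi": "10.5524/EXAMPLE-OLD", "release_date": date(2018, 12, 31)},
    {"doi": "10.5524/EXAMPLE-NEW", "release_date": date(2019, 1, 1)},
]
selected = select_for_backfill(sample)
print(selected)
```

Keeping the cutoff as a parameter makes it easy to widen the range later if we decide to update all datasets after all.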

pli888 commented 3 years ago

@only1chunts says the mint DOI button may use Jesse's GigaDB API to generate the DataCite XML and then submit it to DataCite via their API. I can't think of any reason it would need to update anything within the GigaDB database; it's just taking details from the database and passing them to DataCite.

rija commented 3 years ago

As an author I want my dataset to be in DataCite So that it can be automatically propagated to ORCID

only1chunts commented 3 years ago

We have just been informed that the current way we are providing details to DataCite is being parsed by them incorrectly! A user emailed us with this:

I found an issue that I think you will want to address -- it looks like the names are registered incorrectly in the DataCite database, where the DOI is registered. Specifically, preferred names and given names are reversed. I believe this is an error in the way GigaScience is registering citation information in general, since I can see this goes back to an earlier publication of mine as well from a few years ago. I've tried to explain the problem in detail here, so you can correct it. It will probably require going back and re-updating the registered metadata for these DOIs going in the past.

Here are the details:

Crossref points me to DataCite

When I hit the crossref API with this DOI you provide below, I am told that this DOI is registered by datacite:

http://api.crossref.org/works/10.5524/100936/agency

{
  "status": "ok",
  "message-type": "work-agency",
  "message-version": "1.0.0",
  "message": {
    "DOI": "10.5524/100936",
    "agency": {
      "id": "datacite",
      "label": "DataCite"
    }
  }
}

Datacite has the metadata populated incorrectly

When I go to datacite and try to pull the metadata with their API, the given names and surnames are specified incorrectly:

https://api.datacite.org/dois/10.5524/100936

{
  "data": {
    "id": "10.5524/100936",
    "type": "dois",
    "attributes": {
      "doi": "10.5524/100936",
      "prefix": "10.5524",
      "suffix": "100936",
      "identifiers": [],
      "alternateIdentifiers": [],
      "creators": [
        {
          "name": "Nathan, Sheffield C.",
          "nameType": "Personal",
          "givenName": "Sheffield C.",
          "familyName": "Nathan",
          "affiliation": [],
          "nameIdentifiers": [
            {
              "schemeUri": "https://orcid.org",
              "nameIdentifier": "https://orcid.org/0000-0001-5643-4068",
              "nameIdentifierScheme": "ORCID"
            }
          ]
        },
        {
          "name": "Michał, Stolarczyk",
          "nameType": "Personal",
          "givenName": "Stolarczyk",
          "familyName": "Michał",
          "affiliation": [],
          "nameIdentifiers": [
            {
              "schemeUri": "https://orcid.org",
              "nameIdentifier": "https://orcid.org/0000-0003-2101-9061",
              "nameIdentifierScheme": "ORCID"
            }
          ]
        },

... Notice it says my givenName is "Sheffield C." and my familyName is "Nathan". Obviously, it should be that my family name is "Sheffield" and my givenName is "Nathan C.". I discovered this because when I try to use an automated citation importer in JabRef, so I could cite this in my gigascience manuscript, the reference is populated incorrectly:

This should say: Sheffield2021, with "Sheffield, NC", etc. The metadata importer is actually correct -- it's that the names are annotated wrongly in the database. Just to double-check, I tried this for an earlier GigaScience publication of mine from a few years ago, and the same is true there:

https://api.datacite.org/dois/10.5524/100670

{
  "data": {
    "id": "10.5524/100670",
    "type": "dois",
    "attributes": {
      "doi": "10.5524/100670",
      "prefix": "10.5524",
      "suffix": "100670",
      "identifiers": [],
      "alternateIdentifiers": [],
      "creators": [
        {
          "name": "Jason, Smith P.",
          "nameType": "Personal",
          "givenName": "Smith P.",
          "familyName": "Jason",
          "affiliation": [],
          "nameIdentifiers": [
            {
              "schemeUri": "https://orcid.org",
              "nameIdentifier": "https://orcid.org/0000-0002-2688-0988",
              "nameIdentifierScheme": "ORCID"
            }
          ]
        },
        {
          "name": "Michał, Stolarczyk",
          "nameType": "Personal",
          "givenName": "Stolarczyk",
          "familyName": "Michał",
          "affiliation": [],
          "nameIdentifiers": [
            {
              "schemeUri": "https://orcid.org",
              "nameIdentifier": "https://orcid.org/0000-0003-2101-9061",
              "nameIdentifierScheme": "ORCID"
            }
          ]
        },
        {
          "name": "Nathan, Sheffield C.",
          "nameType": "Personal",
          "givenName": "Sheffield C.",
          "familyName": "Nathan",
          "affiliation": [],
          "nameIdentifiers": [
            {
              "schemeUri": "https://orcid.org",
              "nameIdentifier": "https://orcid.org/0000-0001-5643-4068",
              "nameIdentifierScheme": "ORCID"
            }
          ]
        },

I believe you have a systematic error in the way you are populating the datacite metadata. Do you think you can correct this? Thanks,

Nathan
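
One way to find the affected records would be to compare the `creators` block returned by DataCite with the family names we hold, matched on ORCID. The sketch below works on an already-parsed response, so no API call is shown. The function name and the `expected_family_by_orcid` lookup are hypothetical; the sample values are the ones quoted in the email above.

```python
def mismatched_creators(creators, expected_family_by_orcid):
    """Return (orcid, registered_family_name, expected_family_name) tuples
    for creators whose familyName in DataCite differs from the expected one.

    Creators without a known ORCID are skipped.
    """
    mismatches = []
    for creator in creators:
        for ident in creator.get("nameIdentifiers", []):
            orcid = ident.get("nameIdentifier")
            expected = expected_family_by_orcid.get(orcid)
            if expected is not None and creator.get("familyName") != expected:
                mismatches.append((orcid, creator.get("familyName"), expected))
    return mismatches


# Trimmed copy of one creator entry from the DataCite response quoted above.
creators = [
    {
        "name": "Nathan, Sheffield C.",
        "givenName": "Sheffield C.",
        "familyName": "Nathan",
        "nameIdentifiers": [
            {"nameIdentifier": "https://orcid.org/0000-0001-5643-4068"}
        ],
    }
]
# Expected family name as stated by the author in the email.
expected = {"https://orcid.org/0000-0001-5643-4068": "Sheffield"}

report = mismatched_creators(creators, expected)
print(report)
```

Matching on ORCID avoids comparing free-text names, but authors without an ORCID would need a different check.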

rija commented 2 years ago

Related to (but not dependent on) #115

rija commented 1 year ago

Datacite JSON REST API DOCs: https://support.datacite.org/docs/api

Sample input in JSON format: https://support.datacite.org/docs/api-create-dois

rija commented 1 year ago

JSON equivalent of the relatedIdentifier element:

{
    "data": {
        "attributes": {
            "relatedIdentifiers": [
                {
                    "relatedIdentifier": "https://doi.org/10.xxxx/xxxxx",
                    "relatedIdentifierType": "DOI",
                    "relationType": "References",
                    "resourceTypeGeneral": "Dataset"
                }
            ]
        }
    }
}

https://support.datacite.org/docs/updating-metadata-with-the-rest-api
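
A sketch of how that JSON could be wrapped into an update request with the Python standard library. It only builds the request and does not send it. The helper name, the placeholder DOI `10.5524/000000`, and the exact payload envelope are assumptions to be checked against the DataCite docs linked above; authentication is left out entirely.

```python
import json
import urllib.request

API_ROOT = "https://api.datacite.org/dois"


def build_update_request(doi, related_identifiers):
    """Build (but do not send) a PUT request updating relatedIdentifiers.

    Hypothetical helper: credentials would still have to be added before
    the request could be sent.
    """
    payload = {
        "data": {
            "type": "dois",
            "attributes": {"relatedIdentifiers": related_identifiers},
        }
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/{doi}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/vnd.api+json"},
    )


request = build_update_request(
    "10.5524/000000",  # placeholder DOI, not a real dataset
    [
        {
            "relatedIdentifier": "https://doi.org/10.xxxx/xxxxx",
            "relatedIdentifierType": "DOI",
            "relationType": "References",
            "resourceTypeGeneral": "Dataset",
        }
    ],
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the payload easy to inspect or unit test before anything is pushed to DataCite.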

only1chunts commented 10 months ago

NB- The DataCite Schema v3.5 (which we currently use) will be deprecated at the end of 2024, so we MUST get this update done before then.

only1chunts commented 8 months ago

Maybe of interest: here is a presentation by DataCite staff describing the differences in schema v4.5 and the deprecation of the v3 schema. I believe we are actually using v4.0 (which was released in 2016). https://youtu.be/i_4Uf_VB5Rw

only1chunts commented 5 months ago

(in progress, will update comment here when its complete) Mapping of DataCite schema v4.5 to GigaDB schema terms https://docs.google.com/spreadsheets/d/18x5l8GU8FNV3Og_RCF_Ei142tSbLv45TZN4Tfqko6sk/edit#gid=1158026556

ScottBGI commented 3 months ago

Can I add a vote/appeal to prioritise updating the metadata as soon as possible, as it should be a very easy and simple update? We need to strike while the iron is hot. If we drop the ball and wait too long to address this, the metadata here will get out of date again, we will want to update it again and probably wait for another opportunity to do so, and it will drag on further and waste having had a funded internship this summer.

ScottBGI commented 3 months ago

The updated metadata is here: https://github.com/Jeffrey-yu-hc/GIGADB-MAPPING/blob/main/gigadb_mapping_coding(final_edition).py

only1chunts commented 3 months ago

The link @ScottBGI provided is the code written by the intern. It is designed to run in Google Colab as an IPython notebook. In theory it can create DataCite schema 4.5 compatible files for each dataset in GigaDB. He did not get as far as validating any of the output. On visual inspection the output looks good, and the mapping of GigaDB elements to DataCite elements appears correct.

only1chunts commented 2 months ago

We have just had another random user point out the issue with the GigaDB "Cite Dataset" button: it messes up the authors because DataCite messes them up when we pass it the details in the current format. Updating the schema we use will fix that issue, as long as we also update the DataCite details of all existing datasets.

only1chunts commented 2 months ago

FYI - the code run by Jeffery is also in a Colab notebook here: https://colab.research.google.com/drive/1kkf36UdU4BIdP_bmBme4swETCNpZy_uo