Support for 4.5 Schema - Githubissues

tmorrell commented 7 months ago

We should support the DataCite 4.5 Schema. https://doi.org/10.5438/jvkk-8198

### Tasks
- [x] Transfer over starting schema from @mfenner https://github.com/front-matter/commonmeta-py/blob/main/commonmeta/resources/datacite-v45.json
- [x] Transfer all example 4.5 files from https://schema.datacite.org/meta/kernel-4.5/
- [x] Resolve doi/identifiers https://github.com/inveniosoftware/datacite/issues/80
- [ ] Check if anything in https://github.com/inveniosoftware/datacite/issues/35 is relevant

mfenner commented 7 months ago

The datacite-v45.json schema I started didn't spend a lot of time aligning with previous versions. The two new resourceTypeGeneral are easy, the new publisher property is tricky. I decided to allow either a string or a dict as publisher, there are obviously other ways to do that (in the related commonmeta publisher has always been a dict because of the mapping to other formats, e.g. schema.org).

mfenner commented 7 months ago

One question I have is continuing support for DataCite XML. Do we need that? Internally DataCite is storing everything as JSON (mainly because the metadata go into Elasticsearch). Do we need DataCite XML in Invenio?

tmorrell commented 7 months ago

The XML export got added to RDM, so I'm planning on keeping it in here until someone tells me otherwise.

edager commented 7 months ago

As of v4.4 (i.e. also v4.5) there's the relatedItems array which is a bit of beast in that it's comparable to the rest of the record in complexity. Furthermore, it reuses the structure of "creators" and "contributors", which themselves only differ by a single nested field. It might be worth it to:

Make a definitions of creator

Make definition of contributor by extending the creator:

    "contributor": {
        "type": "object",
        "allOf": [{ "$ref": "#/definitions/creator" }],
        "properties": {
            "contributorType": {"$ref": "#/definitions/contributorType"}
        },
        "required": ["name", "contributorType"]

reference these two objects as appropriate

I have a v4.5 implementation that among other things include the following that needs to be polished and tested a bit, if you want it?

tmorrell commented 7 months ago

That would be awesome. In the RDM schema we already do the same sharing between creator and contributor, so sharing the structure here makes a lot of sense.

edager commented 7 months ago

That would be awesome. In the RDM schema we already do the same sharing between creator and contributor, so sharing the structure here makes a lot of sense.

Cool, which workflow is most convenient for you? Should I fork and do a PR to a specific branch or something ( I'm supposing the schema45.py and some tests will have to be implemented as well, so you probably want it stepwise) ?

mfenner commented 7 months ago

Very nice. I do something similar with creator vs. contributor in my related commonmeta-py library. One more reason there is that Crossref and ORCID call creators contributors (in the case of Crossref with role author).

tmorrell commented 7 months ago

I just made a 4.5 branch, so you can fork and do a PR to that. If I have time I might start adding in some of the test files and architecture. I anticipate it will be an iterative process till we get it working the way we want to. Make sure to include copyright statements at the top of files for any code you add.

edager commented 7 months ago

@tmorrell, I'm not entirely sure what the best practise is for copyright statements in json-schemas, json is a bit weird when it comes to comments? Obviously it will also be MIT, but I probably should mention the academical institution I work for.

I've made the first version of a 4.5 compliant json-schema.Do you want a PR already or more complete implementation (not sure how much time I'll have the next weeks though)? @mfenner any chance I can get you to have a look at the json-schema as well?

tmorrell commented 7 months ago

Ah, I forgot the json schema doesn't include comments. Add it to https://github.com/inveniosoftware/datacite/blob/master/LICENSE in your PR.

I'd lean toward putting in a PR now. I'm not sure yet how we'll work in the branch...if we just merge things in and work on them or work on PRs until they are more solid. But having a PR is a good first step.

mfenner commented 7 months ago

The 4.5 compliant schema looks good too me. Some minor points (the first two were also in the previous schema version):

resourceType is not required, only resourceTypeGeneral
identifiers is a problem with the datacite json that is my fault. It is a duplication of alternateIdentifers, confusing (the original intention is no longer relevant here) and I think can be dropped
publisher object should have "required": ["name"], and it also has property "lang": {"type": "string"}

Some more cleanup work could be done to remove redundancies.

edager commented 7 months ago

@mfenner thanks for the suggestions :)

I'm also not sure whether it makes sense to keep both identifiers and alternateIdentifiers, especially as all content of alternateIdentifiers will be ignored by the REST API if identifiers is supplied as per this note .

I personally favour only keeping alternateIdentifiers as it has the same name in XSD and JSON-schema then. With that said though, it's an allowed field so just want to get confirmation from @tmorrell that it should be removed.

I'll also convert longitude and latitudes to be number types with bounds one them see here

mfenner commented 7 months ago

Sorry, I meant to only keep alternateIdentiers and drop identifiers.

With removing redundancies I meant mainly to look into relatedItems. It is basically a superset of relatedIdentifiers.

edager commented 7 months ago

Sorry, I meant to only keep alternateIdentiers and drop identifiers.

With removing redundancies I meant mainly to look into relatedItems. It is basically a superset of relatedIdentifiers.

@mfenner I was trying to agree with you, but apparently it didn't came across like that, sorry about that.

And it's a good point about relatedItems and relatedIdentifiers, I'll try to simply them.

edager commented 7 months ago

I'll make the first PR with the above mentioned things incorporated.

There's one thing we might have to investigate further, namely wheter "additionalProperties": false have to be set for all subschemas in order for it to have the intended effect. Could be we need to also use "unevaluatedProperties"

edager commented 7 months ago

@tmorrell , I was looking into transfering all the 4.5 example files, but that led to a implementation concern that I'm not sure how to address: How much responsibility should the jsonschema have? In particular:

Only being strictly compatible with the XSD
Contain enough information to ensure that a publish POST request via API will not have missing fields ("url", "doi"/"prefix"/"suffix" )
Be able to validate (published) records from GET requests to the API ("additionalProperties": false should no longer be there in that case)

I'm asking as the json files in the previous versions are direct translations of xml (which makes sense), however the jsons available from here: https://schema.datacite.org/meta/kernel-4.5/ are REST records which contains enrichment fields not found in the schema (citation count, record created, etc.).

Personally I would prefer if DataCite would more clearly separate the information given by the record creator (direct translation of xml to json) from the information datacite is enriching the record with, but I doubt they'll do that to not mess with the workflows of current consumers.

I can implement it either way, just want to know what is the most sensible way forward.

tmorrell commented 7 months ago

For 4.3 we checked that the XML and JSON->XML versions were valid based on the XSD. We also tested that the JSON files would POST via the python wrapper (which adds the prefix and url) https://github.com/inveniosoftware/datacite/blob/b7829497ed5c5480901982ee83a6ed25650688bd/tests/test_rest.py#L90.

The"additionalProperties": false bit is kind of tricky. I think we need either set it to true or add all the extra fields in the REST output to the json schema. The motivation to have additionalProperties: false is that it helps catch typos...and it definitely has saved me a few times. But it's also a lot of extra fields to add to the jsonschema that people shouldn't really use. I'm not sure about the best solution for this one.

edager commented 7 months ago

I'm also not a big fan of removing the additionalProperties, in fact I would like to enforce this at all levels as in 4.3 it's only being enforced at the top level as far as I can tell?

Furthermore DataCite might want to continually add new features/fields to the REST API without adding them to the XSD so we would have to keep up two things. I'm leaning towards option 1, especially when the wrapper is taking care of the repository specific things like url an doi as you say.

So I'll only transfer the xml ones and make xsd compatible json examples from them?

tmorrell commented 7 months ago

Yeah, I think that's probably the best path forward. I was hoping we'd be able to use the DataCite ones....but the REST json is just too different. As long as we can validate that we can send the JSON representations to DataCite and mint DOIs that should be sufficient.

mfenner commented 7 months ago

I agree with using "additionalProperties": false and with focusing on DOI registration. But I would like to add url and doi, as they are required for DOI registration. The wrapper is nice, but a dependency that might not be there when registering DOIs outside InvenioRDM. And I would avoid multiple flavors of a DataCite JSON Schema.

I hope that in the not too distant future DataCite drops the XML and XSD, as internally all DOI metadata are stored as JSON in Elasticsearch and MySQL, similar to how InvenioRDM stores metadata.

edager commented 7 months ago

@mfenner I can see your point of wanting to have a single authoritative schema, but that almost sounds like it would be best if the schema lived in a separate repository, then?

@tmorrell Are you onboard with including "url" as required, "oneOf": ["doi", "prefix", {"prefix", "suffix"}] , and "event" as optional?

edager commented 7 months ago

@mfenner @tmorrell if we need the schema to be able to validate a complete request payload, we also need to add the following fields, and place all the current properties within "attributes":

{
    "data": {
      "type": "dois",
      "attributes": {}
    }
}

It's beginning to feel like the json schema is validating implementation details of the REST API, which might lead to frequent updates of the json schema to keep up with the REST API ( in addition to keeping up with xsd schema updates). It's not a deal breaker, its more of a question of how often do we prefer changing the python wrapper rather than the json-schema?

mfenner commented 7 months ago

I would not go down this path, I.e. the JSONAPI specification. But I would want to have the DOI and URL in the schema as they are essential elements. The URL is not in the XSD for historical reasons, DOI registration used to require two API calls, one with the handle system to store DOI and URL and one to store metadata.

inveniosoftware / datacite

Support for 4.5 Schema #81