inveniosoftware / datacite

Python API wrapper for the DataCite API.
https://datacite.readthedocs.io
Other
27 stars 33 forks source link

4.5 #87

Closed edager closed 2 months ago

edager commented 7 months ago

:heart: Thank you for your contribution!

Description

Please describe briefly your pull request.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Third-party code

If you've added third-party code (copy/pasted or new dependencies), please reach out to an architect.

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.
edager commented 6 months ago

@tmorrell is there something you need to be able to merge?

tmorrell commented 6 months ago

Nope, I just need to find time to review. Hopefully soon!

edager commented 6 months ago

Lacking the most precious element of all, no worries. Just wanted be sure I was going about this the right way

srerickson commented 3 months ago

Apologies for this uninvited comment/question. I've been looking at this repo, and this PR in particular (thank you!), while attempting my own implementation of DataCite's schema in Go. I noticed a discrepancy in DataCite's handling of latitude/longitude values, and I wanted to see how you're handling it. The "full" json example linked in the schema docs, includes latitudes and longitudes as strings (quoted floating point values). The API docs also define lat/long fields as strings.

However, many live DOIs I've looked at (for example) use numeric values. I also noticed that the json schema in this PR defines latitude and longitude as numeric values. I'm wondering why.

I'm genuinely not sure what approach I should take. Python is much more flexible and forgiving than Go when it comes to these data type issues!

tmorrell commented 3 months ago

There had been a long discussion about strings versus numbers, though I can't find a link at the moment. In the 4.3 schema we went with strings, because DataCite used them and that ensures that precision isn't lost when transferring between formats. In InvenioRDM when we implemented datacite-style internal metadata we did locations as numbers because we were following geojson. Numbers also allows you to validate that you're in the right range for the values, and it's what people are more familiar with.

This PR uses numbers...so we might switch in 4.5 but also might not.

On a related note myself and @mfenner are planning a work day in June to get this PR reviewed and get a 4.5 release candidate up.

edager commented 3 months ago

@srerickson I understand the confusion, and thanks @tmorrell for the explanation. I have an open issue on Datacite's repo regarding the string vs. number issue. The XSD (xml schema) that Datacite's schema is based on specifies that it should be numbers, however as you point out, their json maps it to string. I haven't heard if they plan to change it to numbers, so we might have to turn it into strings again.

mfenner commented 3 months ago

I am fine with numbers instead of strings for publicationYear, longitude and latitude for the reasons mentioned before. @srerickson I implemented an initial DataCite metadata conversion in Go in the last few weeks: https://github.com/front-matter/commonmeta. I found that in a few places I had to handle multiple types returned by DataCite, basically publisher with since schema 4.5 can be string and struct, affiliation which also can be string or struct, and publicationYear, which can be number and string (for example API vs. data dump). I managed to do this in Go without needing an interface. https://github.com/front-matter/commonmeta/blob/main/datacite/reader.go#L53

Similar issues happen in Python, e.g. if you want to parse the DataCite data dump in addition to the API response.

As @tmorrell said, we will try to wrap this issue up in June.

srerickson commented 3 months ago

Thank you all for the feedback -- this is a very helpful! @mfenner, I'll check out your implementation

tmorrell commented 2 months ago

I reviewed this with @mfenner today and this looks good to go with two changes.

This PR will also need to be re-targeted at main and rebased (there are black formatting changes we'd like to pick up).

I'm planning on working on this in the next few days, but feel free to make these fixes if you'd like. I might need to make a new PR to do the re-targeting.

edager commented 2 months ago

@tmorrell @mfenner thanks for the review, and I can of course incorporate the changes.

  • Change publication year to string (DataCite does this and it makes a bit more sense)

It should already be of type string, please however be advised that publicationYear and relatedItem.publicationYear has different types (integer and string respectively) in the JSON REST API even though this is not the case in the XSD (I raised an issue on that).

I can also reformat this with black, however, it might make sense to wait until after a merge to do this on the whole repo for the sake of easier reading of the commit history? It's not a strong opinion, I just wanted to hear what is most convenient for you.

tmorrell commented 2 months ago

Sorry...I misstyped. Switch publication year to number. I'm going to try to rebase the 4.5 branch so it picks up the black formatting. I applied it to the whole repo after making the branch. Apologies again for this taking so long!

tmorrell commented 2 months ago

@edager I've updated the 4.5 branch and rebased, so you should do a pull if you're making changes. I'll probably try to do the publication year switch and regex later today so we can get this merged into the branch.

tmorrell commented 2 months ago

I've pushed some incomplete changes. During some final testing I realized that additionalProperties doesn't impact child values, which didn't flag errors. I also had to use unevaluatedProperties for the allOf structures in order to make that checking work. Please let me know whether these changes look right.

I switched over publicationYear, but have now realized that DataCite implemented it inconsistently in relatedItem. https://api.test.datacite.org/dois/10.82433/q54d-pf76?publisher=true&affiliation=true. I think I might just switch it back to string.

I'll try to wrap up the updates tomorrow.

mfenner commented 2 months ago

@tmorrell I would support sticking with publicationYear as number, despite some inconsistencies.

tmorrell commented 2 months ago

I've wrapped up my changes, you can see the main ones here https://github.com/inveniosoftware/datacite/pull/87/commits/ac63db3a82b9f150e5d786e782d27676d638c5a0

I ended up updating geoLocationPolygon so it matched with the DataCite REST API response (though you can't send in a polygon due to https://github.com/inveniosoftware/datacite/issues/93), had to switch out all the regexes, and switched schemeURI in publisher which I discovered when I tightened up the schema. I also removed url from required, since when you're using the schema with the python library you can provide the url with your functions.

I'm going to merge this into the 4.5 branch, and then open up a new PR for the merge into main. My thought is I'll leave that one open for a week in case there is more community discussion (such as publicationYear)