proycon / codemetapy

A Python package for generating and working with codemeta
https://codemeta.github.io/
GNU General Public License v3.0

Add support for ORCIDs #30

Open proycon opened 1 year ago

proycon commented 1 year ago

Authors are best identified by their ORCID. Ideally we need a way to resolve user e-mails to ORCIDs automatically (does their API offer such a function?).

broeder-j commented 1 year ago

Yes, it does: https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/ BUT (and this is what I figured could go wrong): users' e-mails are by default not visible to the outside; a member has to upgrade the visibility to either internal or public for each e-mail address. So only if people have done this do you have a chance of finding them by e-mail via an authorized query to the API. I think most people do not change the default, so I expect this approach to yield about 10%. (Test query: https://pub.orcid.org/v3.0/csv-search/?q=affiliation-org-name:ORCID&fl=orcid,given-names,family-name,current-institution-affiliation-name,email)

A better way could be to find people by name plus affiliation, i.e. institution name or identifier. Here codemetapy probably only has a chance if the institution is given or if it can derive it from the metadata already there. How to do this I do not know, since contributors can be from anywhere; maybe a first step would be to allow passing a list of institutions to try.
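
To make the two lookup strategies concrete, here is a minimal sketch that only builds the search URLs and sends nothing. It reuses the `csv-search` endpoint and the field names from the test query above; the `email:` search field, the `AND` combination and the function names are assumptions for illustration, not codemetapy code.

```python
from urllib.parse import urlencode

# Endpoint and column names are copied from the test query quoted in this thread.
ORCID_CSV_SEARCH = "https://pub.orcid.org/v3.0/csv-search/"
FIELDS = "orcid,given-names,family-name,current-institution-affiliation-name,email"


def _quote(value: str) -> str:
    """Wrap a value in double quotes so embedded spaces stay in one query term."""
    return '"' + value.replace('"', '\\"') + '"'


def email_query_url(email: str) -> str:
    """Search URL for a lookup by e-mail (assumes 'email' is a searchable field).

    This can only match records whose owner made the e-mail address visible.
    """
    query = "email:" + _quote(email)
    return ORCID_CSV_SEARCH + "?" + urlencode({"q": query, "fl": FIELDS})


def name_affiliation_query_url(given: str, family: str, affiliation: str) -> str:
    """Search URL for a lookup by given name, family name and institution name."""
    query = " AND ".join([
        "given-names:" + _quote(given),
        "family-name:" + _quote(family),
        "affiliation-org-name:" + _quote(affiliation),
    ])
    return ORCID_CSV_SEARCH + "?" + urlencode({"q": query, "fl": FIELDS})
```

With a list of candidate institutions, one would call `name_affiliation_query_url` once per institution and only accept a result when exactly one record matches.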

Let me know if you plan to work on this. I have an outline of what I want, but I have not implemented anything yet and it is currently not on my to-do list.

In terms of existing code, I found these two projects, which are old and may or may not work: https://github.com/ORCID/python-orcid and https://github.com/scholrly/orcid-python

proycon commented 1 year ago

emails of users are per default not visible to the outside, a member has to upgrade this to either internal or public on a per email level. So only if people have done this you have a chance to find them via an authorized query to the API by email. I think most people do not change the default, so i expect this way to yield 10%.

Too bad, this would be the ideal method but if it yields only 10% it's not very useful indeed.

A better way could be to find people over name, plus affiliation, i.e. institution name or identifier.

That sounds viable yes, though one issue with affiliations is that people tend to come and go in institutions.

..maybe a first thing would be to allow for a list to try.

Like explicitly passing a TSV file to codemetapy with, say, e-mails and ORCIDs? That would work, yes, though it isn't as fully automated as we'd ideally want.
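
A rough sketch of what such a mapping file could look like in practice. The two-column layout (e-mail, ORCID), the `#` comment convention and the function names `load_orcid_map` and `apply_orcid_map` are all assumptions for illustration; nothing like this exists in codemetapy yet.

```python
import csv
import io
import re

# Accepts a bare ORCID iD or its https://orcid.org/ URI form (format check only,
# the checksum digit is not verified).
ORCID_RE = re.compile(r"^(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def load_orcid_map(stream) -> dict:
    """Read 'email<TAB>orcid' rows into {lowercased e-mail: ORCID URI}."""
    mapping = {}
    for row in csv.reader(stream, delimiter="\t"):
        # Skip blank lines, short rows and comment/header lines starting with '#'.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        match = ORCID_RE.match(row[1].strip())
        if match:
            mapping[row[0].strip().lower()] = "https://orcid.org/" + match.group(1)
    return mapping


def apply_orcid_map(person: dict, mapping: dict) -> dict:
    """Return a copy of a Person dict whose @id is the mapped ORCID, if any."""
    orcid = mapping.get(person.get("email", "").strip().lower())
    return {**person, "@id": orcid} if orcid else dict(person)


# Hypothetical file content; 0000-0002-1825-0097 is ORCID's fictitious sample record.
SAMPLE_TSV = "# email\torcid\njosiah.carberry@example.org\t0000-0002-1825-0097\n"
mapping = load_orcid_map(io.StringIO(SAMPLE_TSV))
print(apply_orcid_map({"@type": "Person", "email": "Josiah.Carberry@example.org"}, mapping))
```

Matching on the lowercased e-mail keeps the lookup tolerant of case differences between commit metadata and the mapping file; rows with a malformed ORCID are silently dropped in this sketch, where a real implementation would probably warn.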

broeder-j commented 1 year ago

An add-on to this: codemetapy parses the CITATION.cff file, but it does not use the ORCIDs in there as author/contributor IDs; it uses the GitLab ID (account page) instead: "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".

Ideally one would keep both pieces of information, i.e. record somewhere that the ORCID and the git ID refer to the same person.
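
One possible way to keep both identifiers, sketched under assumptions: the ORCID from the CITATION.cff author entry becomes the `@id`, and the other identifier is kept under schema.org's `sameAs`. The function name `cff_author_to_person` and the `fallback_id` parameter are hypothetical; the input keys (`given-names`, `family-names`, `email`, `orcid`) are the ones the Citation File Format defines for a person.

```python
from typing import Optional


def cff_author_to_person(author: dict, fallback_id: Optional[str] = None) -> dict:
    """Convert one CITATION.cff author entry into a schema.org Person dict.

    The ORCID wins as @id when present; the previously used identifier
    (fallback_id) is then preserved under sameAs instead of being dropped.
    """
    person = {"@type": "Person"}
    orcid = author.get("orcid")
    if orcid:
        person["@id"] = orcid
        if fallback_id and fallback_id != orcid:
            person["sameAs"] = fallback_id
    elif fallback_id:
        person["@id"] = fallback_id

    # Copy over the plain properties that have a direct schema.org counterpart.
    for cff_key, schema_key in (
        ("given-names", "givenName"),
        ("family-names", "familyName"),
        ("email", "email"),
    ):
        if author.get(cff_key):
            person[schema_key] = author[cff_key]
    return person
```

Whether `sameAs` is the right property for a project-internal person URI is a design question; the point of the sketch is only that neither identifier has to be lost.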

Also in that context: the familyName and givenName parsing is not optimal when the person's link does not contain the name. Example:

    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347",
        "@type": "Person",
        "email": "cMax347@max.merte@gmail.com",
        "familyName": "",
        "givenName": "cMax347",
        "position": 71
    },
    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/christian-roman-gerhorst",
        "@type": "Person",
        "email": "c.gerhorst@fz-juelich.de",
        "familyName": "Gerhorst",
        "givenName": "Christian-Roman",
        "position": 72
    }

So it also has problems with middle names. I would assume that these would be easier to parse from a CITATION.cff file.

proycon commented 1 year ago

An add-on to this: codemetapy parses the CITATION.cff file, but it does not use the ORCIDs in there as author/contributor IDs

Hmm... Agreed: if there are ORCIDs then they shouldn't be overwritten. I wonder whether it's an issue in codemetapy or in https://github.com/citation-file-format/cff-converter-python, as we don't do the CITATION.cff parsing ourselves.

but instead the gitlab id (account page) "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".

(it's not the gitlab id, see #34)

Ideally one would keep both pieces of information, i.e. record somewhere that the ORCID and the git ID refer to the same person.

proycon commented 1 year ago

Also in that context: the familyName and givenName parsing is not optimal when the person's link does not contain the name. Example:

    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347",
        "@type": "Person",
        "email": "cMax347@max.merte@gmail.com",
        "familyName": "",
        "givenName": "cMax347",
        "position": 71
    },

Yes, we'd better just use schema:name if we can't decipher given and family names; this needs some fine-tuning. That e-mail looks malformed too. For the actual name parsing from arbitrary strings I'm using nameparser.
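
A minimal sketch of that fallback, to illustrate the intended behaviour only. The naive whitespace split below is a stand-in for the real name parsing (it is not nameparser and handles far fewer cases), and both `describe_person` and the loose e-mail check are hypothetical, not codemetapy code.

```python
import re

# Loose sanity check: exactly one '@' and a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def describe_person(display_name: str, email: str = "") -> dict:
    """Build a Person dict, falling back to 'name' when no split is possible."""
    person = {"@type": "Person"}
    parts = display_name.split()
    if len(parts) >= 2:
        # Treat the last token as the family name and the rest as given names.
        person["givenName"] = " ".join(parts[:-1])
        person["familyName"] = parts[-1]
    elif parts:
        # A single token (e.g. a user handle) cannot be split reliably,
        # so emit schema:name instead of an empty familyName.
        person["name"] = parts[0]
    if EMAIL_RE.match(email):
        person["email"] = email
    return person
```

Applied to the first record quoted above, this would emit only a `name` and drop the malformed e-mail, instead of producing an empty `familyName`.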

proycon commented 1 year ago

I've been giving this some more thought and there are some challenges to solve, mostly related to 'affiliations':

  1. In the current implementation, whenever an author appears in multiple software metadata projects (or even multiple times in the same one), there is a high risk of properties getting conflated if they are not consistently named. The most notable one is 'affiliation'. If an author has different affiliations at various points (or even the same one, but not consistently named), then these will all be propagated to all instances when the full graph of multiple software projects is loaded.
  2. Related to the above: 'affiliation' is a property of a schema:Person. That means it is not attached to any specific software project, so we can't differentiate between affiliations at the time of the software project and those held earlier or later. We'd always get all of them, which may be less informative than desired. It's common for people to have (had) multiple affiliations throughout their career. We do use schema:producer to tie software projects to institutions directly, so at least that is expressible (relates to codemeta/codemeta#286).
  3. We already ascertained that automatically going from names or e-mails to ORCIDs is hard. We probably need a custom mapping as input (like a tsv file).
  4. The reverse, going from ORCIDs to all the names/e-mails/URLs, is fairly easy: we can just query orcid.org and request application/ld+json to get a schema.org representation that is compatible with codemeta. Some caveats there:
    • It does not contain the e-mail, even if it is public. The Turtle output, however, does (it uses a completely different vocabulary than the JSON-LD serialisation).
    • The JSON-LD output lists all affiliations it knows, including those that have ended, without indicating which ones have ended. The Turtle output lists no affiliations at all.
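
The lookup from point 4 can be sketched with the standard library alone. This is an illustration under assumptions: it relies on orcid.org answering content negotiation for `application/ld+json` as described above, and the function names are hypothetical. Building the request is kept separate from sending it so the former can be checked without network access.

```python
import json
import urllib.request


def build_orcid_request(orcid: str) -> urllib.request.Request:
    """Prepare a request for the JSON-LD (schema.org) view of an ORCID record."""
    # Accept either a bare iD or the full https://orcid.org/ URI.
    url = orcid if orcid.startswith("http") else f"https://orcid.org/{orcid}"
    return urllib.request.Request(url, headers={"Accept": "application/ld+json"})


def fetch_orcid_person(orcid: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the record; requires network access."""
    with urllib.request.urlopen(build_orcid_request(orcid), timeout=timeout) as response:
        return json.load(response)
```

Given the caveats above, the result would still need post-processing: the e-mail has to come from elsewhere, and the affiliations cannot be assumed to be current.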

proycon commented 4 months ago

Possibly relevant: ORCID profiles can be tied to GitHub accounts. If the GitHub API exposes this, it provides a nice way to find ORCIDs.

See https://scicomm.xyz/@ORCID_Org/112282433046701907