Closed andrewsu closed 4 months ago
I'll look into this in depth later but off the bat I'll say that figuring out which fields to use was very difficult. I wasn't able to find any documentation on what fields meant what or which ones I should use. And in testing with Casey Greene's large listing I found that sometimes one field was the canonical paper and sometimes another one was. So, I'm not sure it's even consistent enough to always use one field or the other.
If I recall (I'm responding on mobile right now), I have it look all the way through one section of the response for a doi, then move on to another? Maybe it needs to check the first entry in one section, then the first entry in another, then back to the first section, and so on.
But this does feel like a situation where fixing it for you would break it for someone else. I'm not sure how to handle it.
You should be able to modify your orcid.py script to do what you want in the meantime. I might actually try to rope in a developer from orcid to tell me which fields to use.
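The alternating check floated above (first entry of one section, then first entry of the other, then back again) could be sketched roughly as follows. This is only an illustration: `interleave` and the two sample lists are hypothetical names and placeholder values, not part of the plugin or of any real ORCID response.

```python
from itertools import chain, zip_longest


def interleave(*sections):
    """Round-robin merge: first item of each section, then second of each, and so on."""
    missing = object()  # sentinel marking sections that have run out of items
    merged = chain.from_iterable(zip_longest(*sections, fillvalue=missing))
    return [item for item in merged if item is not missing]


# Placeholder ID lists standing in for two sections of a response (hypothetical values)
top_level = ["doi:A1", "doi:A2", "doi:A3"]
summaries = ["doi:B1", "doi:B2"]

ordered = interleave(top_level, summaries)
print(ordered)
```

Whether alternating like this would actually pick the right ID more often is an open question; it only changes the order in which candidates are examined.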
After reading your reply, I realized I had a slight misconception about how the plugin worked. That means a potential enhancement might be easier to make than I was anticipating. Let me play around with it a bit -- I'll post here with more details. So no need for you to look at this more for the moment... Thanks for the quick reply!
So after a closer look, the orcid plugin already gets IDs from the `group.external-ids.external-id` section and the `group.work-summary.external-ids.external-id` section, in that order. However, as documented above, the ordering in the `group.work-summary.external-ids.external-id` section appears to better reflect the user's intent (as people have curated their ORCID profile) than the `group.external-ids.external-id` section. So in my local version (git diff pasted below), I just reversed the priority order between those two sections. I also added a filter on `external-id-relationship`, requiring that to be "self". (When that value is "version-of" or "part-of", then those external-ids are less appropriate.)
To test this, I compared the IDs returned by both the current orcid.py and my local orcid.py for both my ORCID profile and Casey's ORCID profile. The list of IDs retrieved for me had differences in 7 out of 145 sources -- in all cases the local version returned IDs that reflected the "preferred source" in my ORCID profile (the journal version, not the biorxiv version), which to me is the desired behavior.
current orcid.py | new orcid.py |
---|---|
id: doi:10.1101/2021.04.15.440028 | id: doi:10.1093/bioinformatics/btac205 |
id: doi:10.1101/2020.04.05.026336 | id: doi:10.1186/s12915-020-00940-y |
id: doi:10.15346/hc.v6i1.105 | id: doi:10.15346/hc.v6i1.4 |
id: doi:10.1101/289371 | id: doi:10.1038/s41746-018-0052-2 |
id: doi:10.1101/038083 | id: doi:10.5334/cstp.56 |
id: doi:10.1101/032144 | id: doi:10.1093/database/baw015 |
id: doi:10.1101/008383 | id: doi:10.1093/bioinformatics/btv061 |
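A comparison like the one in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes both versions of the plugin return IDs for the same sources in the same order; `diff_ids` is a hypothetical helper, and the `doi:10.0000/unchanged` entry is a made-up placeholder for a source on which both versions agree.

```python
def diff_ids(old_ids, new_ids):
    """Return (old, new) pairs where the two ID lists disagree at the same position."""
    # Assumes both lists cover the same sources in the same order.
    return [(old, new) for old, new in zip(old_ids, new_ids) if old != new]


old_ids = [
    "doi:10.1101/289371",
    "doi:10.0000/unchanged",  # hypothetical placeholder, identical in both lists
    "doi:10.1101/038083",
]
new_ids = [
    "doi:10.1038/s41746-018-0052-2",
    "doi:10.0000/unchanged",
    "doi:10.5334/cstp.56",
]

changed = diff_ids(old_ids, new_ids)
for old, new in changed:
    print(f"{old} -> {new}")
```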
The list of IDs retrieved for Casey's ORCID profile was exactly the same between both versions of the plugin code. So why does this affect me but not Casey? At least in a couple cases I looked at, Casey has separate entries for the journal paper and the biorxiv preprint. For example:
You can also see these as separate entries on https://greenelab.com/members/casey-greene.html. Of course, it's possible and reasonable that Casey (and others) prefer to have the preprints and the journal articles as separate entries.
So to wrap up, I'm happy to create a PR if you think this change is useful to make to the template (or feel free to just make the commit yourself). Or, I'm happy to leave this as a local change on my instance and close this issue, and at least the solution is here to find if anyone needs it in the future. Your call, Vince!
$ git diff _cite/plugins/orcid.py
diff --git a/_cite/plugins/orcid.py b/_cite/plugins/orcid.py
index 0017145..333821b 100644
--- a/_cite/plugins/orcid.py
+++ b/_cite/plugins/orcid.py
@@ -35,13 +35,14 @@ def main(entry):
# go through response structure and pull out ids e.g. doi:1234/56789
for work in response:
# get list of ids
- ids = get_safe(work, "external-ids.external-id", [])
+ ids = []
for summary in get_safe(work, "work-summary", []):
ids = ids + get_safe(summary, "external-ids.external-id", [])
+ ids = ids + get_safe(work, "external-ids.external-id", [])
# prefer doi id type, or fallback to first id
_id = next(
- (id for id in ids if get_safe(id, "external-id-type", "") == "doi"),
+ (id for id in ids if get_safe(id, "external-id-type", "") == "doi" and get_safe(id, "external-id-relationship", "") == "self"),
ids[0] if len(ids) > 0 else {},
)
(Commit in https://github.com/andrewsu/sulab.org/commit/a0c70014c89dad40afc511366ec017abed1b69c4)
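For anyone who wants to try the patched selection logic outside the template, here is a self-contained sketch. Assumptions: `get_safe` below is a simplified stand-in for the template's helper (its real implementation is not shown in this thread), `pick_id` and `make_id` are hypothetical names, and the sample `work` dict is a hand-built approximation of one group in the response, including the `external-id-value` key.

```python
def get_safe(item, path, default=None):
    """Simplified stand-in: follow a dotted path through nested dicts."""
    for part in path.split("."):
        if not isinstance(item, dict) or part not in item:
            return default
        item = item[part]
    return default if item is None else item


def pick_id(work):
    """Work-summary IDs first, then top-level IDs; prefer a DOI whose relationship is "self"."""
    ids = []
    for summary in get_safe(work, "work-summary", []):
        ids = ids + get_safe(summary, "external-ids.external-id", [])
    ids = ids + get_safe(work, "external-ids.external-id", [])
    return next(
        (
            _id
            for _id in ids
            if get_safe(_id, "external-id-type", "") == "doi"
            and get_safe(_id, "external-id-relationship", "") == "self"
        ),
        ids[0] if len(ids) > 0 else {},  # fallback: first ID of any kind
    )


def make_id(value, relationship="self"):
    """Build a DOI entry in the assumed external-id layout."""
    return {
        "external-id-type": "doi",
        "external-id-value": value,
        "external-id-relationship": relationship,
    }


preprint = make_id("10.1101/2021.04.15.440028")
journal = make_id("10.1093/bioinformatics/btac205")

# Top-level section lists the preprint first; the work summaries list the journal first.
work = {
    "external-ids": {"external-id": [preprint, journal]},
    "work-summary": [
        {"external-ids": {"external-id": [journal]}},
        {"external-ids": {"external-id": [preprint]}},
    ],
}

chosen = pick_id(work)
print(chosen["external-id-value"])
```

Note that when no DOI with a "self" relationship exists, the fallback still returns the first ID regardless of its relationship, which matches the diff as written.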
Thanks for your detailed research and comments on this.
I think this is worth keeping open and investigating more. I'd definitely like to get an authoritative answer on which field to use. I will sit down and look through this next week.
But for now yes please use your modification for your case.
I try to go into the publications list and change the preprint entry to a publication entry when it fails, but I am not always successful in remembering to do this.
@cgreene Do you have any contacts at ORCID that we could pull into this issue?
I don't think so - sorry!
Just getting to this now. Still working on it. Some notes:
`/works` endpoint (very insufficient). Many fields missing or uncommented.
I've asked to join that forum Google Group. Once I get access, I'll post a question about whether the works summary always should be preferred over the top level of the response. Intuitively, I would've expected the top level to be the "main" thing to look at, but 🤷. I don't feel the chances of getting a definitive or timely answer from the forums are high, so I may just make this change and hope that your case is representative. I'll also do my own check of Casey's corpus to make sure nothing gets screwed up there.
With this change I'm making in v1.2.2, I think I don't need to prioritize `doi` ID types anymore, and I can just look at whichever id is first. This will make it use Manubot in all but the most obscure cases. Let me know if you have thoughts on this. This might change some of your citations, as more of them will be generated via Manubot vs. ORCID. If you ever wanted to always keep the citation details ORCID returns, I believe it'd be as simple as removing the `if id_type not in list...` condition.
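The "whichever id is first" idea might look something like the sketch below. This is not the actual v1.2.2 code: `first_listed_id` and `as_citation_id` are hypothetical names, the dict layout (including `external-id-value`) is an assumed approximation of the response, the PMID value is a placeholder, and skipping work summaries that have no IDs is my own assumption.

```python
def first_listed_id(work):
    """Return the first external ID of the first work summary that has any, else {}."""
    for summary in work.get("work-summary") or []:
        ids = (summary.get("external-ids") or {}).get("external-id") or []
        if ids:
            return ids[0]
    return {}


def as_citation_id(external_id):
    """Format an ID as "type:value", or "" when either part is missing."""
    id_type = external_id.get("external-id-type", "")
    value = external_id.get("external-id-value", "")
    return f"{id_type}:{value}" if id_type and value else ""


# Hand-built sample: the first work summary lists a (placeholder) PMID before a DOI.
work = {
    "work-summary": [
        {
            "external-ids": {
                "external-id": [
                    {"external-id-type": "pmid", "external-id-value": "00000000"},
                    {"external-id-type": "doi", "external-id-value": "10.1093/bioinformatics/btac205"},
                ]
            }
        },
        {
            "external-ids": {
                "external-id": [
                    {"external-id-type": "doi", "external-id-value": "10.1101/2021.04.15.440028"},
                ]
            }
        },
    ]
}

citation_id = as_citation_id(first_listed_id(work))
print(citation_id)
```

With no DOI preference, the PMID wins here simply because it is listed first, which is the behavior change being discussed.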
I also added a filter on external-id-relationship, requiring that to be "self".
~I just wish I could get information on what this field really means, beyond "external-id-relationship indicates the relationship between the item and the identifier". I may add this... I'll test it with Casey's corpus to see if it changes things too much. Do you have more info on this? Were you seeing "version of" for preprints, for example?~
Nevermind, it's explained here.
With this change I'm making in v1.2.2, I think I don't need to prioritize doi ID types anymore, and I can just look at whichever id is first. This will make it use Manubot in all but the most obscure cases (https://github.com/manubot/manubot/issues/365). Let me know if you have thoughts on this. This might change some of your citations, as more of them will be generated via Manubot vs. ORCID. If you ever wanted to always keep the citation details ORCID returns, I believe it'd be as simple as removing the if id_type not in list... condition.
I don't think this change will affect me much at all, since 98% of my citations use a `doi` prefix. (Much of the work in cleaning up my ORCID profile is merging non-doi IDs into the doi entries, which I consider to be the canonical ID.) If anything, I think it will help, because I'm guessing Manubot will not have ambiguous dates (between `created-date` and `last-modified-date`, for example), fixing the issue I raised in #260.
Here is a spreadsheet of a test I just did between the changes in v1.2.2 vs. the current implementation in v1.2.1, for both Andrew and Casey, with differences between old/new highlighted in red. I'm not seeing any differences for Casey, and quite a few for Andrew. I'm assuming that's because he seems to curate his preferred IDs in his ORCID, and many of them are not DOIs.
Not sure what to make of this yet. The changes are "live" on the v1.2.2 branch, but need to think over this more.
Would you consider making this change as well? https://github.com/SuLab/sulab.org/commit/a0c70014c89dad40afc511366ec017abed1b69c4
I could be turned around here, but I think that would change it back so the vast majority of my IDs are DOIs, and it would not affect Casey's profile much or at all.
Wouldn't this make it so that the "preferred source" might not be selected? Like if you set your preferred to a PMID, will ORCID still fill in a DOI for the same source...
I think prioritizing the order over citation type might be the more correct thing to do, even if it changes things. I neglected to record the full citation details in that spreadsheet, to see if Manubot still gave the same details between a doi and equivalent other id. I'll run the test again later and compare.
Wouldn't this make it so that the "preferred source" might not be selected? Like if you set your preferred to a PMID, will ORCID still fill in a DOI for the same source...
No, I don't think that's the case. I previously set all of my ORCID pubs so that the journal DOI was used for the "preferred source". I just flipped one entitled "Metaproteomics of colonic microbiota unveils discrete protein functions among colitic mice and control groups" so that the biorxiv DOI is preferred. So in my ORCID API XML, it is that work (and only that work) that has the biorxiv DOI (with prefix `doi:10.1101`) listed first in the `work_summary` section...
Okay Andrew, I've set up the following comparison with your ORCID.
Go to https://www.jsondiff.com/, open the browser dev console and paste this code:
It should print two lists of citations, old (v1.2.1) vs. new (v1.2.2). If you're using Chrome, you can right click them to copy to clipboard and paste them into the diff tool.
Looking over the diff... the IDs are different of course, but the dates are also different. The titles and author names seem to be different but equivalent, just spot checking them. Here's a preview of the differences:
I'm not sure what to make of all of this. I feel like I want to change the behavior to not prefer DOIs, and instead prefer the first work summary, which seems to reflect the author's preferences the best. We only have Andrew's data point for this, but in absence of any documentation or support, it's the best we can do. I didn't analyze Casey's citation differences because his IDs didn't change; they're all basically DOIs in new and old.
@cgreene Do you have opinions on what the picking algo should be here? Summary: Currently, the orcid template plugin prefers DOI citations over any other type. (Don't know why I did this originally... Consistency? DOIs inherently "better"? Manubot could cite them "better"?) Now, just based on Andrew's case and no documentation, it seems like just picking the first ID from the ORCID response reflects what the author has selected as their "preferred source" on orcid.org. Not sure how this affects people who haven't set any preferred source.
Checks
Link to your website repo
https://github.com/andrewsu/sulab.org
Version of Lab Website Template you are using
1.2.1
Description
TLDR: I think `orcid.py` should be updated to get the `external-id` from the first `work-summary` entry, not from the first doi in `external-ids.external-id`.
Observed/desired behavior
On https://andrewsu.github.io/sulab.org/members/andrew-su.html, I see the following citation
The citation points to the biorxiv preprint (note that CSHL is the publisher/journal). However, in my ORCID profile, the preprint is grouped with the published article at Bioinformatics, and it is that version of the article that I've listed as the "Preferred source".
Therefore, I would expect that the citation on my profile page would list "Bioinformatics" as the journal/publisher and the link would go to the journal website, not biorxiv.
My diagnosis
When `orcid.py` retrieves my ORCID profile from https://pub.orcid.org/v3.0/0000-0002-9859-4104/works, this is the XML for the article above:
XML snippet for article above
Currently, the code gets the list of IDs from the `external-ids.external-id` section, and I believe it takes the first entry with an `external-id-type` of `doi` -- in this case, `10.1101/2021.04.15.440028`. Unfortunately, this section does not appear to give any indication of the "preferred source". However, it appears that the `work-summary` entries are ordered so that the "preferred source" is listed first -- `10.1093/bioinformatics/btac205` is first.
I checked this behavior against another entry in my profile where the biorxiv article is listed as primary. See details below.
ORCID screenshot and XML snippet where biorxiv is listed as "preferred source"
![image](https://github.com/greenelab/lab-website-template/assets/2635409/305a91b7-e58d-46dd-af32-2b2eadbd6b26)
Based on all this, I think `orcid.py` should be updated to get the `external-id` from the first `work-summary` entry, not from the first doi in `external-ids.external-id`.