sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

additional comments around pubmed date searching #1539

Closed peetucket closed 2 years ago

peetucket commented 2 years ago

Why was this change made?

Based on some questions around article publication dates in Pubmed records, I double-checked the search for dates and the order in which we search for them. I believe the current ordering is still valid and correct despite the one example shown below, but I updated the code documentation so that the ordering of links in the code comment reflects the order in which we search for dates, and added a bit more explanatory text.

How was this change tested?

Existing tests

peetucket commented 2 years ago

History from #sul-cap-collab that triggered this investigation (https://stanfordlib.slack.com/archives/C0T8CJD3L/p1660122962480119)

I have a user who added a couple of publications to a profile on behalf of a faculty member.  They expected one to appear first in the reverse chronological order, but instead the other one appears first.  When I look in our database at the First Published Date for the two publications, the order is correct as we have it:
DSIF modulates RNA polymerase II occupancy according to template G + C content.  First Published Date:  01-SEP-22
Chemical interference with DSIF complex formation lowers synthesis of mutant huntingtin gene products and curtails mutant phenotypes.  First Published Date:  09-AUG-22
However, the user states the date on the first article above is July 27, 2022.  When I go to the PubMed link for the Pub, I see both dates in the following:
. 2022 Jul 27;4(3):lqac054. doi: 10.1093/nargab/lqac054. eCollection 2022 Sep.
DSIF modulates RNA polymerase II occupancy according to template G + C content
Can you confirm what each date represents above and if we are receiving the correct First Published Date for this publication, based on the above information on PubMed?
Thank you!

[Justin Littman]
Hello Tina - The September date is the journal issue publication date. The July date is the article date. In selecting a date, the journal issue publication date is prioritized over the article date. (I don't know the reason for this -- I'm just reading the code.)

[Peter Mangiafico] 
Thanks Justin. I had a peek too, and I'll add some more detail. We search multiple places in the Pubmed source data record to find publication dates (see https://github.com/sul-dlss/sul_pub/blob/main/lib/pubmed/map_pub_hash.rb#L250-L263), because I think there is inconsistency in where information is stored in their return data. We search in the specific order shown in that code, and as soon as we find something in a suitable date format, we stop and record it as the publication date. I am also not sure of the reason for the specific order we chose, but my interpretation of the Pubmed documentation is that each location is supposed to represent a publication date (which doesn't mean it always will). See https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/180101/el-PubDate.html and https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/180101/el-JournalIssue.html
For this particular DOI record, I see the July 2022 date in the very next location we would have searched, so in this case it would make sense to invert the search order of the first two locations in our code and re-generate the pub hash data. Whether this makes sense for all future records as well, I am of course not sure, nor am I sure how the order was originally selected. My suspicion is that no matter what we choose, we will always find some edge cases that are wrong, but it may be worth spending a few more minutes reviewing the Pubmed documentation and double-checking whether some other order would work better. I can do this later this week when our all-day meetings are over.
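
The "search in order, stop at the first usable date" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `map_pub_hash.rb`: the hash keys, the `first_valid_date` and `parse_date` helpers, and the ISO date strings are all hypothetical stand-ins for the real PubMed record structure, and the two dates are simply the ones mentioned in this thread.

```ruby
require 'date'

# Hypothetical labels for two places a date might live in a PubMed record.
# These are NOT the real XML paths used by sul_pub.
CANDIDATE_ORDER = %i[journal_issue_pub_date article_date].freeze

# Returns a Date if the value is a well-formed YYYY-MM-DD string, otherwise nil.
def parse_date(value)
  return nil unless value.is_a?(String)

  Date.strptime(value, '%Y-%m-%d')
rescue ArgumentError
  nil
end

# Walks the locations in the given order and returns the first parseable date.
def first_valid_date(record, order)
  order.each do |location|
    date = parse_date(record[location])
    return date if date
  end
  nil
end

# Illustrative record using the two dates discussed in the thread (assumed format).
record = { journal_issue_pub_date: '2022-09-01', article_date: '2022-07-27' }

puts first_valid_date(record, CANDIDATE_ORDER)         # journal issue date wins
puts first_valid_date(record, CANDIDATE_ORDER.reverse) # article date wins
```

With the first ordering the September journal issue date is returned; reversing the first two locations returns the July article date instead, which is the trade-off under discussion. Records missing the first location fall through to the next one in either ordering.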
peetucket commented 2 years ago

See #1540

peetucket commented 2 years ago

Good catch. Created a new PR, removed the offending files, and updated .gitignore (in the new PR) to prevent them from being added by mistake again.