sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Publication page numbers not correct for old sciencewire records #1194

Closed peetucket closed 3 years ago

peetucket commented 3 years ago

Came in via a support ticket:

Professor Stanley N. Cohen has a number of his publications that have a ‘?’ in the end page number. His profiles id is 4481.

peetucket commented 3 years ago

I investigated this and found ~25 publications that had this ambiguous page range. All of the records had a provenance of Sciencewire, so they were harvested prior to the switch to the new Web of Science API. Even though we don't have access to the Sciencewire API anymore (it's gone), we retained the source records in XML format, and I confirmed that they have the ambiguous page range in them. I also looked through our code and don't see any parsing issues, so I'm confident it's a problem with the source data. Even though most of those records had a PMID, we only use that for supplemental data (like the abstract) and not for automated data corrections, which would explain why the page ranges weren't fixed.

For Dr Cohen, I went ahead and did some manual fixes to the page ranges using the Pubmed data, so his publications should now appear corrected. We will have to wait for the scheduled updates so the updated citations are pulled to your end too. Most of his publications had a full page range listed in Pubmed, but some had none or just a starting page. I just adjusted the pages on Dr Cohen's publications to whatever was in Pubmed. There are of course likely to be other Sciencewire records like this, but it would be rather onerous to start tracking down all of these errors. These particular publications should also be fixed for other authors (if they are co-authors on the same publication), since we only have one publication record even if there are multiple co-authors at Stanford.

peetucket commented 3 years ago

How I fixed them

cap_profile_id = '4481'
author = Author.where(cap_profile_id: cap_profile_id).first

# search for publications with a question mark in the page ranges
author.publications.where('pages like "%-?"').each {|p| puts p.pub_hash[:provenance]}
=> sciencewire
=> sciencewire ....
# this showed all were sciencewire

# grab the pmids for these pubs
pmids = author.publications.where('pages like "%-?"').map {|p| p.pmid }

# grab page ranges from pubmed records
results = {}
pmids.each do |pmid|
  source_data = PubmedSourceRecord.find_by(pmid: pmid).source_data
  # keep only a numeric "start-end" range from <MedlinePgn>; anything else yields ""
  pages = /(<MedlinePgn>)\d+-\d+(<\/MedlinePgn>)/.match(source_data).to_s.gsub('<MedlinePgn>','').gsub('</MedlinePgn>','')
  results[pmid] = pages
end

# here they are!
 results
=> {14187629=>"805-6",
 14211649=>"511-4",
 14342342=>"3123-31",
 18619370=>"397-406",
 4559594=>"2110-4",
 4580676=>"235-55",
 4866369=>"113-22",
 4902321=>"1273-7",
 4916548=>"557-75",
 4920496=>"671-87",
 4943172=>"510-6",
 5324393=>"521-7",
 5340634=>"1759-66",
 5341412=>"19-38",
 5559866=>"635-9",
 5719216=>"387-406",
 17880219=>"12368-9",
 16902919=>"694-7",
 20232871=>"4560-1",
 20507095=>"8232-3",
 19886623=>"16675-7",
 26310922=>"",
 4600381=>""}

# fix all of the publications now
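results.each do |pmid, pages|
  next if pages.empty? # nothing usable in Pubmed, leave the record alone
  pub = author.publications.find_by(pmid: pmid)
  pub.pub_hash[:pages] = pages
  # also set the nested journal pages when the pub has a journal entry
  pub.pub_hash[:journal][:pages] = pages if pub.pub_hash[:journal]
  pub.update_formatted_citations
  pub.save
end;nil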
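
The regex above only matches a numeric `start-end` range, which is presumably why two of the PMIDs came back empty. As a minimal sketch (not what was actually run, and not part of sul_pub), a helper along these lines would return whatever text is in `<MedlinePgn>`, including a lone starting page. The name `medline_pages` and the XML fragments are hypothetical and made up for illustration.

```ruby
# Hypothetical helper, not part of sul_pub: returns the text inside the first
# <MedlinePgn> element, or nil when the element is absent or blank.
def medline_pages(xml)
  match = %r{<MedlinePgn>\s*([^<]*?)\s*</MedlinePgn>}.match(xml.to_s)
  return nil if match.nil? || match[1].empty?

  match[1]
end

# Made-up XML fragments, for illustration only.
full_range = '<Pagination><MedlinePgn>805-6</MedlinePgn></Pagination>'
start_only = '<Pagination><MedlinePgn>397</MedlinePgn></Pagination>'
missing    = '<Pagination></Pagination>'

p medline_pages(full_range) # => "805-6"
p medline_pages(start_only) # => "397"
p medline_pages(missing)    # => nil
```

Returning nil rather than an empty string makes the "nothing in Pubmed" case explicit, so a caller could skip those records with `next if pages.nil?`.
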
peetucket commented 3 years ago

Fixes confirmed for this user.