sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.

Try to characterize publication_type in pub_hash better for Pubmed records #1090

Open peetucket opened 5 years ago

peetucket commented 5 years ago

Currently, for Pubmed source records, we set all publication types to article (see …). The Pubmed source records may carry information about the type that could be used to set it more accurately to one of the following supported types:

- inproceedings
- book
- article

For example, the Pubmed source XML has a node called PublicationType that looks like this:

                <PublicationType UI="D016428">Journal Article</PublicationType>

which suggests it may hold publication type.

e.g. in prod, see the output of: puts PubmedSourceRecord.find_by(pmid: 27397405).source_data
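
A record can list more than one PublicationType node, so it may help to collect all of them rather than only the first. The following is a minimal sketch, not the project's code: it uses REXML from Ruby's bundled gems in place of Nokogiri so it runs without extra dependencies, the helper name pubmed_publication_types is hypothetical, and the sample XML is trimmed to just the path discussed in this issue.

```ruby
require 'rexml/document'

# Illustrative sample only; real source_data holds a full Pubmed record.
SAMPLE_XML = <<~XML
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <PublicationTypeList>
          <PublicationType UI="D016428">Journal Article</PublicationType>
          <PublicationType UI="D003160">Comparative Study</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
XML

PUBLICATION_TYPE_XPATH =
  '//PubmedArticle/MedlineCitation/Article/PublicationTypeList/PublicationType'

# Hypothetical helper: returns every PublicationType string in the record,
# or an empty array when the record has no such node.
def pubmed_publication_types(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, PUBLICATION_TYPE_XPATH).map { |node| node.text.to_s.strip }
end

puts pubmed_publication_types(SAMPLE_XML).inspect
```

Returning an array (possibly empty) avoids the nil error that indexing with [0] raises when the node is missing.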

peetucket commented 5 years ago

Current values in Pubmed source records (counting only the first PublicationType node in each record):

total = PubmedSourceRecord.count
n = 0
pub_types = Hash.new(0) # tally of first PublicationType value => count
PubmedSourceRecord.find_each do |pmsr|
  n += 1
  pub_doc = Nokogiri::XML(pmsr.source_data)
  begin
    # only the first PublicationType node is counted
    article_type = pub_doc.xpath('//PubmedArticle/MedlineCitation/Article/PublicationTypeList/PublicationType')[0].children[0].text
  rescue StandardError
    # the node (or its text) is missing from this record
    article_type = "NODE_NOT_FOUND"
  end
  pub_types[article_type] += 1
  puts "#{n} of #{total} : #{article_type}"
end
puts total
=> 423676
puts pub_types.sort_by { |_key, value| -value }.to_h
=> {"Journal Article"=>333665,
 "Comparative Study"=>23977,
 "Case Reports"=>19433,
 "Clinical Trial"=>8464,
 "English Abstract"=>3698,
 "Evaluation Studies"=>3346,
 "In Vitro"=>2361,
 "Historical Article"=>982,
 "Clinical Trial, Phase II"=>928,
 "Clinical Trial, Phase III"=>707,
 "Clinical Trial, Phase I"=>651,
 "Consensus Development Conference"=>473,
 "Published Erratum"=>416,
 "Controlled Clinical Trial"=>287,
 "Introductory Journal Article"=>237,
 "Clinical Study"=>46,
 "Clinical Trial, Phase IV"=>41,
 "Consensus Development Conference, NIH"=>38,
 "Retraction of Publication"=>30,
 "Clinical Conference"=>24,
 "Newspaper Article"=>22,
 "Corrected and Republished Article"=>21,
 "Research Support, Non-U.S. Gov't"=>16,
 "Classical Article"=>15,
 "Clinical Trial, Veterinary"=>13,
 "Research Support, N.I.H., Extramural"=>11,
 "Duplicate Publication"=>11,
 "Interactive Tutorial"=>11,
 "Patient Education Handout"=>11,
 "Legal Case"=>11,
 "Clinical Trial Protocol"=>8,
 "Research Support, U.S. Gov't, P.H.S."=>7,
 "Equivalence Trial"=>6,
 "Personal Narrative"=>5,
 "Systematic Review"=>4,
 "Practice Guideline"=>3,
 "Research Support, U.S. Gov't, Non-P.H.S."=>3,
 "Technical Report"=>2,
 "Legal Cases"=>1,
 "Video-Audio Media"=>1,
 "Adaptive Clinical Trial"=>1}

peetucket commented 5 years ago

Pubmed Documented publication types:

It's unclear from this controlled vocabulary, though, what we would map conference proceedings and books to.
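
One possible shape for such a mapping is a lookup table that falls back to article whenever no better match is known. This is a sketch under stated assumptions, not a proposal for the final mapping: the constant and method names are hypothetical, and the conference entries are guesses chosen only to illustrate the mechanism. Nothing in the Pubmed vocabulary has been confirmed here as equivalent to inproceedings or book.

```ruby
# Hypothetical mapping from Pubmed PublicationType strings to the
# publication types listed in this issue (article, book, inproceedings).
# ASSUMPTION: the conference entries below are illustrative guesses.
PUBMED_TYPE_MAP = {
  'Journal Article' => 'article',
  'Clinical Conference' => 'inproceedings',
  'Consensus Development Conference' => 'inproceedings',
  'Consensus Development Conference, NIH' => 'inproceedings'
}.freeze

DEFAULT_TYPE = 'article' # matches the current behavior for Pubmed records

# Hypothetical helper: takes the PublicationType strings from one record
# and returns a single supported type, preferring any non-article match.
def sul_pub_type_for(pubmed_types)
  mapped = Array(pubmed_types).map { |type| PUBMED_TYPE_MAP[type] }.compact
  mapped.find { |type| type != DEFAULT_TYPE } || DEFAULT_TYPE
end

puts sul_pub_type_for(['Journal Article'])
puts sul_pub_type_for(['Journal Article', 'Clinical Conference'])
puts sul_pub_type_for(['Case Reports'])
```

Keeping article as the fallback means unmapped values such as "Case Reports" behave exactly as they do today, so the table can be extended one entry at a time.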