sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Try to characterize publication_type in pub_hash better for Pubmed records #1090

Open peetucket opened 5 years ago

peetucket commented 5 years ago

Currently, for Pubmed source records, we set every publication type to article (see https://github.com/sul-dlss/sul_pub/blob/master/app/models/pubmed_source_record.rb#L140). The Pubmed source record may contain information about the type that could be used to set it more accurately to one of the following supported types:

- inproceedings
- book
- article

For example, the Pubmed source XML has a node called PublicationTypeList that looks like this:

    <PublicationTypeList>
        <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList>

which suggests it may hold the publication type.

For example, in prod, run:

    puts PubmedSourceRecord.find_by(pmid: 27397405).source_data
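
A record can carry more than one PublicationType node, so it may help to collect all of them before deciding on a type. The following is a minimal sketch, not code from sul_pub: it uses REXML in place of Nokogiri so it runs without extra gems, and the `publication_types` helper and the trimmed `SAMPLE_XML` are hypothetical names introduced here for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample, trimmed to the relevant nodes.
# Real source_data holds a full PubmedArticle record.
SAMPLE_XML = <<~XML
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <PublicationTypeList>
          <PublicationType UI="D016428">Journal Article</PublicationType>
          <PublicationType UI="D016454">Review</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
XML

# Returns every PublicationType value in document order.
# Returns an empty array when the record has no PublicationTypeList.
def publication_types(xml)
  doc = REXML::Document.new(xml)
  nodes = REXML::XPath.match(doc, '//PublicationTypeList/PublicationType')
  nodes.map { |node| node.text.to_s.strip }
end

puts publication_types(SAMPLE_XML).inspect
```

In the Rails console the same idea would presumably be applied to `source_data` of a PubmedSourceRecord, using whichever XML parser the application already relies on.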

peetucket commented 5 years ago

Current values in pubmed source records:

total = PubmedSourceRecord.count
pub_types = Hash.new(0) # first PublicationType value => number of records
n = 0
type_xpath = '//PubmedArticle/MedlineCitation/Article/PublicationTypeList/PublicationType'
PubmedSourceRecord.find_each do |pmsr|
  n += 1
  pub_doc = Nokogiri::XML(pmsr.source_data)
  # Only the first PublicationType node is tallied; records without one count as NODE_NOT_FOUND.
  article_type = pub_doc.at_xpath(type_xpath)&.text || "NODE_NOT_FOUND"
  pub_types[article_type] += 1
  puts "#{n} of #{total} : #{article_type}"
end; nil
total
=> 423676
pub_types.sort_by { |_key, value| -value }.to_h
=> {"Journal Article"=>333665,
 "Comparative Study"=>23977,
 "Case Reports"=>19433,
 "Clinical Trial"=>8464,
 "JOURNAL ARTICLE"=>5703,
 "Comment"=>5322,
 "Letter"=>4609,
 "Editorial"=>4526,
 "English Abstract"=>3698,
 "Evaluation Studies"=>3346,
 "In Vitro"=>2361,
 "Historical Article"=>982,
 "Clinical Trial, Phase II"=>928,
 "Clinical Trial, Phase III"=>707,
 "Biography"=>690,
 "Clinical Trial, Phase I"=>651,
 "NODE_NOT_FOUND"=>612,
 "Consensus Development Conference"=>473,
 "News"=>468,
 "Published Erratum"=>416,
 "Congresses"=>375,
 "Review"=>325,
 "Controlled Clinical Trial"=>287,
 "Guideline"=>277,
 "Introductory Journal Article"=>237,
 "Congress"=>176,
 "Interview"=>143,
 "REVIEW"=>135,
 "LETTER"=>46,
 "Clinical Study"=>46,
 "Autobiography"=>45,
 "Clinical Trial, Phase IV"=>41,
 "Lectures"=>38,
 "Addresses"=>38,
 "Consensus Development Conference, NIH"=>38,
 "Bibliography"=>35,
 "Address"=>33,
 "EDITORIAL"=>32,
 "Retraction of Publication"=>30,
 "Clinical Conference"=>24,
 "Dataset"=>23,
 "Newspaper Article"=>22,
 "Corrected and Republished Article"=>21,
 "Research Support, Non-U.S. Gov't"=>16,
 "Classical Article"=>15,
 "Lecture"=>15,
 "Clinical Trial, Veterinary"=>13,
 "Research Support, N.I.H., Extramural"=>11,
 "Duplicate Publication"=>11,
 "Interactive Tutorial"=>11,
 "Patient Education Handout"=>11,
 "Legal Case"=>11,
 "Directory"=>9,
 "Clinical Trial Protocol"=>8,
 "Research Support, U.S. Gov't, P.H.S."=>7,
 "Equivalence Trial"=>6,
 "Personal Narrative"=>5,
 "Systematic Review"=>4,
 "Practice Guideline"=>3,
 "Research Support, U.S. Gov't, Non-P.H.S."=>3,
 "Legislation"=>3,
 "Festschrift"=>2,
 "Meta-Analysis"=>2,
 "Dictionary"=>2,
 "Overall"=>2,
 "PUBLISHED ERRATUM"=>2,
 "Technical Report"=>2,
 "Legal Cases"=>1,
 "CASE REPORTS"=>1,
 "Video-Audio Media"=>1,
 "Adaptive Clinical Trial"=>1}
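
Several values in this tally differ only by case (for example "Journal Article" and "JOURNAL ARTICLE"), so any mapping would probably want to normalize case first. Here is a minimal sketch of merging such variants, using a small subset of the counts above; `merge_case_variants` is a hypothetical helper, not something that exists in sul_pub.

```ruby
# Subset of the tallies above, copied verbatim for illustration.
counts = {
  "Journal Article" => 333665,
  "JOURNAL ARTICLE" => 5703,
  "Review" => 325,
  "REVIEW" => 135,
  "Congresses" => 375,
  "Congress" => 176
}

# Merges keys that differ only by case and sums their counts.
# The first spelling seen for each group is kept as the label.
def merge_case_variants(counts)
  merged = Hash.new(0)
  labels = {}
  counts.each do |label, count|
    key = label.downcase
    labels[key] ||= label
    merged[labels[key]] += count
  end
  merged
end

merged = merge_case_variants(counts)
puts merged.inspect
```

Note that this only folds case; singular and plural forms such as "Congress" and "Congresses" stay separate and would need an explicit decision.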

peetucket commented 5 years ago

Pubmed Documented publication types:

It's unclear from this controlled vocabulary, though, which values we would map to conference proceedings (inproceedings) and books.
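
To make the open question concrete, here is a rough sketch of what a mapping with a fallback could look like. Everything in it is an assumption: the `sul_pub_type` helper is hypothetical, and the list of values treated as inproceedings is only a guess based on the names seen in the tally, not a confirmed mapping. No observed value obviously indicates a book, so the sketch never returns book.

```ruby
# HYPOTHETICAL guesses only: which Pubmed values (if any) really mean
# "conference proceedings" is exactly what this issue leaves open.
INPROCEEDINGS_GUESSES = ['congress', 'congresses', 'clinical conference'].freeze

# Matches the current behavior of treating everything as an article.
DEFAULT_TYPE = 'article'

# Takes all PublicationType values of one record and returns a supported type.
# Comparison is case-insensitive because the source data mixes cases.
def sul_pub_type(pubmed_types)
  normalized = pubmed_types.map { |type| type.to_s.strip.downcase }
  return 'inproceedings' if normalized.any? { |type| INPROCEEDINGS_GUESSES.include?(type) }

  DEFAULT_TYPE
end

puts sul_pub_type(['Journal Article'])
puts sul_pub_type(['Journal Article', 'CONGRESSES'])
```

Defaulting to article keeps existing records unchanged unless a value on the guessed list is present, which seems like the least surprising behavior while the mapping is undecided.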