Closed peetucket closed 7 years ago
So far it seems that we are taking two things from PubMed: the mesh_headings and the abstract.
see
# We process the PubMed harvesting in ScienceWireHarvester because the PubMed data
# supplements the ScienceWire data -- essentially we combine `SciencewireSourceRecord`
# and `PubmedSourceRecord` into the `Publication.pub_hash`. The `SciencewireSourceRecord`
# data include a PubMed ID (`pmid`) so that we can link the two records.
def process_queued_pubmed_records
  return if @records_queued_for_pubmed_retrieval.empty?
  begin
    pubmed_source_record = PubmedSourceRecord.new
    pub_med_records = @pubmed_client.fetch_records_for_pmid_list(@records_queued_for_pubmed_retrieval.keys)
    Nokogiri::XML(pub_med_records).xpath('//PubmedArticle').each do |pub_doc|
      pmid = pub_doc.xpath('MedlineCitation/PMID').text
      pubmed_source_record = PubmedSourceRecord.create_pubmed_source_record(pmid, pub_doc)
      @total_new_pubmed_source_count += 1 if pubmed_source_record
      pub_hash = @records_queued_for_pubmed_retrieval[pmid][:sw_hash]
      author_ids = @records_queued_for_pubmed_retrieval[pmid][:authors]
      pub = create_new_harvested_pub(pub_hash[:sw_id], pmid)
      abstract = pubmed_source_record.extract_abstract_from_pubmed_record(pub_doc)
      mesh = pubmed_source_record.extract_mesh_headings_from_pubmed_record(pub_doc)
      pub_hash[:mesh_headings] = mesh unless mesh.blank?
      pub_hash[:abstract] = abstract unless abstract.blank?
      create_contribs_for_author_ids_and_pub(author_ids, pub)
      pub.pub_hash = pub_hash
      pub.sync_publication_hash_and_db
      pub.save
    end
  rescue => e
    NotificationManager.error(e, 'PubMed harvesting failed', self)
  end
  @records_queued_for_pubmed_retrieval.clear
end
in ScienceWireHarvester.
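For reference, the merge step in that method boils down to copying two optional fields onto the ScienceWire hash. Here is a minimal sketch in plain Ruby, assuming symbol keys as in the quoted code. `merge_pubmed_fields` and `blank_value?` are hypothetical names (the latter stands in for ActiveSupport's `blank?` so the snippet runs without Rails), and the sample hash is made up:

```ruby
# Hypothetical stand-in for ActiveSupport's `blank?`: nil, whitespace-only
# strings, and empty collections all count as blank.
def blank_value?(value)
  return true if value.nil?
  return value.strip.empty? if value.is_a?(String)

  value.respond_to?(:empty?) ? value.empty? : false
end

# Hypothetical helper mirroring the two `unless ... blank?` lines above:
# only non-blank PubMed values are written into a copy of the pub_hash.
def merge_pubmed_fields(pub_hash, abstract:, mesh:)
  merged = pub_hash.dup
  merged[:mesh_headings] = mesh unless blank_value?(mesh)
  merged[:abstract] = abstract unless blank_value?(abstract)
  merged
end

# Made-up ScienceWire hash; the empty mesh list is skipped, the abstract is kept.
sw_hash = { sw_id: '123', title: 'Example title' }
merged = merge_pubmed_fields(sw_hash, abstract: 'Some abstract.', mesh: [])
p merged.keys
```

Everything else in the pub_hash comes from the ScienceWire record; as far as this method goes, PubMed only ever adds or overwrites those two keys.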
Path is:
1 - sw.rake
  1.1 - task :fortnightly_harvest -> harvester.harvest_pubs_for_all_authors(starting_author_id, ending_author_id)
2 - ScienceWireHarvester
  2.1 - def harvest_pubs_for_all_authors(starting_author_id, ending_author_id = -1) -> harvest_pubs_for_authors -> process_queued_pubmed_records
This does not account for the harvesting triggered by a user input from the GUI.
Also found this method, which is very aptly named "add_any_pubmed_data_to_hash":
https://github.com/sul-dlss/sul_pub/blob/master/app/models/publication.rb#L298-L307
This shows we are adding:
Looking at the example MEDLINE:24551397 record in https://github.com/sul-dlss/sul_pub/issues/264#issuecomment-335582833, it does contain <MeshHeadingList> and an abstract:
<abstracts count="1">
<abstract>
<abstract_text>
<p>The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology. </p>
</abstract_text>
</abstract>
</abstracts>
The identifiers it contains are the UID noted above and:
<OtherID Source="NLM">PMC3900134</OtherID>
<identifiers>
<identifier type="eissn" value="1942-597X"/>
<identifier type="pmid" value="MEDLINE:24551397"/>
</identifiers>
Is the <OtherID Source="NLM">PMC3900134</OtherID> the PMCID?
that definitely looks like the PMCID...
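A quick way to check this across more records is to pull the identifiers out programmatically. This is a sketch only, using REXML so it runs without Nokogiri. The wrapping `<record>` element and the helper names `extract_pmcid` and `extract_pmid` are made up for illustration, and it assumes a PMCID always has the form `PMC` followed by digits (other NLM identifiers may share the same element):

```ruby
require 'rexml/document'

# The <record> wrapper is hypothetical so the fragment parses on its own;
# the <OtherID> and <identifiers> elements are copied from the record above.
xml = <<~XML
  <record>
    <OtherID Source="NLM">PMC3900134</OtherID>
    <identifiers>
      <identifier type="eissn" value="1942-597X"/>
      <identifier type="pmid" value="MEDLINE:24551397"/>
    </identifiers>
  </record>
XML

# Hypothetical helper: first NLM OtherID that looks like a PMCID, or nil.
def extract_pmcid(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//OtherID[@Source='NLM']")
              .map { |el| el.text.to_s.strip }
              .find { |text| text.match?(/\APMC\d+\z/) }
end

# Hypothetical helper: the pmid identifier with its "MEDLINE:" prefix removed.
def extract_pmid(xml)
  doc = REXML::Document.new(xml)
  el = REXML::XPath.first(doc, "//identifier[@type='pmid']")
  el && el.attributes['value'].to_s.sub(/\AMEDLINE:/, '')
end

puts extract_pmcid(xml)
puts extract_pmid(xml)
```

Filtering on the `PMC` prefix rather than taking the first `OtherID` is deliberate, so that a different kind of NLM identifier in the same element would not be mistaken for a PMCID.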
So the MEDLINE-only records have this data. Another question is what the WoS records that were merged with a MEDLINE record contain, because those are the records that we currently go to PubMed to supplement. We'd like to analyze some records that we know are merged on the Web of Science side (see #257), but I haven't heard back from Rob yet. I'll ping him again to ask for examples.
We should identify in the code which metadata fields from the PubMed record are being merged into the pub_hash for records with a PMID.
This is helping us decide whether the new WoS records carry enough data, i.e. whether we get enough from the WoS API that we do not need to supplement from a PMID call.
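One way to frame that decision, once we have sample merged records, is to check each WoS-derived hash for the fields the harvester currently takes from PubMed. A rough sketch, assuming the WoS data ends up in a pub_hash-like Ruby hash with the same symbol keys; `fields_needing_pubmed` is a hypothetical helper and the sample records are invented, not real WoS output:

```ruby
# Fields the harvester code quoted earlier copies from PubMed into pub_hash.
PUBMED_SUPPLIED_FIELDS = %i[mesh_headings abstract].freeze

# Hypothetical helper: list the PubMed-supplied fields a WoS-derived hash lacks
# (absent, nil, or empty).
def fields_needing_pubmed(wos_hash)
  PUBMED_SUPPLIED_FIELDS.select do |field|
    value = wos_hash[field]
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

# Invented sample records for illustration only.
records = [
  { pmid: '1', abstract: 'Text', mesh_headings: [{ descriptor: 'Genes' }] },
  { pmid: '2', abstract: 'Text' },
  { pmid: '3', abstract: '', mesh_headings: [] }
]

records.each do |rec|
  missing = fields_needing_pubmed(rec)
  status = missing.empty? ? 'complete' : "missing #{missing.join(', ')}"
  puts "#{rec[:pmid]}: #{status}"
end
```

If most merged records come back "complete", the extra PMID call is probably not needed for them; if the field list grows after the code review, only the constant needs updating.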