Analysis: evaluate which fields from the PubMed Record are merged or overridden in the Sciencewire/WoS Record

peetucket commented 7 years ago

We should identify in the code which metadata fields from the PubMed record are merged into the pub_hash for records that have a PMID.

This will help us decide whether the new WoS records carry enough data on their own, i.e. whether the WoS API returns enough that we no longer need a supplementary PubMed call by PMID.

Maatary commented 7 years ago

So far it seems that we take only two things from PubMed: the mesh_headings and the abstract.

see

    # We process the PubMed harvesting in ScienceWireHarvester because the PubMed data
    # supplements the ScienceWire data -- essentially we combine `SciencewireSourceRecord`
    # and `PubmedSourceRecord` into the `Publication.pub_hash`. The `SciencewireSourceRecord`
    # data include a PubMed ID (`pmid`) so that we can link the two records.
    def process_queued_pubmed_records
      return if @records_queued_for_pubmed_retrieval.empty?
      begin
        pubmed_source_record = PubmedSourceRecord.new
        pub_med_records = @pubmed_client.fetch_records_for_pmid_list(@records_queued_for_pubmed_retrieval.keys)
        Nokogiri::XML(pub_med_records).xpath('//PubmedArticle').each do |pub_doc|
          pmid = pub_doc.xpath('MedlineCitation/PMID').text
          pubmed_source_record = PubmedSourceRecord.create_pubmed_source_record(pmid, pub_doc)
          @total_new_pubmed_source_count += 1 if pubmed_source_record
          pub_hash = @records_queued_for_pubmed_retrieval[pmid][:sw_hash]
          author_ids = @records_queued_for_pubmed_retrieval[pmid][:authors]
          pub = create_new_harvested_pub(pub_hash[:sw_id], pmid)
          abstract = pubmed_source_record.extract_abstract_from_pubmed_record(pub_doc)
          mesh = pubmed_source_record.extract_mesh_headings_from_pubmed_record(pub_doc)

          pub_hash[:mesh_headings] = mesh unless mesh.blank?
          pub_hash[:abstract] = abstract unless abstract.blank?

          create_contribs_for_author_ids_and_pub(author_ids, pub)
          pub.pub_hash = pub_hash
          pub.sync_publication_hash_and_db
          pub.save
        end
      rescue => e
        NotificationManager.error(e, 'PubMed harvesting failed', self)
      end
      @records_queued_for_pubmed_retrieval.clear
    end

in ScienceWireHarvester.
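
The merge step in that method can be sketched in plain Ruby as below. This is only an illustration of the "copy unless blank" behaviour, not the project's code: `blank_value?` and `merge_pubmed_fields` are hypothetical helpers (the former stands in for ActiveSupport's `blank?`), and the sample hash values are made up.

```ruby
# Hypothetical stand-in for ActiveSupport's `blank?`, so the sketch runs without Rails.
def blank_value?(value)
  return true if value.nil?
  return value.strip.empty? if value.is_a?(String)

  value.respond_to?(:empty?) && value.empty?
end

# Hypothetical helper: copy the two PubMed-derived fields into a copy of the
# ScienceWire hash. A blank PubMed value never replaces an existing value.
def merge_pubmed_fields(sw_hash, abstract:, mesh:)
  merged = sw_hash.dup
  merged[:mesh_headings] = mesh unless blank_value?(mesh)
  merged[:abstract] = abstract unless blank_value?(abstract)
  merged
end

# Made-up sample data: PubMed returned MeSH headings but an empty abstract.
sw_hash = { sw_id: '123', title: 'Example title', abstract: 'ScienceWire abstract' }
merged = merge_pubmed_fields(sw_hash, abstract: '', mesh: [{ descriptor: 'Data Mining' }])
```

In this sketch the ScienceWire abstract survives because the PubMed abstract is blank, while the MeSH headings are added. A non-blank PubMed abstract would override the ScienceWire one.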

The call path is:

1 - sw.rake
    1.1 - task :fortnightly_harvest -> harvester.harvest_pubs_for_all_authors(starting_author_id, ending_author_id)

2 - ScienceWireHarvester
    2.1 - def harvest_pubs_for_all_authors(starting_author_id, ending_author_id = -1) -> harvest_pubs_for_authors -> process_queued_pubmed_records

This path does not cover harvesting triggered by user input from the GUI.

peetucket commented 7 years ago

I also found this method, aptly named "add_any_pubmed_data_to_hash":

https://github.com/sul-dlss/sul_pub/blob/master/app/models/publication.rb#L298-L307

This shows we are adding:

MeSH (Medical Subject Headings)
abstract
PMCID (an identifier that many if not all MEDLINE articles have -- see https://nexus.od.nih.gov/all/2015/08/31/pmid-vs-pmcid-whats-the-difference/ for the difference)
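
For the analysis itself, one way to check individual records is to classify each of these three fields as added, overridden, or unchanged when PubMed data meets a WoS-derived hash. The sketch below is a hypothetical helper, not part of sul_pub: the `:pmcid` key, the method names, and the sample values are assumptions for illustration only.

```ruby
# The three fields listed above; the :pmcid key name is an assumption.
PUBMED_FIELDS = %i[mesh_headings abstract pmcid].freeze

# True for nil and for empty strings, arrays, or hashes.
def empty_value?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

# Hypothetical helper: report what a PubMed merge would do to each field.
def classify_pubmed_fields(wos_hash, pubmed_hash)
  PUBMED_FIELDS.to_h do |field|
    wos_value = wos_hash[field]
    pubmed_value = pubmed_hash[field]
    status =
      if empty_value?(pubmed_value)
        :not_in_pubmed
      elsif empty_value?(wos_value)
        :added
      elsif wos_value == pubmed_value
        :unchanged
      else
        :overridden
      end
    [field, status]
  end
end

# Made-up sample data for illustration.
wos_hash = { abstract: 'WoS abstract' }
pubmed_hash = { abstract: 'PubMed abstract', mesh_headings: ['Data Mining'], pmcid: 'PMC3900134' }
report = classify_pubmed_fields(wos_hash, pubmed_hash)
```

A field reported as `:added` across most sampled records would suggest that the WoS data alone is not sufficient for that field.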

dazza-codes commented 7 years ago

Looking at the example MEDLINE:24551397 record in https://github.com/sul-dlss/sul_pub/issues/264#issuecomment-335582833, it does contain <MeshHeadingList> and an abstract:

      <abstracts count="1">
        <abstract>
          <abstract_text>
            <p>The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology. </p>
          </abstract_text>
        </abstract>
      </abstracts>

The identifiers it contains are the UID noted above and:

      <OtherID Source="NLM">PMC3900134</OtherID>

      <identifiers>
        <identifier type="eissn" value="1942-597X"/>
        <identifier type="pmid" value="MEDLINE:24551397"/>
      </identifiers>

Is the <OtherID Source="NLM">PMC3900134</OtherID> the PMCID?

peetucket commented 7 years ago

That definitely looks like the PMCID.
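
As a rough sketch of how these identifiers could be pulled out of such a record, the snippet below uses REXML from the Ruby standard distribution. It makes several assumptions: the `<record>` wrapper element is invented so the two fragments form one document, a PMCID is taken to be "PMC" followed by digits, and the "MEDLINE:" prefix is simply stripped to get a bare PMID.

```ruby
require 'rexml/document'

# The two fragments quoted above, wrapped in a hypothetical <record> root.
xml = <<~XML
  <record>
    <OtherID Source="NLM">PMC3900134</OtherID>
    <identifiers>
      <identifier type="eissn" value="1942-597X"/>
      <identifier type="pmid" value="MEDLINE:24551397"/>
    </identifiers>
  </record>
XML

doc = REXML::Document.new(xml)

# Treat the NLM OtherID as a PMCID only if it looks like one (assumed format: PMC + digits).
other_id = REXML::XPath.first(doc, "//OtherID[@Source='NLM']")
candidate = other_id ? other_id.text.to_s.strip : ''
pmcid = candidate.match?(/\APMC\d+\z/) ? candidate : nil

# Strip the "MEDLINE:" prefix to get a bare PMID.
pmid_node = REXML::XPath.first(doc, "//identifier[@type='pmid']")
pmid = pmid_node ? pmid_node.attributes['value'].sub(/\AMEDLINE:/, '') : nil
```

The format check matters because an `OtherID` element is not guaranteed to hold a PMCID, so any value that does not match is ignored here rather than stored.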

So the MEDLINE-only records have this data. The other question is what the WoS records that were merged with a MEDLINE record contain, because those are the records we currently supplement from PubMed. We'd like to analyze some records that we know are merged on the Web of Science side (see #257), but I haven't heard back from Rob yet. I'll ping him again to ask for examples.

sul-dlss / sul_pub

Issue #256