sul-dlss / sul_pub

SUL system for harvest and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Analysis: how are the WoS results and the MEDLINE results different at the field level #264

Closed peetucket closed 6 years ago

peetucket commented 6 years ago

This will help determine whether we need one or two different WoS Source Record tables. If the fields are different, we will likely need two source records, each with its own parsing of the different fields. If they have the same fields, it may be possible to use the same database table.

dazza-codes commented 6 years ago

Looking at some example records for Russ Altman, using the following query:

wos_client = WosClient.new(Settings.WOS.AUTH_CODE, :info);
wos_queries = WosQueries.new(wos_client);
records = wos_queries.search_by_name('Altman, Russ');
dazza-codes commented 6 years ago

An example WOS record, extracted using this utility method from PR #268

puts records.by_database['WOS'].first.to_xml
<REC r_id_disclaimer="ResearcherID data provided by Clarivate Analytics">
  <UID>WOS:000172527000005</UID>
  <static_data>
    <summary>
      <EWUID>
        <WUID coll_id="WOS"/>
        <edition value="WOS.SCI"/>
      </EWUID>
      <pub_info coverdate="NOV-DEC 2001" has_abstract="N" issue="6" pubmonth="NOV-DEC" pubtype="Journal" pubyear="2001" sortdate="2001-11-01" vol="16">
        <page begin="14" end="18" page_count="5">14-18</page>
      </pub_info>
      <titles count="6">
        <title type="source">IEEE INTELLIGENT SYSTEMS</title>
        <title type="source_abbrev">IEEE INTELL SYST</title>
        <title type="abbrev_iso">IEEE Intell. Syst.</title>
        <title type="abbrev_11">IEEE IN SYS</title>
        <title type="abbrev_29">IEEE INTELL SYST</title>
        <title type="item">Challenges for intelligent systems in biology</title>
      </titles>
      <names count="1">
        <name daisng_id="745091" seq_no="1" dais_id="10190039" reprint="Y" role="author">
          <display_name>Altman, RB</display_name>
          <full_name>Altman, RB</full_name>
          <wos_standard>Altman, RB</wos_standard>
          <first_name>RB</first_name>
          <last_name>Altman</last_name>
          <email_addr>russ.altman@stanford.edu</email_addr>
        </name>
      </names>
      <doctypes count="1">
        <doctype>Editorial Material</doctype>
      </doctypes>
      <publishers>
        <publisher>
          <address_spec addr_no="1">
            <full_address>10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1314 USA</full_address>
            <city>LOS ALAMITOS</city>
          </address_spec>
          <names count="1">
            <name addr_no="1" role="publisher" seq_no="1">
              <display_name>IEEE COMPUTER SOC</display_name>
              <full_name>IEEE COMPUTER SOC</full_name>
            </name>
          </names>
        </publisher>
      </publishers>
    </summary>
    <fullrecord_metadata>
      <languages count="1">
        <language type="primary">English</language>
      </languages>
      <normalized_languages count="1">
        <language type="primary">English</language>
      </normalized_languages>
      <normalized_doctypes count="1">
        <doctype>Editorial</doctype>
      </normalized_doctypes>
      <refs count="12"/>
      <addresses count="1">
        <address_name>
          <address_spec addr_no="1">
            <full_address>Stanford Univ, Stanford, CA 94305 USA</full_address>
            <organizations count="2">
              <organization>Stanford Univ</organization>
              <organization pref="Y">Stanford University</organization>
            </organizations>
            <city>Stanford</city>
            <state>CA</state>
            <country>USA</country>
            <zip location="AP">94305</zip>
          </address_spec>
        </address_name>
      </addresses>
      <reprint_addresses count="1">
        <address_name>
          <address_spec addr_no="1">
            <full_address>Stanford Univ, Stanford, CA 94305 USA</full_address>
            <organizations count="2">
              <organization>Stanford Univ</organization>
              <organization pref="Y">Stanford University</organization>
            </organizations>
            <city>Stanford</city>
            <state>CA</state>
            <country>USA</country>
            <zip location="AP">94305</zip>
          </address_spec>
        </address_name>
      </reprint_addresses>
      <category_info>
        <headings count="1">
          <heading>Science &amp; Technology</heading>
        </headings>
        <subheadings count="1">
          <subheading>Technology</subheading>
        </subheadings>
        <subjects count="6">
          <subject ascatype="traditional" code="EP">Computer Science, Artificial Intelligence</subject>
          <subject ascatype="traditional" code="IQ">Engineering, Electrical &amp; Electronic</subject>
          <subject ascatype="extended">Computer Science</subject>
          <subject ascatype="extended">Engineering</subject>
          <subject ascatype="traditional" code="EP">COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE</subject>
          <subject ascatype="traditional" code="IQ">ENGINEERING, ELECTRICAL &amp; ELECTRONIC</subject>
        </subjects>
      </category_info>
    </fullrecord_metadata>
    <item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" coll_id="WOS" xsi:type="itemType_wos">
      <ids avail="N">498TR</ids>
      <bib_id>16 (6): 14-18 NOV-DEC 2001</bib_id>
      <bib_pagecount type="Journal">108</bib_pagecount>
      <keywords_plus count="1">
        <keyword>GENOME</keyword>
      </keywords_plus>
    </item>
  </static_data>
  <dynamic_data>
    <citation_related>
      <tc_list>
        <silo_tc coll_id="WOS" local_count="15"/>
      </tc_list>
    </citation_related>
    <cluster_related>
      <identifiers>
        <identifier type="issn" value="1541-1672"/>
        <identifier type="eissn" value="1941-1294"/>
        <identifier type="xref_doi" value="10.1109/5254.972065"/>
      </identifiers>
    </cluster_related>
  </dynamic_data>
</REC>
dazza-codes commented 6 years ago

An example MEDLINE record, extracted using this utility method from PR #268

puts records.by_database['MEDLINE'].first.to_xml
<REC r_id_disclaimer="ResearcherID data provided by Clarivate Analytics">
  <UID>MEDLINE:24551397</UID>
  <static_data>
    <summary>
      <EWUID>
        <WUID coll_id="MEDLINE"/>
        <edition value="MEDLINE.MEDLINE"/>
      </EWUID>
      <pub_info coverdate="2013" edate="2013-11-16" has_abstract="Y" medium="Internet" model="Electronic-eCollection" pubtype="Journal" pubyear="2013" sortdate="2013-01-01" vol="2013">
        <page begin="1123" end="32">1123-32</page>
      </pub_info>
      <titles count="4">
        <title type="item">Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics.</title>
        <title type="source">AMIA ... Annual Symposium proceedings. AMIA Symposium</title>
        <title type="abbrev_iso">AMIA Annu Symp Proc</title>
        <title type="source_abbrev">AMIA Annu Symp Proc</title>
      </titles>
      <names count="2">
        <name display="Y" role="author" seq_no="1">
          <display_name>Percha, Bethany</display_name>
          <full_name>Percha, Bethany</full_name>
          <initials>B</initials>
        </name>
        <name display="Y" role="author" seq_no="2">
          <display_name>Altman, Russ B</display_name>
          <full_name>Altman, Russ B</full_name>
          <initials>RB</initials>
        </name>
      </names>
      <doctypes count="5">
        <doctype>Comparative Study</doctype>
        <doctype>Journal Article</doctype>
        <doctype>Research Support, N.I.H., Extramural</doctype>
        <doctype>Research Support, Non-U.S. Gov't</doctype>
        <doctype>Research Support, U.S. Gov't, Non-P.H.S.</doctype>
      </doctypes>
    </summary>
    <fullrecord_metadata>
      <languages count="1">
        <language type="primary">English</language>
      </languages>
      <normalized_languages count="1">
        <language type="primary">English</language>
      </normalized_languages>
      <normalized_doctypes count="2">
        <doctype>Other</doctype>
        <doctype>Article</doctype>
      </normalized_doctypes>
      <addresses count="1">
        <address_name>
          <address_spec addr_no="1">
            <full_address>Stanford University, Stanford, CA.</full_address>
          </address_spec>
        </address_name>
      </addresses>
      <category_info>
        <headings count="1">
          <heading>Science &amp; Technology</heading>
        </headings>
        <subheadings count="3">
          <subheading>Technology</subheading>
          <subheading>Physical Sciences</subheading>
          <subheading>Life Sciences &amp; Biomedicine</subheading>
        </subheadings>
        <subjects count="10">
          <subject ascatype="traditional">INFORMATION SCIENCE LIBRARY SCIENCE</subject>
          <subject ascatype="traditional" code="PQ">MATHEMATICS</subject>
          <subject ascatype="traditional" code="PT">MEDICAL INFORMATICS</subject>
          <subject ascatype="traditional">COMPUTER SCIENCE INTERDISCIPLINARY APPLICATIONS</subject>
          <subject ascatype="traditional">PHARMACOLOGY PHARMACY</subject>
          <subject ascatype="extended">Information Science &amp; Library Science</subject>
          <subject ascatype="extended">Mathematics</subject>
          <subject ascatype="extended">Medical Informatics</subject>
          <subject ascatype="extended">Computer Science</subject>
          <subject ascatype="extended">Pharmacology &amp; Pharmacy</subject>
        </subjects>
      </category_info>
      <fund_ack>
        <grants complete="Y" count="6">
          <grant>
            <grant_agency>NIGMS NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>R24 GM061374</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>GM</acronym>
          </grant>
          <grant>
            <grant_agency>NIMH NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>P50 MH094267</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>MH</acronym>
          </grant>
          <grant>
            <grant_agency>NIMH NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>MH094267</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>MH</acronym>
          </grant>
          <grant>
            <grant_agency>NIGMS NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>GM61374</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>GM</acronym>
          </grant>
          <grant>
            <grant_agency>NLM NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>T15 LM007033</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>LM</acronym>
          </grant>
          <grant>
            <grant_agency>NIGMS NIH HHS</grant_agency>
            <grant_ids count="1">
              <grant_id>U01 GM061374</grant_id>
            </grant_ids>
            <country>United States</country>
            <acronym>GM</acronym>
          </grant>
        </grants>
      </fund_ack>
      <abstracts count="1">
        <abstract>
          <abstract_text>
            <p>The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology. </p>
          </abstract_text>
        </abstract>
      </abstracts>
    </fullrecord_metadata>
    <item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Owner="NLM" Status="MEDLINE" coll_id="MEDLINE" xsi:type="itemType_medline">
      <MedlineJournalInfo>
        <Country>United States</Country>
        <NlmUniqueID>101209213</NlmUniqueID>
        <ISSNLinking>1559-4076</ISSNLinking>
      </MedlineJournalInfo>
      <DateCreated>2014-02-19</DateCreated>
      <DateCompleted>2014-05-26</DateCompleted>
      <DateRevised>2016-10-19</DateRevised>
      <Affiliation>Stanford University, Stanford, CA.</Affiliation>
      <CitationSubset>IM</CitationSubset>
      <CommentsCorrectionsList>
        <CommentsCorrections RefType="Cites">
          <RefSource>Nucleic Acids Res. 2002 Jan 1;30(1):163-5</RefSource>
          <PMID Version="1">11752281</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>Psychol Rev. 2007 Jan;114(1):1-37</RefSource>
          <PMID Version="1">17227180</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>Bioinformatics. 2007 Feb 1;23(3):365-71</RefSource>
          <PMID Version="1">17142812</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>Pac Symp Biocomput. 2012;:410-21</RefSource>
          <PMID Version="1">22174296</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>J Biomed Inform. 2010 Apr;43(2):240-56</RefSource>
          <PMID Version="1">19761870</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>J Biomed Inform. 2010 Dec;43(6):1009-19</RefSource>
          <PMID Version="1">20723615</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>J Biomed Inform. 2011 Feb;44(1):163-79</RefSource>
          <PMID Version="1">20647054</PMID>
        </CommentsCorrections>
        <CommentsCorrections RefType="Cites">
          <RefSource>J Biomed Inform. 2009 Apr;42(2):390-405</RefSource>
          <PMID Version="1">19232399</PMID>
        </CommentsCorrections>
      </CommentsCorrectionsList>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D000043">Abstracting and Indexing as Topic</DescriptorName>
          <QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="N">L01.453.245.100</TreeCode>
            <TreeCode MajorTopicYN="Y">Y33</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D000465">Algorithms</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="N">G17.035</TreeCode>
            <TreeCode MajorTopicYN="N">L01.224.050</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="Y" UI="D064229">Biological Ontologies</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="Y">L01.224.050.375.480.500</TreeCode>
            <TreeCode MajorTopicYN="Y">L01.313.500.750.300.550.500</TreeCode>
            <TreeCode MajorTopicYN="Y">L01.453.245.945.079</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D057225">Data Mining</DescriptorName>
          <QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="N">L01.313.500.750.280.199</TreeCode>
            <TreeCode MajorTopicYN="N">L01.470.625</TreeCode>
            <TreeCode MajorTopicYN="Y">Y33</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D016239">MEDLINE</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="N">L01.313.500.750.280.710.500</TreeCode>
            <TreeCode MajorTopicYN="N">L01.313.500.750.280.750.500</TreeCode>
            <TreeCode MajorTopicYN="N">L01.313.500.750.300.188.300.650.500</TreeCode>
            <TreeCode MajorTopicYN="N">L01.313.500.750.300.710.500</TreeCode>
            <TreeCode MajorTopicYN="N">L01.313.500.750.300.742.650.500</TreeCode>
            <TreeCode MajorTopicYN="N">L01.470.750.500.650.500</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N" UI="D009323">Natural Language Processing</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="N">L01.224.050.375.580</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="Y" UI="D010597">Pharmacogenetics</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="Y">H01.158.273.343.750</TreeCode>
            <TreeCode MajorTopicYN="Y">H01.158.703.052</TreeCode>
            <TreeCode MajorTopicYN="Y">H02.628.479</TreeCode>
          </TreeCodeList>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="Y" UI="D012660">Semantics</DescriptorName>
          <TreeCodeList>
            <TreeCode MajorTopicYN="Y">L01.143.506.598.745</TreeCode>
          </TreeCodeList>
        </MeshHeading>
      </MeshHeadingList>
      <OtherID Source="NLM">PMC3900134</OtherID>
    </item>
  </static_data>
  <dynamic_data>
    <citation_related>
      <tc_list>
        <silo_tc coll_id="MEDLINE" local_count="0"/>
      </tc_list>
    </citation_related>
    <cluster_related>
      <identifiers>
        <identifier type="eissn" value="1942-597X"/>
        <identifier type="pmid" value="MEDLINE:24551397"/>
      </identifiers>
    </cluster_related>
  </dynamic_data>
</REC>
dazza-codes commented 6 years ago

A useful quick util method to extract the XML node names, recursively, into a nested hash:

# Recursively map a node's name to the names of its children.
def child_names(node)
  return if node.nil?
  { node.name => node.children.map { |c| child_names(c) } }
end
medline_rec = records.by_database('MEDLINE').sample(1).first;
medline_rec.search('UID').text
#=> "MEDLINE:18229697"
medline_fields = child_names(medline_rec);
medline_fields.keys
#=> ["REC"]
medline_fields['REC'].map(&:keys)
#=> [["UID"], ["static_data"], ["dynamic_data"]]

wos_rec = records.by_database('WOS').sample(1).first;
wos_rec.search('UID').text
#=> "WOS:000342763900023"
wos_fields = child_names(wos_rec);
wos_fields.keys
#=> ["REC"]
wos_fields['REC'].map(&:keys)
#=> [["UID"], ["static_data"], ["dynamic_data"]]
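
To go one step beyond comparing top-level names, the same recursive idea could collect full element paths and diff them between two records. This is only a sketch: it uses REXML from the standard Ruby distribution instead of the Nokogiri calls used elsewhere in this thread, so that it runs without extra gems. The `element_paths` helper is hypothetical, and the two XML strings are heavily trimmed stand-ins for the example records above.

```ruby
require 'rexml/document'
require 'set'

# Hypothetical helper: collect the set of element paths in a record,
# e.g. "REC/static_data/summary". Attributes and text are ignored.
def element_paths(element, prefix = nil, paths = Set.new)
  path = [prefix, element.name].compact.join('/')
  paths << path
  element.elements.each { |child| element_paths(child, path, paths) }
  paths
end

# Trimmed stand-ins for the WOS and MEDLINE example records (structure only).
wos_xml = <<~XML
  <REC>
    <UID>WOS:000172527000005</UID>
    <static_data>
      <summary>
        <titles count="1"><title type="item"/></titles>
        <publishers><publisher/></publishers>
      </summary>
    </static_data>
  </REC>
XML

medline_xml = <<~XML
  <REC>
    <UID>MEDLINE:24551397</UID>
    <static_data>
      <summary>
        <titles count="1"><title type="item"/></titles>
      </summary>
      <fullrecord_metadata><abstracts count="1"/></fullrecord_metadata>
    </static_data>
  </REC>
XML

wos_paths = element_paths(REXML::Document.new(wos_xml).root)
medline_paths = element_paths(REXML::Document.new(medline_xml).root)

common = wos_paths & medline_paths        # paths present in both records
wos_only = wos_paths - medline_paths      # paths only in the WOS record
medline_only = medline_paths - wos_paths  # paths only in the MEDLINE record

puts "common:       #{common.to_a.sort.join(', ')}"
puts "WOS only:     #{wos_only.to_a.sort.join(', ')}"
puts "MEDLINE only: #{medline_only.to_a.sort.join(', ')}"
```

Running this over many records, rather than one sample from each database, would be needed before drawing conclusions about which paths are reliably present.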
dazza-codes commented 6 years ago

See also http://ipscience-help.thomsonreuters.com/wosWebServicesExpanded/WebServicesExpandedOverviewGroup/Introduction/sampleResponse.html

Both records have a common framework:

<REC r_id_disclaimer="ResearcherID data provided by Clarivate Analytics">
  <UID/>
  <static_data>
    <summary>
      <EWUID>
        <WUID coll_id="{DATABASE}"/>
        <edition value="{EDITION}"/>
      </EWUID>
      <pub_info>
      </pub_info>
      <titles count="{N}">
        <title type="{type}">{text}</title>
      </titles>
      <names count="{N}">
        <name display="Y" role="author" seq_no="{N}">
          <display_name>{text}</display_name>
          <full_name>{text}</full_name>
          <initials>{text}</initials>
        </name>
      </names>
      <doctypes count="{N}">
        <doctype>{doctype text}</doctype>
      </doctypes>
    </summary>
    <fullrecord_metadata>
      {content may vary by database ??}
    </fullrecord_metadata>
  </static_data>
  <dynamic_data>
    <citation_related/>
    <cluster_related>
      <identifiers>
        <identifier type="XYZ" value="123"/>
      </identifiers>
    </cluster_related>
  </dynamic_data>
</REC>
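
As an illustration of what the common framework gives us, here is a minimal sketch that pulls the shared `<summary>` fields out of one record. It assumes REXML (bundled with Ruby) rather than Nokogiri so it runs standalone. The `summary_fields` helper and its hash keys are hypothetical, and the sample XML is a trimmed copy of the WOS example above.

```ruby
require 'rexml/document'

# Hypothetical helper: extract the fields that the common framework
# suggests are available for both WOS and MEDLINE records.
def summary_fields(xml)
  doc = REXML::Document.new(xml)
  summary = '/REC/static_data/summary'
  uid = REXML::XPath.first(doc, '/REC/UID')
  wuid = REXML::XPath.first(doc, "#{summary}/EWUID/WUID")
  titles = REXML::XPath.match(doc, "#{summary}/titles/title")
  authors = REXML::XPath.match(doc, "#{summary}/names/name[@role='author']/display_name")
  doctypes = REXML::XPath.match(doc, "#{summary}/doctypes/doctype")
  {
    uid: uid && uid.text,
    database: wuid && wuid.attributes['coll_id'],
    titles: titles.to_h { |t| [t.attributes['type'], t.text] }, # type => text
    authors: authors.map(&:text),
    doctypes: doctypes.map(&:text)
  }
end

# Trimmed copy of the WOS example record.
wos_xml = <<~XML
  <REC>
    <UID>WOS:000172527000005</UID>
    <static_data>
      <summary>
        <EWUID><WUID coll_id="WOS"/><edition value="WOS.SCI"/></EWUID>
        <titles count="2">
          <title type="source">IEEE INTELLIGENT SYSTEMS</title>
          <title type="item">Challenges for intelligent systems in biology</title>
        </titles>
        <names count="1">
          <name seq_no="1" role="author"><display_name>Altman, RB</display_name></name>
        </names>
        <doctypes count="1"><doctype>Editorial Material</doctype></doctypes>
      </summary>
    </static_data>
  </REC>
XML

fields = summary_fields(wos_xml)
puts fields.inspect
```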

The WOS record has additional data in the <summary> for publishers.

      <publishers>
        <publisher>
          <address_spec addr_no="{N}">
            <full_address>{text}</full_address>
            <city>{text}</city>
          </address_spec>
          <names count="{N}">
            <name addr_no="{N}" role="publisher" seq_no="{N}">
              <display_name>{text}</display_name>
              <full_name>{text}</full_name>
            </name>
          </names>
        </publisher>
      </publishers>

All of the WOS docs contained publishers, while none of the MEDLINE docs contained publishers, e.g.

> records.by_database('WOS').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/summary/publishers') }.count
=> 411
> records.by_database('WOS').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/summary/publishers') }.compact.count
=> 411
> records.by_database('MEDLINE').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/summary') }.compact.count
=> 53
> records.by_database('MEDLINE').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/summary/publishers') }.compact.count
=> 0

Similar XML queries were used to confirm the common set of XML elements for WOS and MEDLINE records.
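
The repeated one-liners above could be folded into a small helper. This is a sketch only: `count_with` is a hypothetical name, it takes plain XML strings rather than the `records` object from this thread, and it uses REXML from the standard Ruby distribution in place of Nokogiri so it runs without extra gems.

```ruby
require 'rexml/document'

# Hypothetical helper: count how many XML record strings contain
# at least one node matching the given XPath.
def count_with(xml_records, xpath)
  xml_records.count do |xml|
    doc = REXML::Document.new(xml)
    !REXML::XPath.first(doc, xpath).nil?
  end
end

# Two tiny stand-in records: one with <publishers>, one without.
records_xml = [
  '<REC><static_data><summary><publishers/></summary></static_data></REC>',
  '<REC><static_data><summary/></static_data></REC>'
]

with_summary = count_with(records_xml, '/REC/static_data/summary')
with_publishers = count_with(records_xml, '/REC/static_data/summary/publishers')

puts "#{with_summary} of #{records_xml.size} records have a summary"
puts "#{with_publishers} of #{records_xml.size} records have publishers"
```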

dazza-codes commented 6 years ago

The <fullrecord_metadata> elements are not consistent, e.g. the <abstracts> element is not always available:

> records.rec_nodes.map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/fullrecord_metadata/abstracts') }.compact.count
=> 293 (of 464)
> records.by_database('WOS').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/fullrecord_metadata/abstracts') }.compact.count
=> 243 (of 411)
> records.by_database('MEDLINE').map {|rec| doc = Nokogiri::XML(rec.to_xml); doc.at('/REC/static_data/fullrecord_metadata/abstracts') }.compact.count
=> 50 (of 53)
dazza-codes commented 6 years ago

Analysis of these records could go on for a bit longer, but the evidence so far is that both WOS and MEDLINE records contain a common subset of /REC/static_data/summary data that is consistently available. The rest of the data, especially in /REC/static_data/fullrecord_metadata, can vary record by record: it is not necessarily consistent across databases, and some databases define fields that others do not.

We need additional :eyes: on these data fields and a discussion of the implications for persisting the data in one or more SQL tables. Given that we store a chunk of XML text anyway, I’m leaning toward a single WebOfScienceSourceRecord table. If we want to handle records from different databases differently, we could create different class wrappers for that purpose, with convenience methods on WebOfScienceSourceRecord to return an object of the most useful record type (see e.g. #275, which could be a super-class to db-specific sub-classes). When storing the XML text, we could extract the UID field and possibly a Database field that holds the "namespace" or database-id taken from the UID prefix (this was merged in #269).
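
A rough sketch of the single-table-plus-wrappers idea, in plain Ruby so it runs without Rails. Everything here is hypothetical: `SourceRecord` is a Struct standing in for a row in the proposed table, `BaseRecord`, `WosRecord` and `MedlineRecord` are illustrative wrapper names, and treating the part of a MEDLINE UID after the prefix as the PMID is an assumption based on the example record above. It is not the code from #269 or #275.

```ruby
# Hypothetical wrapper super-class; db-specific sub-classes add behaviour.
class BaseRecord
  attr_reader :source

  def initialize(source)
    @source = source
  end
end

class WosRecord < BaseRecord
end

class MedlineRecord < BaseRecord
  # Assumption: the UID suffix is the PMID, as in "MEDLINE:24551397".
  def pmid
    source.uid.split(':', 2).last
  end
end

# Map the database-id taken from the UID prefix to a wrapper class.
WRAPPERS = { 'WOS' => WosRecord, 'MEDLINE' => MedlineRecord }.freeze

# Stand-in for one row: raw XML text plus the extracted uid and database.
SourceRecord = Struct.new(:uid, :database, :source_data) do
  # Build a row, deriving the database from the UID prefix.
  def self.from_uid(uid, xml)
    new(uid, uid.split(':', 2).first, xml)
  end

  # Return the most specific wrapper, falling back to BaseRecord.
  def wrapper
    WRAPPERS.fetch(database, BaseRecord).new(self)
  end
end

medline = SourceRecord.from_uid('MEDLINE:24551397', '<REC/>')
wos = SourceRecord.from_uid('WOS:000172527000005', '<REC/>')
other = SourceRecord.from_uid('XYZ:123', '<REC/>')

puts medline.database        # database-id from the UID prefix
puts medline.wrapper.class   # db-specific wrapper
puts other.wrapper.class     # unknown databases fall back to BaseRecord
```

The point of the fallback is that a record from a database we have not written a wrapper for can still be stored and read through the common interface.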